Designed and implemented an end-to-end LLM + RAG analytics pipeline using PyTorch, PySpark, and large news corpora to analyze financial sentiment and predict short-term stock movement. Integrated segmentation, contextual retrieval, prompt engineering, and quantized inference into a scalable data-science system capable of both batch and streaming analysis.
The pipeline processed millions of news articles on distributed Spark workers, performing entity-specific segmentation (company, sector, event type), filtering out noise, and combining headlines with context retrieved from a RAG knowledge store. Each enriched sample was then fed through a quantized LLM sentiment model trained with PyTorch, reaching 73% accuracy on financial sentiment classification.
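The segmentation and noise-filtering step can be sketched as below. This is an illustrative, keyword-based sketch, not the project's actual code: the sector keywords, event types, and noise patterns are all hypothetical placeholders.

```python
# Hypothetical sketch: entity-specific segmentation (company, sector,
# event type) plus noise filtering for news headlines. All keyword
# tables below are made-up examples, not the project's real rules.
import re
from dataclasses import dataclass, field

SECTOR_KEYWORDS = {
    "tech": {"chip", "software", "cloud", "ai"},
    "energy": {"oil", "gas", "renewable", "barrel"},
}
EVENT_KEYWORDS = {
    "earnings": {"earnings", "revenue", "profit", "guidance"},
    "m&a": {"acquisition", "merger", "buyout"},
}
# Patterns treated as non-financial noise and dropped before inference.
NOISE_PATTERNS = [re.compile(p, re.I) for p in (r"\bhoroscope\b", r"\blottery\b")]


@dataclass
class Segment:
    companies: list = field(default_factory=list)
    sectors: list = field(default_factory=list)
    events: list = field(default_factory=list)


def is_noise(headline: str) -> bool:
    return any(p.search(headline) for p in NOISE_PATTERNS)


def segment(headline: str, known_companies: set) -> Segment:
    """Tag a headline with matching companies, sectors, and event types."""
    tokens = set(re.findall(r"[a-z&]+", headline.lower()))
    seg = Segment()
    seg.companies = [c for c in known_companies if c.lower() in headline.lower()]
    seg.sectors = [s for s, kw in SECTOR_KEYWORDS.items() if tokens & kw]
    seg.events = [e for e, kw in EVENT_KEYWORDS.items() if tokens & kw]
    return seg
```

In the full pipeline each of these functions would run inside a Spark transformation, so noise is dropped before any expensive LLM call is made.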
Inspired by the System for Data Science course, the pipeline incorporated cluster-level coordination, distributed scheduling, and cached batch execution to improve throughput. LLM inference was dispatched across worker nodes via Spark UDFs, while caching prevented repeated model loads on each executor. A custom retrieval module combined embeddings, BM25 search, and metadata filtering to surface relevant context for each financial event.
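The BM25-plus-metadata portion of such a retrieval module might look like the following. This is a minimal standalone sketch, assuming simple whitespace tokenization and a `meta` dict per document; the embedding stage is omitted, and the document contents and field names are invented for illustration.

```python
# Sketch of Okapi BM25 ranking with metadata pre-filtering, in the
# spirit of the retrieval module described above. Documents and the
# "sector" metadata field are hypothetical examples.
import math
from collections import Counter


def bm25_rank(query, docs, meta_filter=None, k1=1.5, b=0.75):
    """Rank docs (list of {"text": ..., "meta": {...}}) against a query.

    meta_filter: optional predicate over a doc's metadata dict, applied
    before scoring so only relevant documents enter the ranked pool.
    Returns a list of (score, doc) pairs, highest score first.
    """
    pool = [d for d in docs if meta_filter is None or meta_filter(d["meta"])]
    tokenized = [d["text"].lower().split() for d in pool]
    if not tokenized:
        return []
    n = len(tokenized)
    avgdl = sum(len(t) for t in tokenized) / n
    df = Counter()  # document frequency per term
    for toks in tokenized:
        df.update(set(toks))
    q_terms = query.lower().split()
    scored = []
    for doc, toks in zip(pool, tokenized):
        tf = Counter(toks)
        score = 0.0
        for t in q_terms:
            if t not in tf:
                continue
            idf = math.log((n - df[t] + 0.5) / (df[t] + 0.5) + 1)
            score += idf * tf[t] * (k1 + 1) / (
                tf[t] + k1 * (1 - b + b * len(toks) / avgdl)
            )
        scored.append((score, doc))
    scored.sort(key=lambda pair: -pair[0])
    return scored
```

In practice this lexical score would be fused with embedding similarity, but even alone it shows how metadata filtering narrows the candidate pool before ranking.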
Evaluations were recorded in the Notion project log, including latency measurements, Spark shuffle optimizations, quantization experiments, RAG ablation tests, and retrieval-augmentation quality comparisons. Visual dashboards tracked segment-specific accuracy, prediction confidence, failure cases, and the impact of context-window size on final LLM outputs.
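The segment-level accuracy metric feeding such dashboards reduces to a small aggregation. A minimal sketch, with made-up record shapes and segment names:

```python
# Hypothetical sketch of per-segment accuracy aggregation, as a
# dashboard backend might compute it. Record layout is assumed:
# (segment_name, predicted_label, true_label).
from collections import defaultdict


def accuracy_by_segment(records):
    """Return {segment: accuracy} over (segment, predicted, true) triples."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for seg, pred, true in records:
        totals[seg] += 1
        hits[seg] += int(pred == true)
    return {seg: hits[seg] / totals[seg] for seg in totals}
```

The same grouping pattern extends to confidence histograms or failure-case counts by swapping the aggregated quantity.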