Designed and implemented an end-to-end LLM + RAG analytics pipeline using PyTorch, PySpark, and large news corpora to analyze financial sentiment and predict short-term stock movement. Integrated segmentation, contextual retrieval, prompt engineering, and quantized inference into a scalable data-science system capable of both batch and streaming analysis.
The pipeline processed millions of news articles on distributed Spark workers, performing entity-specific segmentation (company, sector, event type), filtering out noise, and combining headlines with context retrieved from a RAG knowledge store. Each enriched sample was then fed through a quantized LLM sentiment model trained with PyTorch, reaching 73% accuracy on financial sentiment classification.
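The segmentation and noise-filtering step can be sketched as below. This is an illustrative, keyword-based sketch, not the project's actual code: the sector keywords, event types, and noise patterns are all hypothetical placeholders.

```python
# Hypothetical sketch: entity-specific segmentation (company, sector,
# event type) plus noise filtering for news headlines. All keyword
# tables below are made-up examples, not the project's real rules.
import re
from dataclasses import dataclass, field

SECTOR_KEYWORDS = {
    "tech": {"chip", "software", "cloud", "ai"},
    "energy": {"oil", "gas", "renewable", "barrel"},
}
EVENT_KEYWORDS = {
    "earnings": {"earnings", "revenue", "profit", "guidance"},
    "m&a": {"acquisition", "merger", "buyout"},
}
# Patterns treated as non-financial noise and dropped before inference.
NOISE_PATTERNS = [re.compile(p, re.I) for p in (r"\bhoroscope\b", r"\blottery\b")]


@dataclass
class Segment:
    companies: list = field(default_factory=list)
    sectors: list = field(default_factory=list)
    events: list = field(default_factory=list)


def is_noise(headline: str) -> bool:
    return any(p.search(headline) for p in NOISE_PATTERNS)


def segment(headline: str, known_companies: set) -> Segment:
    """Tag a headline with matching companies, sectors, and event types."""
    tokens = set(re.findall(r"[a-z&]+", headline.lower()))
    seg = Segment()
    seg.companies = [c for c in known_companies if c.lower() in headline.lower()]
    seg.sectors = [s for s, kw in SECTOR_KEYWORDS.items() if tokens & kw]
    seg.events = [e for e, kw in EVENT_KEYWORDS.items() if tokens & kw]
    return seg
```

In the full pipeline each of these functions would run inside a Spark transformation, so noise is dropped before any expensive LLM call is made.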
Inspired by the System for Data Science course, the pipeline incorporated cluster-level coordination, distributed scheduling, and cached batch execution to improve throughput. LLM inference was dispatched across worker nodes via Spark UDFs, while caching prevented repeated model loads on each executor. A custom retrieval module combined embeddings, BM25 search, and metadata filtering to surface relevant context for each financial event.
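The BM25-plus-metadata portion of such a retrieval module might look like the following. This is a minimal standalone sketch, assuming simple whitespace tokenization and a `meta` dict per document; the embedding stage is omitted, and the document contents and field names are invented for illustration.

```python
# Sketch of Okapi BM25 ranking with metadata pre-filtering, in the
# spirit of the retrieval module described above. Documents and the
# "sector" metadata field are hypothetical examples.
import math
from collections import Counter


def bm25_rank(query, docs, meta_filter=None, k1=1.5, b=0.75):
    """Rank docs (list of {"text": ..., "meta": {...}}) against a query.

    meta_filter: optional predicate over a doc's metadata dict, applied
    before scoring so only relevant documents enter the ranked pool.
    Returns a list of (score, doc) pairs, highest score first.
    """
    pool = [d for d in docs if meta_filter is None or meta_filter(d["meta"])]
    tokenized = [d["text"].lower().split() for d in pool]
    if not tokenized:
        return []
    n = len(tokenized)
    avgdl = sum(len(t) for t in tokenized) / n
    df = Counter()  # document frequency per term
    for toks in tokenized:
        df.update(set(toks))
    q_terms = query.lower().split()
    scored = []
    for doc, toks in zip(pool, tokenized):
        tf = Counter(toks)
        score = 0.0
        for t in q_terms:
            if t not in tf:
                continue
            idf = math.log((n - df[t] + 0.5) / (df[t] + 0.5) + 1)
            score += idf * tf[t] * (k1 + 1) / (
                tf[t] + k1 * (1 - b + b * len(toks) / avgdl)
            )
        scored.append((score, doc))
    scored.sort(key=lambda pair: -pair[0])
    return scored
```

In practice this lexical score would be fused with embedding similarity, but even alone it shows how metadata filtering narrows the candidate pool before ranking.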
Evaluations were recorded in the Notion project log, including latency measurements, Spark shuffle optimizations, quantization experiments, RAG ablation tests, and retrieval-augmentation quality comparisons. Visual dashboards tracked segment-specific accuracy, prediction confidence, failure cases, and the impact of context-window size on final LLM outputs.
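The segment-level accuracy metric feeding such dashboards reduces to a small aggregation. A minimal sketch, with made-up record shapes and segment names:

```python
# Hypothetical sketch of per-segment accuracy aggregation, as a
# dashboard backend might compute it. Record layout is assumed:
# (segment_name, predicted_label, true_label).
from collections import defaultdict


def accuracy_by_segment(records):
    """Return {segment: accuracy} over (segment, predicted, true) triples."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for seg, pred, true in records:
        totals[seg] += 1
        hits[seg] += int(pred == true)
    return {seg: hits[seg] / totals[seg] for seg in totals}
```

The same grouping pattern extends to confidence histograms or failure-case counts by swapping the aggregated quantity.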