Real-time market data pipeline processing 15,000+ events per second with sub-100ms latency
Stock market data is one of the hardest domains for data engineering: high volume, zero tolerance for duplicates, strict latency requirements, and burst traffic at market open that can spike 8-10x in under 60 seconds. Most portfolio projects sidestep these challenges. TradePulse confronts them directly.
TradePulse ingests live market trades and quotes from Polygon.io via WebSocket, streams them through Apache Kafka, processes windowed aggregations (VWAP, volume z-score, price momentum) using Faust, runs real-time anomaly detection using Isolation Forest, and stores results in a hot/cold AWS architecture (DynamoDB for recent data, S3 + Parquet for historical). A second stream ingests real-time news headlines from Finnhub, scores each headline using VADER sentiment analysis in under 1ms, and correlates sentiment with volume z-score using a 60-second temporal join, surfacing high-confidence news-driven market events in real time.
The engineering patterns in TradePulse (exactly-once semantics, dead letter queues, backpressure mechanisms, write sharding) are the same patterns used by Robinhood, Coinbase, and the data infrastructure teams at Google and Amazon. This project exists to demonstrate that these patterns can be understood, implemented, and explained by a single engineer.
DLQ, backpressure, exactly-once, write sharding: not just happy-path code.
Architecture, runbook, schema, postmortem, benchmarks: all included.
Main stream: trades
News stream: sentiment + correlation (join on ticker + 60s window)
git clone https://github.com/nikhilgiridharan/TradePulse
cd TradePulse
Copy .env.example to .env and set POLYGON_API_KEY, AWS credentials, and AWS_REGION (e.g. us-east-1).
cp .env.example .env
make up
Starts Kafka, Zookeeper, Schema Registry, Producer, Processing, and API.
curl http://localhost:8000/health
Expected: {"status":"healthy", ...}
Returns the latest trade for a given ticker symbol.
| Parameter | Description |
|---|---|
| ticker (required) | Stock symbol, e.g. AAPL, TSLA |
Cache: 1 second
curl http://localhost:8000/quotes/AAPL
Response 200: ticker, price, volume, timestamp, sequence_number
Real-time VWAP, rolling averages, volume z-score, and price momentum.
Cache: 5 seconds
volume_zscore > 2.0 indicates unusual volume; > 3.0 indicates extreme activity.
Anomaly detections from the Isolation Forest model for the last N hours.
Query param: hours (default 24, max 168). Cache: none.
anomaly_score: more negative = more anomalous. Threshold: -0.500.
Current real-time feature vector from the feature store. Cache: 1 second.
Pipeline health: status, consumer_lag, dynamo_latency_p99, uptime_seconds, checked_at.
status: "healthy" = normal; "degraded" = elevated latency; "unhealthy" = not processing.
Returns recent news sentiment analysis with market correlation data. Shows which headlines coincided with unusual volume activity: the output of the two-stream join.
| Parameter | Description |
|---|---|
| ticker (required) | Stock symbol (AAPL, MSFT, AMZN, TSLA, NVDA) |
| hours (optional) | Lookback window in hours (default 24, max 168) |
Cache: none (always fresh)
curl http://localhost:8000/sentiment/NVDA?hours=4
correlation_strength values: strong | moderate | weak | none. Strong correlations are the highest-signal events in the system.
The Problem: Retries cause duplicate writes; the same trade written twice corrupts every aggregation downstream.
The Solution (3 layers):
1. Kafka producer with enable.idempotence = True.
2. Faust processing_guarantee = 'exactly_once'.
3. DynamoDB ConditionExpression: write only if the item does not already exist; treat ConditionalCheckFailedException as success.

Crash scenario: Faust writes to DynamoDB, then crashes before committing the offset. On replay, DynamoDB rejects the duplicate write and the offset is committed. Result: processed exactly once.
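Layer 3 can be sketched as follows. This is a hypothetical helper, not the project's actual code: it assumes a boto3-style table object whose put_item accepts a ConditionExpression, a partition-key attribute named "pk", and it treats the ConditionalCheckFailedException error code as an already-applied write.

```python
def idempotent_put(table, item):
    """Write item only if its primary key is unseen; a replayed duplicate is a no-op.

    `table` is any object with boto3's Table.put_item signature.
    """
    try:
        table.put_item(
            Item=item,
            ConditionExpression="attribute_not_exists(pk)",  # reject replays
        )
        return "written"
    except Exception as err:
        code = getattr(err, "response", {}).get("Error", {}).get("Code", "")
        if code == "ConditionalCheckFailedException":
            # Item was persisted before the crash: safe to commit the offset.
            return "duplicate"
        raise
```

On replay after a crash, the second call returns "duplicate" instead of raising, so the consumer can commit its offset and move on.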
Invalid or failed messages go to an SQS DLQ with full context. Retry strategy: 3 attempts → SQS DLQ → retry every 15 minutes → archive to S3 after 5 DLQ retries. Fifteen minutes gives transient failures (network, throttling) time to resolve without hammering the system.
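The escalation above reduces to a small routing decision. A minimal sketch (function name and signature are hypothetical; the thresholds are the ones stated above):

```python
MAX_INLINE_ATTEMPTS = 3   # in-line retries before the message goes to SQS
MAX_DLQ_RETRIES = 5       # 15-minute DLQ redeliveries before archiving

def route_failed(attempts: int, dlq_retries: int) -> str:
    """Decide the next step for a failed message."""
    if attempts < MAX_INLINE_ATTEMPTS:
        return "retry_inline"
    if dlq_retries < MAX_DLQ_RETRIES:
        return "sqs_dlq"      # redelivered every 15 minutes
    return "archive_s3"       # give up; keep full context for postmortems
```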
Naive key "AAPL" routes all AAPL writes to one partition, throttling at ~800 writes/sec. Shard key: ticker#shard_0-7, where shard = hash(ticker + timestamp) % 8. Writes are distributed across 8 partitions, ~100/sec each.
```
Naive key "AAPL":            Shard key "AAPL#0".."AAPL#7":

All writes → 1 partition     Writes distributed:
[▓▓▓▓▓▓▓▓▓▓▓▓] THROTTLE     [▓▓] AAPL#0 ✓
                             [▓▓] AAPL#1 ✓
                             [▓▓] AAPL#2 ✓  ... 8 shards total
```
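The shard-key derivation can be sketched as below. One assumption worth noting: Python's built-in hash() is salted per process, so a stable digest (hashlib.md5 here) is used instead, keeping shard assignment deterministic across restarts.

```python
import hashlib

NUM_SHARDS = 8

def shard_key(ticker: str, ts_ms: int) -> str:
    """Spread one ticker's writes across 8 DynamoDB partitions."""
    digest = hashlib.md5(f"{ticker}{ts_ms}".encode()).hexdigest()
    shard = int(digest, 16) % NUM_SHARDS
    return f"{ticker}#shard_{shard}"
```

Hashing on ticker plus timestamp (rather than ticker alone) is what spreads a hot ticker's burst evenly across all 8 shards.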
Without backpressure, a slow DynamoDB causes unbounded memory growth. Solution: if recent write latency exceeds 100ms, pause consumption for 500ms. Growing Kafka consumer lag is safe (Kafka holds data durably); unbounded memory growth is not.
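A minimal sketch of that gate. The class and method names are illustrative; the 100ms/500ms thresholds come from the paragraph above, and the 20-write window is an assumption:

```python
import asyncio
from collections import deque

LATENCY_BUDGET_MS = 100   # pause when recent writes exceed this
PAUSE_MS = 500            # how long to stop pulling from Kafka

class BackpressureGate:
    """Pause consumption when recent DynamoDB write latency runs hot."""

    def __init__(self, window: int = 20):
        self.latencies = deque(maxlen=window)  # most recent write latencies (ms)

    def record(self, latency_ms: float) -> None:
        self.latencies.append(latency_ms)

    def should_pause(self) -> bool:
        if not self.latencies:
            return False
        return sum(self.latencies) / len(self.latencies) > LATENCY_BUDGET_MS

    async def maybe_pause(self) -> None:
        # Lag accumulates safely in Kafka (durable); process memory does not.
        if self.should_pause():
            await asyncio.sleep(PAUSE_MS / 1000)
```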
Hot (DynamoDB): Last 48 hours, single-digit ms reads, TTL expiry. Cold (S3 + Parquet): Historical data, Hive partitioning, Snappy compression, Athena SQL.
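A read-side router for the hot/cold split might look like this (hypothetical helper; the 48-hour boundary is the one stated above):

```python
from datetime import datetime, timedelta, timezone

HOT_WINDOW = timedelta(hours=48)  # matches the DynamoDB TTL

def storage_tier(event_time, now=None):
    """Route a read to DynamoDB (hot) or S3/Athena (cold) by event age."""
    now = now or datetime.now(timezone.utc)
    return "dynamodb" if now - event_time <= HOT_WINDOW else "s3_athena"
```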
Isolation Forest with 6 features (price, volume, volume_zscore, price_momentum, vwap_deviation, trade_frequency). Rolling retraining every 500 events on last 1000. Contamination = 0.01.
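The retrain cadence can be sketched independently of sklearn. The class below is illustrative: model_factory would be something like lambda: IsolationForest(contamination=0.01), but any object with fit/score_samples works.

```python
from collections import deque

RETRAIN_EVERY = 500   # retrain cadence, in events
WINDOW = 1000         # training window: the most recent events

class RollingDetector:
    """Retrain an anomaly model every 500 events on the last 1,000 vectors."""

    def __init__(self, model_factory):
        self.model_factory = model_factory
        self.buffer = deque(maxlen=WINDOW)
        self.model = None
        self.seen = 0

    def observe(self, features) -> None:
        self.buffer.append(features)
        self.seen += 1
        if self.seen % RETRAIN_EVERY == 0:
            self.model = self.model_factory()
            self.model.fit(list(self.buffer))

    def score(self, features) -> float:
        if self.model is None:
            return 0.0  # cold start: no verdict until the first retrain
        return self.model.score_samples([features])[0]
```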
The Problem: A news headline alone is weak signal. Most articles don't move markets. A volume spike alone is ambiguous โ it could be institutional rebalancing, an index inclusion, or random noise. Neither stream is particularly useful on its own.
The Insight: When a news article with strong sentiment AND unusual market volume occur within 60 seconds of each other, the probability that the market is reacting to the news in real time is significantly higher. This is the joint signal that matters.
How It Works:
Stream 1 โ Equity trades (market.trades): Polygon.io WebSocket โ Kafka โ Faust โ computes volume z-score per ticker. Z-score cached in memory with timestamp for fast correlation lookups.
Stream 2 โ News headlines (market.news): Finnhub API polling every 60s โ Kafka โ VADER sentiment analysis. On each article: look up ticker's z-score cache.
Join logic (60-second temporal window): IF published_at is within 60 seconds of a z-score > 2.0: correlation detected. strength = f(sentiment_magnitude, z-score_magnitude). strong: |sentiment| >= 0.5 AND z-score >= 3.0. moderate: |sentiment| >= 0.2 AND z-score >= 2.0. weak: correlated but low magnitude.
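The cache-and-join logic above can be sketched end to end. Class and function names plus the 60-sample rolling window are assumptions; the thresholds and strength labels are the ones stated in the join rules.

```python
import statistics
import time
from collections import defaultdict, deque

class ZScoreCache:
    """Stream 1: rolling volume z-score per ticker, cached for the news stream."""

    def __init__(self, window: int = 60):
        self.volumes = defaultdict(lambda: deque(maxlen=window))
        self.latest = {}  # ticker -> (zscore, epoch_seconds)

    def update(self, ticker, volume, ts=None):
        window = self.volumes[ticker]
        window.append(volume)
        if len(window) >= 2:
            stdev = statistics.pstdev(window)
            z = (volume - statistics.fmean(window)) / stdev if stdev else 0.0
            self.latest[ticker] = (z, ts if ts is not None else time.time())
        return self.latest.get(ticker)

def correlation_strength(sentiment: float, zscore: float, within_60s: bool) -> str:
    """Stream 2 join: classify a headline/volume coincidence."""
    if not within_60s or zscore <= 2.0:
        return "none"
    if abs(sentiment) >= 0.5 and zscore >= 3.0:
        return "strong"
    if abs(sentiment) >= 0.2:
        return "moderate"
    return "weak"  # correlated, but low magnitude
```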
Why VADER over FinBERT: VADER: <1ms inference, rule-based, no GPU required. FinBERT: 200-400ms inference, requires a GPU for production throughput. For real-time stream processing on a single server, VADER's speed tradeoff is correct. FinBERT could enrich high-signal events asynchronously without blocking the main stream, a natural future extension.
Why 60 seconds: Algorithmic traders typically react to headlines within seconds. Human traders within minutes. A 60-second window captures the initial market reaction while excluding longer-term price drift that may have other causes.
Chosen over Kinesis for replay capability and richer ecosystem.
Python-native stream processing; lighter than Spark for this scale.
Zero-ops managed service; conditional writes for idempotency.
Columnar format and Snappy for query efficiency and cost.
Isolation Forest for multi-dimensional anomaly detection.
Async, auto OpenAPI, Pydantic validation built-in.
One-command local stack; production parity.
Metrics, alarms, dashboards for all components.
Free real-time news API. Chosen over NewsAPI for company-level filtering and higher free tier limits. 60-second polling sufficient for news frequency.
Rule-based NLP scoring <1ms per headline. Chosen over FinBERT (200-400ms) for real-time stream latency requirements. Compound score: -1.0 to +1.0.
| Metric | Result | Status |
|---|---|---|
| Sustained throughput | 14,800 e/sec | ✓ Target |
| Peak throughput (30s burst) | 31,200 e/sec | ✓ 2.1x base |
| End-to-end latency p50 | 12ms | ✓ <100ms |
| End-to-end latency p99 | 67ms | ✓ <100ms |
| DynamoDB write latency p99 | 18ms | ✓ Healthy |
| Kafka consumer lag (sustained) | <50 messages | ✓ Normal |
| Anomaly detection inference | 0.3ms/event | ✓ Real-time |
Test methodology: Custom Python load test simulating a market-open burst across 5 tickers. Duration: 10 minutes sustained plus a 30-second burst at 3x. Measurement: from WebSocket receive to confirmed DynamoDB write.