Real-time market data pipeline processing 15,000+ events per second with sub-100ms latency
Stock market data is one of the hardest domains for data engineering: high volume, zero tolerance for duplicates, strict latency requirements, and burst traffic at market open that can spike 8-10x in under 60 seconds. Most portfolio projects sidestep these challenges. TradePulse confronts them directly.
TradePulse ingests live market trades and quotes from Polygon.io via WebSocket, streams them through Apache Kafka, processes windowed aggregations (VWAP, volume z-score, price momentum) using Faust, runs real-time anomaly detection using Isolation Forest, and stores results in a hot/cold AWS architecture (DynamoDB for recent data, S3 + Parquet for historical). A second stream ingests real-time news headlines from Finnhub, scores each headline using VADER sentiment analysis in under 1ms, and correlates sentiment with volume z-score using a 60-second temporal join, surfacing high-confidence news-driven market events in real time.
The engineering patterns in TradePulse (exactly-once semantics, dead letter queues, backpressure mechanisms, write sharding) are the same patterns used by Robinhood, Coinbase, and the data infrastructure teams at Google and Amazon. This project exists to demonstrate that these patterns can be understood, implemented, and explained by a single engineer.
DLQ, backpressure, exactly-once, write sharding: not just happy-path code.
Architecture, runbook, schema, postmortem, benchmarks: all included.
Main stream: trades
News stream: sentiment + correlation (join on ticker + 60s window)
git clone https://github.com/nikhilgiridharan/TradePulse
cd TradePulse
Copy .env.example to .env and set POLYGON_API_KEY, AWS credentials, and AWS_REGION (e.g. us-east-1).
cp .env.example .env
make up
Starts Kafka, Zookeeper, Schema Registry, Producer, Processing, and API.
curl http://localhost:8000/health
Expected: {"status":"healthy", ...}
Returns the latest trade for a given ticker symbol.
| Parameter | Description |
|---|---|
| ticker (required) | Stock symbol, e.g. AAPL, TSLA |
Cache: 1 second
curl http://localhost:8000/quotes/AAPL
Response 200: ticker, price, volume, timestamp, sequence_number
Real-time VWAP, rolling averages, volume z-score, and price momentum.
Cache: 5 seconds
volume_zscore > 2.0 indicates unusual volume; > 3.0 indicates extreme activity.
Anomaly detections from the Isolation Forest model for the last N hours.
Query param: hours (default 24, max 168). Cache: none.
anomaly_score: more negative = more anomalous. Threshold: -0.500.
Current real-time feature vector from the feature store. Cache: 1 second.
Pipeline health: status, consumer_lag, dynamo_latency_p99, uptime_seconds, checked_at.
status: "healthy" = normal; "degraded" = elevated latency; "unhealthy" = not processing.
Returns recent news sentiment analysis with market correlation data. Shows which headlines coincided with unusual volume activity: the output of the two-stream join.
| Parameter | Description |
|---|---|
| ticker (required) | Stock symbol (AAPL, MSFT, AMZN, TSLA, NVDA) |
| hours (optional) | Lookback window in hours (default 24, max 168) |
Cache: none (always fresh)
curl http://localhost:8000/sentiment/NVDA?hours=4
correlation_strength values: strong | moderate | weak | none. Strong correlations are the highest-signal events in the system.
The Problem: Retries cause duplicate writes; the same trade written twice corrupts every aggregation downstream.
The Solution (3 layers):
1. Kafka producer with enable.idempotence = True.
2. Faust processing_guarantee = 'exactly_once'.
3. DynamoDB ConditionExpression: write only if the item does not already exist; treat ConditionalCheckFailedException as success.

Crash scenario: Faust writes to DynamoDB, then crashes before committing the offset. On replay, DynamoDB rejects the duplicate write and the offset is committed. Result: processed exactly once.
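Layer 3 can be sketched as follows. This is a hypothetical helper, not the project's actual code: it assumes a boto3-style table object whose put_item accepts a ConditionExpression, a partition-key attribute named "pk", and it treats the ConditionalCheckFailedException error code as an already-applied write.

```python
def idempotent_put(table, item):
    """Write item only if its primary key is unseen; a replayed duplicate is a no-op.

    `table` is any object with boto3's Table.put_item signature.
    """
    try:
        table.put_item(
            Item=item,
            ConditionExpression="attribute_not_exists(pk)",  # reject replays
        )
        return "written"
    except Exception as err:
        code = getattr(err, "response", {}).get("Error", {}).get("Code", "")
        if code == "ConditionalCheckFailedException":
            # Item was persisted before the crash: safe to commit the offset.
            return "duplicate"
        raise
```

On replay after a crash, the second call returns "duplicate" instead of raising, so the consumer can commit its offset and move on.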
Invalid or failed messages go to an SQS DLQ with full context. Retry strategy: 3 attempts → SQS DLQ → retry every 15 minutes → archive to S3 after 5 DLQ retries. Fifteen minutes gives transient failures (network, throttling) time to resolve without hammering the system.
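The escalation above reduces to a small routing decision. A minimal sketch (function name and signature are hypothetical; the thresholds are the ones stated above):

```python
MAX_INLINE_ATTEMPTS = 3   # in-line retries before the message goes to SQS
MAX_DLQ_RETRIES = 5       # 15-minute DLQ redeliveries before archiving

def route_failed(attempts: int, dlq_retries: int) -> str:
    """Decide the next step for a failed message."""
    if attempts < MAX_INLINE_ATTEMPTS:
        return "retry_inline"
    if dlq_retries < MAX_DLQ_RETRIES:
        return "sqs_dlq"      # redelivered every 15 minutes
    return "archive_s3"       # give up; keep full context for postmortems
```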
Naive key "AAPL" routes all AAPL writes to one partition, throttling at ~800 writes/sec. Shard key: ticker#shard_0-7, where shard = hash(ticker + timestamp) % 8. Writes are distributed across 8 partitions, ~100/sec each.
```
Naive key "AAPL":            Shard key "AAPL#0".."AAPL#7":

All writes → 1 partition     Writes distributed:
[▓▓▓▓▓▓▓▓▓▓▓▓] THROTTLE     [▓▓] AAPL#0 ✓
                             [▓▓] AAPL#1 ✓
                             [▓▓] AAPL#2 ✓  ... 8 shards total
```
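The shard-key derivation can be sketched as below. One assumption worth noting: Python's built-in hash() is salted per process, so a stable digest (hashlib.md5 here) is used instead, keeping shard assignment deterministic across restarts.

```python
import hashlib

NUM_SHARDS = 8

def shard_key(ticker: str, ts_ms: int) -> str:
    """Spread one ticker's writes across 8 DynamoDB partitions."""
    digest = hashlib.md5(f"{ticker}{ts_ms}".encode()).hexdigest()
    shard = int(digest, 16) % NUM_SHARDS
    return f"{ticker}#shard_{shard}"
```

Hashing on ticker plus timestamp (rather than ticker alone) is what spreads a hot ticker's burst evenly across all 8 shards.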
Without backpressure, a slow DynamoDB causes unbounded memory growth. Solution: if recent write latency exceeds 100ms, pause consumption for 500ms. Growing Kafka consumer lag is safe (Kafka holds data durably); unbounded memory growth is not.
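A minimal sketch of that gate. The class and method names are illustrative; the 100ms/500ms thresholds come from the paragraph above, and the 20-write window is an assumption:

```python
import asyncio
from collections import deque

LATENCY_BUDGET_MS = 100   # pause when recent writes exceed this
PAUSE_MS = 500            # how long to stop pulling from Kafka

class BackpressureGate:
    """Pause consumption when recent DynamoDB write latency runs hot."""

    def __init__(self, window: int = 20):
        self.latencies = deque(maxlen=window)  # most recent write latencies (ms)

    def record(self, latency_ms: float) -> None:
        self.latencies.append(latency_ms)

    def should_pause(self) -> bool:
        if not self.latencies:
            return False
        return sum(self.latencies) / len(self.latencies) > LATENCY_BUDGET_MS

    async def maybe_pause(self) -> None:
        # Lag accumulates safely in Kafka (durable); process memory does not.
        if self.should_pause():
            await asyncio.sleep(PAUSE_MS / 1000)
```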
Hot (DynamoDB): Last 48 hours, single-digit ms reads, TTL expiry. Cold (S3 + Parquet): Historical data, Hive partitioning, Snappy compression, Athena SQL.
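A read-side router for the hot/cold split might look like this (hypothetical helper; the 48-hour boundary is the one stated above):

```python
from datetime import datetime, timedelta, timezone

HOT_WINDOW = timedelta(hours=48)  # matches the DynamoDB TTL

def storage_tier(event_time, now=None):
    """Route a read to DynamoDB (hot) or S3/Athena (cold) by event age."""
    now = now or datetime.now(timezone.utc)
    return "dynamodb" if now - event_time <= HOT_WINDOW else "s3_athena"
```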
Isolation Forest with 6 features (price, volume, volume_zscore, price_momentum, vwap_deviation, trade_frequency). Rolling retraining every 500 events on last 1000. Contamination = 0.01.
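The retrain cadence can be sketched independently of sklearn. The class below is illustrative: model_factory would be something like lambda: IsolationForest(contamination=0.01), but any object with fit/score_samples works.

```python
from collections import deque

RETRAIN_EVERY = 500   # retrain cadence, in events
WINDOW = 1000         # training window: the most recent events

class RollingDetector:
    """Retrain an anomaly model every 500 events on the last 1,000 vectors."""

    def __init__(self, model_factory):
        self.model_factory = model_factory
        self.buffer = deque(maxlen=WINDOW)
        self.model = None
        self.seen = 0

    def observe(self, features) -> None:
        self.buffer.append(features)
        self.seen += 1
        if self.seen % RETRAIN_EVERY == 0:
            self.model = self.model_factory()
            self.model.fit(list(self.buffer))

    def score(self, features) -> float:
        if self.model is None:
            return 0.0  # cold start: no verdict until the first retrain
        return self.model.score_samples([features])[0]
```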
The Problem: A news headline alone is weak signal. Most articles don't move markets. A volume spike alone is ambiguous โ it could be institutional rebalancing, an index inclusion, or random noise. Neither stream is particularly useful on its own.
The Insight: When a news article with strong sentiment AND unusual market volume occur within 60 seconds of each other, the probability that the market is reacting to the news in real time is significantly higher. This is the joint signal that matters.
How It Works:
Stream 1 โ Equity trades (market.trades): Polygon.io WebSocket โ Kafka โ Faust โ computes volume z-score per ticker. Z-score cached in memory with timestamp for fast correlation lookups.
Stream 2 โ News headlines (market.news): Finnhub API polling every 60s โ Kafka โ VADER sentiment analysis. On each article: look up ticker's z-score cache.
Join logic (60-second temporal window): IF published_at is within 60 seconds of a z-score > 2.0: correlation detected. strength = f(sentiment_magnitude, z-score_magnitude). strong: |sentiment| >= 0.5 AND z-score >= 3.0. moderate: |sentiment| >= 0.2 AND z-score >= 2.0. weak: correlated but low magnitude.
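The cache-and-join logic above can be sketched end to end. Class and function names plus the 60-sample rolling window are assumptions; the thresholds and strength labels are the ones stated in the join rules.

```python
import statistics
import time
from collections import defaultdict, deque

class ZScoreCache:
    """Stream 1: rolling volume z-score per ticker, cached for the news stream."""

    def __init__(self, window: int = 60):
        self.volumes = defaultdict(lambda: deque(maxlen=window))
        self.latest = {}  # ticker -> (zscore, epoch_seconds)

    def update(self, ticker, volume, ts=None):
        window = self.volumes[ticker]
        window.append(volume)
        if len(window) >= 2:
            stdev = statistics.pstdev(window)
            z = (volume - statistics.fmean(window)) / stdev if stdev else 0.0
            self.latest[ticker] = (z, ts if ts is not None else time.time())
        return self.latest.get(ticker)

def correlation_strength(sentiment: float, zscore: float, within_60s: bool) -> str:
    """Stream 2 join: classify a headline/volume coincidence."""
    if not within_60s or zscore <= 2.0:
        return "none"
    if abs(sentiment) >= 0.5 and zscore >= 3.0:
        return "strong"
    if abs(sentiment) >= 0.2:
        return "moderate"
    return "weak"  # correlated, but low magnitude
```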
Why VADER over FinBERT: VADER: <1ms inference, rule-based, no GPU required. FinBERT: 200-400ms inference, requires a GPU for production throughput. For real-time stream processing on a single server, VADER's speed tradeoff is correct. FinBERT could enrich high-signal events asynchronously without blocking the main stream, a natural future extension.
Why 60 seconds: Algorithmic traders typically react to headlines within seconds. Human traders within minutes. A 60-second window captures the initial market reaction while excluding longer-term price drift that may have other causes.
Chosen over Kinesis for replay capability and richer ecosystem.
Python-native stream processing; lighter than Spark for this scale.
Zero-ops managed service; conditional writes for idempotency.
Columnar format and Snappy for query efficiency and cost.
Isolation Forest for multi-dimensional anomaly detection.
Async, auto OpenAPI, Pydantic validation built-in.
One-command local stack; production parity.
Metrics, alarms, dashboards for all components.
Free real-time news API. Chosen over NewsAPI for company-level filtering and higher free tier limits. 60-second polling sufficient for news frequency.
Rule-based NLP scoring <1ms per headline. Chosen over FinBERT (200-400ms) for real-time stream latency requirements. Compound score: -1.0 to +1.0.
| Metric | Result | Status |
|---|---|---|
| Sustained throughput | 14,800 e/sec | ✓ Target |
| Peak throughput (30s burst) | 31,200 e/sec | ✓ 2.1x base |
| End-to-end latency p50 | 12ms | ✓ <100ms |
| End-to-end latency p99 | 67ms | ✓ <100ms |
| DynamoDB write latency p99 | 18ms | ✓ Healthy |
| Kafka consumer lag (sustained) | <50 messages | ✓ Normal |
| Anomaly detection inference | 0.3ms/event | ✓ Real-time |
Test methodology: Custom Python load test simulating a market-open burst across 5 tickers. Duration: 10 minutes sustained plus a 30-second burst at 3x. Measurement: from WebSocket receive to confirmed DynamoDB write.