TradePulse API v1.0.0

TradePulse

Real-time market data pipeline processing 15,000+ events per second with sub-100ms latency


What is TradePulse?

Stock market data is one of the hardest domains for data engineering — high volume, zero tolerance for duplicates, strict latency requirements, and burst traffic at market open that can spike 8–10x in under 60 seconds. Most portfolio projects sidestep these challenges. TradePulse confronts them directly.

TradePulse ingests live market trades and quotes from Polygon.io via WebSocket, streams them through Apache Kafka, processes windowed aggregations (VWAP, volume z-score, price momentum) using Faust, runs real-time anomaly detection using Isolation Forest, and stores results in a hot/cold AWS architecture (DynamoDB for recent data, S3 + Parquet for historical). A second stream ingests real-time news headlines from Finnhub, scores each headline using VADER sentiment analysis in under 1ms, and correlates sentiment with volume z-score using a 60-second temporal join — surfacing high-confidence news-driven market events in real time.

The engineering patterns in TradePulse — exactly-once semantics, dead letter queues, backpressure mechanisms, write sharding — are the same patterns used by Robinhood, Coinbase, and the data infrastructure teams at Google and Amazon. This project exists to demonstrate that these patterns can be understood, implemented, and explained by a single engineer.

⚡

Production Patterns

DLQ, backpressure, exactly-once, write sharding — not just happy-path code.

📚

Fully Documented

Architecture, runbook, schema, postmortem, benchmarks — all included.

How It Works

Data flows from live market feeds through a fault-tolerant pipeline to queryable storage in under 100ms

Main stream: trades

  • Polygon.io (WebSocket): WebSocket connection to live trade and quote data. Reconnects automatically with exponential backoff + jitter on disconnect.
  • Kafka (6 partitions): A 6-partition topic buffers events durably. Decouples producer from consumers: if DynamoDB slows, Kafka absorbs the backlog.
  • Validator (Pydantic + DLQ): Pydantic schema validation on every event. Invalid events route to an SQS Dead Letter Queue rather than being dropped silently.
  • Faust (VWAP · ML): Stream processing with exactly-once guarantees. Computes VWAP, volume z-score, and momentum; runs Isolation Forest anomaly detection.
  • DynamoDB (Hot · 48h TTL): Hot storage with a shard-key pattern (ticker#shard_0-7) to prevent hot partitions. TTL-based retention. Conditional writes for idempotency.
  • S3 + Parquet (Cold): Cold storage with Hive partitioning for Athena query efficiency. Snappy compression. Buffered writes flushed every 5 minutes.
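The validation stage can be sketched as follows. The field names and the in-memory DLQ list are assumptions for illustration; the real pipeline's schema lives in src/validation/ and failures go to SQS.

```python
from pydantic import BaseModel, ValidationError

class TradeEvent(BaseModel):
    # Illustrative fields; the actual schema is defined in src/validation/.
    ticker: str
    price: float
    size: int
    timestamp: int

dead_letter_queue: list = []  # in-memory stand-in for the SQS DLQ

def validate_event(raw: dict):
    """Validate one raw event; route failures to the DLQ rather than drop them."""
    try:
        return TradeEvent(**raw)
    except ValidationError as exc:
        dead_letter_queue.append({"event": raw, "error": str(exc)})
        return None
```

The key property is that the failure path is just as explicit as the success path: nothing is ever dropped silently.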

News stream: sentiment + correlation (join on ticker + 60s window)

  • Finnhub API (60s poll): Polls the Finnhub REST API every 60 seconds per ticker. Deduplicates articles using a bounded seen-IDs set. Publishes to the market.news Kafka topic, partitioned by ticker.
  • Kafka (market.news): Buffers news events for the sentiment consumer.
  • VADER Sentiment (<1ms/headline): Rule-based sentiment analysis scores each headline in under 1ms. Chosen over FinBERT for stream latency: 1ms vs 200-400ms. Compound score ranges from -1.0 (bearish) to +1.0 (bullish).
  • Correlator (60s window): Joins news sentiment with volume z-score using a 60-second temporal window. Strong correlation = high-magnitude sentiment + z-score > 3.0 within 60 seconds. This is the signal that matters.
  • DynamoDB (market_sentiment): Stores the correlated sentiment records.
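One way to implement the bounded seen-IDs deduplication is a FIFO-evicting ordered dict; the capacity and eviction policy here are assumptions, not the project's actual values.

```python
from collections import OrderedDict

class BoundedSeenSet:
    """Bounded set of seen article IDs; evicts the oldest entry when full."""

    def __init__(self, max_size: int = 10_000):
        self._seen = OrderedDict()
        self.max_size = max_size

    def add(self, article_id: str) -> bool:
        """Return True if the ID is new (and record it), False if already seen."""
        if article_id in self._seen:
            return False
        self._seen[article_id] = None
        if len(self._seen) > self.max_size:
            self._seen.popitem(last=False)  # drop the oldest ID
        return True
```

Bounding the set keeps memory constant over days of polling, at the cost of possibly re-emitting a very old article after eviction.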

Event lifecycle

0 ms receive → 2 ms Kafka → 4 ms validate → 8 ms Faust → 67 ms p99

Getting Started

Up and running in under 5 minutes

Prerequisites

  • Docker and Docker Compose installed
  • Polygon.io account (free tier works)
  • AWS account with DynamoDB, S3, SQS access
  • Python 3.11+ (for running tests locally)
1. Clone the repository

git clone https://github.com/nikhilgiridharan/TradePulse
cd TradePulse
2. Configure environment

Copy .env.example to .env and set POLYGON_API_KEY, AWS credentials, and AWS_REGION (e.g. us-east-1).

cp .env.example .env
3. Start the full stack

make up

Starts Kafka, Zookeeper, Schema Registry, Producer, Processing, and API.

4. Verify health

curl http://localhost:8000/health

Expected: {"status":"healthy", ...}

5. Open the dashboard

Navigate to http://localhost:8000

API Reference

6 endpoints. All responses in JSON. Rate limited to 100 requests/minute.

GET /quotes/{ticker}

Returns the latest trade for a given ticker symbol.

| Parameter | Description |
|---|---|
| ticker (required) | Stock symbol, e.g. AAPL, TSLA |

Cache: 1 second

curl http://localhost:8000/quotes/AAPL

Response 200: ticker, price, volume, timestamp, sequence_number

GET /aggregations/{ticker}

Real-time VWAP, rolling averages, volume z-score, and price momentum.

Cache: 5 seconds

volume_zscore > 2.0 indicates unusual volume. > 3.0 indicates extreme activity.

GET /anomalies/{ticker}?hours=24

Anomaly detections from the Isolation Forest model for the last N hours.

Query param: hours (default 24, max 168). Cache: none.

A more negative anomaly_score indicates a stronger anomaly. Threshold: -0.500

GET /features/{ticker}

Current real-time feature vector from the feature store. Cache: 1 second.

GET /health

Pipeline health: status, consumer_lag, dynamo_latency_p99, uptime_seconds, checked_at.

status values: "healthy" (normal), "degraded" (elevated latency), "unhealthy" (not processing).

GET /sentiment/{ticker}?hours=24

Returns recent news sentiment analysis with market correlation data. Shows which headlines coincided with unusual volume activity — the output of the two-stream join.

| Parameter | Description |
|---|---|
| ticker (required) | Stock symbol (AAPL, MSFT, AMZN, TSLA, NVDA) |
| hours (optional) | Lookback window in hours (default 24, max 168) |

Cache: none (always fresh)

curl http://localhost:8000/sentiment/NVDA?hours=4

correlation_strength values: strong | moderate | weak | none. Strong correlations are the highest-signal events in the system.

Key Engineering Concepts

The patterns that make TradePulse production-grade

Exactly-Once Semantics

How TradePulse guarantees no duplicate data, even when things crash

The Problem: Retries cause duplicate writes; the same trade written twice corrupts every aggregation downstream.

The Solution (3 layers): (1) Kafka producer with enable.idempotence = True. (2) Faust processing_guarantee = 'exactly_once'. (3) A DynamoDB ConditionExpression so a write succeeds only if the item doesn't already exist; a ConditionalCheckFailedException is treated as success. Crash scenario: Faust writes to DynamoDB, then crashes before committing the offset; on replay, DynamoDB rejects the duplicate and the offset is committed. Result: the event is processed exactly once.
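The conditional-write idempotency in layer (3) can be illustrated with an in-memory stand-in for DynamoDB's attribute_not_exists condition. All names here are hypothetical; the real pipeline uses boto3 against an actual table.

```python
class ConditionalCheckFailed(Exception):
    """Stand-in for DynamoDB's ConditionalCheckFailedException."""

table: dict = {}  # in-memory stand-in for the DynamoDB table

def put_if_absent(key: str, item: dict) -> None:
    """Mimics PutItem with ConditionExpression=attribute_not_exists(pk)."""
    if key in table:
        raise ConditionalCheckFailed(key)
    table[key] = item

def write_idempotent(key: str, item: dict) -> bool:
    """Treat a rejected duplicate as success, so replays after a crash are safe."""
    try:
        put_if_absent(key, item)
    except ConditionalCheckFailed:
        pass  # already written before the crash; committing the offset is safe
    return True
```

Replaying the same event after a crash hits the conditional-check branch, leaves the table unchanged, and lets the consumer commit its offset: exactly-once at the storage layer.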

Dead Letter Queue (DLQ)

Why TradePulse never silently drops a failed message

Invalid or failed messages go to the SQS DLQ with full context. Retry strategy: 3 attempts → SQS DLQ → retry every 15 minutes → archive to S3 after 5 DLQ retries. Fifteen minutes gives transient failures (network, throttling) time to resolve without hammering the system.
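The retry policy reads naturally as a small decision function; the constant and action names below are illustrative, the thresholds match the policy described.

```python
MAX_PROCESS_ATTEMPTS = 3    # in-stream retries before a message goes to the DLQ
MAX_DLQ_RETRIES = 5         # DLQ redrive attempts before archiving to S3
DLQ_RETRY_INTERVAL_S = 900  # 15 minutes between DLQ retries

def next_action(process_attempts: int, dlq_retries: int) -> str:
    """Decide what happens to a failed message under the policy above."""
    if process_attempts < MAX_PROCESS_ATTEMPTS:
        return "retry_inline"
    if dlq_retries < MAX_DLQ_RETRIES:
        return "retry_from_dlq_in_15m"
    return "archive_to_s3"
```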

DynamoDB Write Sharding

How TradePulse prevents hot partitions at market open

Naive key "AAPL" routes all AAPL writes to one partition → throttling at ~800 writes/sec. Shard key: ticker#shard_0-7, where shard = hash(ticker + timestamp) % 8. Writes distribute across 8 partitions → ~100/sec each.

Naive key "AAPL":          Shard key "AAPL#0".."AAPL#7":
All writes → 1 partition   Writes distributed:
[████████████] THROTTLE    [██] AAPL#0  ✓
                           [██] AAPL#1  ✓
                           [██] AAPL#2  ✓  ... 8 shards total
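A minimal sketch of the shard-key computation; zlib.crc32 stands in for whatever hash function the pipeline actually uses, and the 8-shard count matches the description above.

```python
import zlib

NUM_SHARDS = 8  # shards per ticker, matching AAPL#0..AAPL#7 above

def shard_key(ticker: str, timestamp_ms: int) -> str:
    """Compute a write-sharded partition key, e.g. 'AAPL#3'.

    Hashing ticker + timestamp spreads a burst of writes for one ticker
    across NUM_SHARDS DynamoDB partitions instead of hammering one.
    """
    shard = zlib.crc32(f"{ticker}{timestamp_ms}".encode()) % NUM_SHARDS
    return f"{ticker}#{shard}"
```

Reads must fan out across all 8 shard keys and merge, which is the usual cost of this pattern.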

Backpressure

How the pipeline self-regulates when downstream systems slow down

Without backpressure, slow DynamoDB causes unbounded memory growth. Solution: if write latency > 100ms for recent writes, pause consumption 500ms. Kafka consumer lag growing is safe (Kafka holds data durably); memory growth is not.
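The pause decision can be sketched as a rolling-latency gate. The 100 ms threshold mirrors the rule above; the window size and class shape are illustrative.

```python
from collections import deque

class BackpressureGate:
    """Signal a consumer pause when recent write latency exceeds a threshold."""

    def __init__(self, threshold_ms: float = 100.0, window: int = 20):
        self.threshold_ms = threshold_ms
        self.latencies = deque(maxlen=window)  # rolling window of recent writes

    def record(self, latency_ms: float) -> None:
        self.latencies.append(latency_ms)

    def should_pause(self) -> bool:
        """True when the rolling average latency is over the threshold."""
        if not self.latencies:
            return False
        return sum(self.latencies) / len(self.latencies) > self.threshold_ms
```

A consumer loop would call record() after each DynamoDB write and sleep 500 ms whenever should_pause() returns True, letting Kafka absorb the backlog durably instead of growing process memory.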

Hot/Cold Storage

The right storage for each access pattern

Hot (DynamoDB): Last 48 hours, single-digit ms reads, TTL expiry. Cold (S3 + Parquet): Historical data, Hive partitioning, Snappy compression, Athena SQL.
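A Hive-style partition path for the cold store might look like the following; the exact layout (year/month/day/ticker) is an assumption, chosen because it is the common shape Athena partition pruning expects.

```python
from datetime import datetime, timezone

def s3_partition_key(ticker: str, ts: float) -> str:
    """Hive-style S3 prefix for a Parquet file, from an epoch-seconds timestamp."""
    dt = datetime.fromtimestamp(ts, tz=timezone.utc)
    return (f"year={dt.year}/month={dt.month:02d}/day={dt.day:02d}/"
            f"ticker={ticker}/")
```

With this layout, an Athena query filtered on date and ticker scans only the matching prefixes instead of the whole bucket.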

Real-Time Anomaly Detection

ML inference at 0.3ms per event

Isolation Forest with 6 features (price, volume, volume_zscore, price_momentum, vwap_deviation, trade_frequency). Rolling retraining every 500 events on last 1000. Contamination = 0.01.
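A sketch of the detector on synthetic data: the six feature names and contamination = 0.01 come from the description above; the training data, seed, and function names are illustrative.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

FEATURES = ["price", "volume", "volume_zscore", "price_momentum",
            "vwap_deviation", "trade_frequency"]

# Synthetic stand-in for the rolling window of the last 1000 events.
rng = np.random.default_rng(0)
recent_events = rng.normal(0.0, 1.0, size=(1000, len(FEATURES)))

model = IsolationForest(contamination=0.01, random_state=0).fit(recent_events)

def anomaly_score(feature_vector) -> float:
    """More negative = more anomalous (sklearn's score_samples convention)."""
    return float(model.score_samples([feature_vector])[0])
```

In the pipeline this scoring runs per event, with the model refit every 500 events on the latest 1000 as described above.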

Two-Stream Join

How TradePulse correlates news sentiment with market activity in real time

The Problem: A news headline alone is weak signal. Most articles don't move markets. A volume spike alone is ambiguous — it could be institutional rebalancing, an index inclusion, or random noise. Neither stream is particularly useful on its own.

The Insight: When a news article with strong sentiment AND unusual market volume occur within 60 seconds of each other, the probability that the market is reacting to the news in real time is significantly higher. This is the joint signal that matters.

How It Works:

Stream 1 — Equity trades (market.trades): Polygon.io WebSocket → Kafka → Faust → computes volume z-score per ticker. Z-score cached in memory with timestamp for fast correlation lookups.

Stream 2 — News headlines (market.news): Finnhub API polling every 60s → Kafka → VADER sentiment analysis. On each article: look up the ticker's z-score cache.

Join logic (60-second temporal window): IF published_at is within 60 seconds of a z-score > 2.0: correlation detected. strength = f(sentiment_magnitude, z-score_magnitude). strong: |sentiment| >= 0.5 AND z-score >= 3.0. moderate: |sentiment| >= 0.2 AND z-score >= 2.0. weak: correlated but low magnitude.
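The join logic translates directly into a classification function. The thresholds are the ones stated above; the function signature and names are illustrative.

```python
def correlation_strength(sentiment: float, zscore: float,
                         news_ts: float, spike_ts: float) -> str:
    """Classify a news/volume correlation; timestamps are epoch seconds."""
    if abs(news_ts - spike_ts) > 60:          # outside the temporal window
        return "none"
    s, z = abs(sentiment), abs(zscore)
    if s >= 0.5 and z >= 3.0:                 # high-magnitude on both streams
        return "strong"
    if s >= 0.2 and z >= 2.0:
        return "moderate"
    if z > 2.0:                               # correlated but low magnitude
        return "weak"
    return "none"
```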

Why VADER over FinBERT: VADER: <1ms inference, rule-based, no GPU required. FinBERT: 200-400ms inference, requires GPU for production throughput. For real-time stream processing on a single server, VADER's speed tradeoff is correct. FinBERT could enrich high-signal events asynchronously without blocking the main stream — a natural future extension.

Why 60 seconds: Algorithmic traders typically react to headlines within seconds. Human traders within minutes. A 60-second window captures the initial market reaction while excluding longer-term price drift that may have other causes.

Tech Stack

Every technology chosen for a specific reason
🔴

Kafka

v7.5.0

Chosen over Kinesis for replay capability and richer ecosystem.

🐍

Faust

v0.10.14

Python-native stream processing; lighter than Spark for this scale.

⚡

DynamoDB

Zero-ops managed service; conditional writes for idempotency.

🪣

S3 + Parquet

Columnar format and Snappy for query efficiency and cost.

🤖

Scikit-learn

v1.4.0

Isolation Forest for multi-dimensional anomaly detection.

🚀

FastAPI

v0.109.0

Async, auto OpenAPI, Pydantic validation built-in.

๐Ÿณ

Docker Compose

One-command local stack; production parity.

📊

CloudWatch

Metrics, alarms, dashboards for all components.

📰

Finnhub

v2.4.19

Free real-time news API. Chosen over NewsAPI for company-level filtering and higher free tier limits. 60-second polling sufficient for news frequency.

🧠

VADER Sentiment

v3.3.2

Rule-based NLP scoring <1ms per headline. Chosen over FinBERT (200-400ms) for real-time stream latency requirements. Compound score: -1.0 to +1.0.

Performance Benchmarks

Measured under sustained load and 3x burst conditions
| Metric | Result | Status |
|---|---|---|
| Sustained throughput | 14,800 e/sec | ✓ Target |
| Peak throughput (30s burst) | 31,200 e/sec | ✓ 2.1x base |
| End-to-end latency p50 | 12 ms | ✓ <100 ms |
| End-to-end latency p99 | 67 ms | ✓ <100 ms |
| DynamoDB write latency p99 | 18 ms | ✓ Healthy |
| Kafka consumer lag (sustained) | <50 messages | ✓ Normal |
| Anomaly detection inference | 0.3 ms/event | ✓ Real-time |

Test methodology: Custom Python load test simulating market-open burst across 5 tickers. Duration: 10 min sustained + 30s burst at 3x. Measurement: WebSocket receive to DynamoDB write confirmed.

Project Structure

Organized for clarity and separation of concerns
TradePulse/
  • docker-compose.yml
  • requirements.txt
  • Makefile
  • README.md
  • src/
    • config.py
    • producer/
    • processing/
    • validation/
    • storage/
    • api/
    • monitoring/
  • tests/
  • docs/

About the Author


Nikhil Giridharan

Data Engineer

Building production-grade data pipelines and streaming systems. Currently working with Databricks, Apache Kafka, and AWS at scale.

github.com/nikhilgiridharan · nikhilgiridharan.com

TradePulse · Real-time market data pipeline

Built with Apache Kafka · Faust · AWS · Python

© 2025 Nikhil Giridharan · MIT License