Hedge Funds — Article 3 of 12

Backtesting at Scale — Cloud HPC and Event-Driven Simulation

Modern systematic funds run millions of strategy simulations against decades of tick data. This deep-dive covers the architecture, cloud economics, and statistical discipline required to backtest at institutional scale without fooling yourself.


A mid-sized systematic equity manager we worked with in 2024 was running roughly 40,000 strategy backtests per month on a 96-core on-prem cluster. Each full-universe US equity backtest over 15 years of minute-bar data took 6-8 hours. Their research throughput was the binding constraint on alpha generation — not ideas, not data, not talent. After re-architecting onto a Ray-based event-driven engine running on AWS Graviton3 spot fleets, the same backtest completed in 11 minutes at roughly 38% of the previous all-in cost. The research team went from testing 3 hypotheses per analyst per week to 25-30. Within nine months, two of the new strategies were in production with combined Sharpe above 1.4.

This is the operational reality of backtesting at modern hedge funds: it is not a Jupyter notebook exercise, it is an HPC discipline. The firms producing durable systematic alpha (Renaissance, Two Sigma, AQR, Man AHL, Citadel's quant pods) treat their simulation infrastructure with the same rigor that semiconductor companies apply to their EDA toolchains. The infrastructure choices determine what research is even thinkable. This article is the third in our Systematic Alpha guide and assumes you have read the architecture and alternative data pieces.

Event-Driven vs. Vectorized: Choosing the Simulation Paradigm

There are two dominant simulation architectures, and conflating them is the single most common mistake we see in research platform design. Vectorized backtesters (pandas/NumPy research code, vectorbt, QuantConnect's research mode) operate on rectangular arrays of prices and signals. They are 50-200x faster than event-driven systems for simple long-short equity strategies because the whole backtest reduces to a handful of optimized array operations rather than a per-event loop. A 10-year daily backtest on 3,000 names runs in 2-4 seconds. The cost: they cannot model order book dynamics, partial fills, queue position, or stateful risk limits without breaking the vectorization.
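As a rough illustration of the vectorized paradigm, the sketch below runs a cross-sectional momentum long-short book as a few NumPy array operations. The function name `vectorized_backtest`, the momentum signal, and the close-to-close, zero-cost fill assumption are all illustrative choices, not taken from any particular library.

```python
import numpy as np

def vectorized_backtest(prices: np.ndarray, lookback: int = 20) -> np.ndarray:
    """Daily PnL of a dollar-neutral momentum book (hypothetical example).

    prices: array of shape (n_days, n_names) of close prices.
    Fills are assumed at the close with no costs, which is the
    simplification the vectorized paradigm trades for speed.
    """
    returns = prices[1:] / prices[:-1] - 1.0          # return from day t to t+1
    # Trailing return over `lookback` days, known at the close of day t.
    momentum = prices[lookback:] / prices[:-lookback] - 1.0
    # Demean across names (dollar-neutral), then scale to unit gross exposure.
    signal = momentum - momentum.mean(axis=1, keepdims=True)
    gross = np.abs(signal).sum(axis=1, keepdims=True)
    weights = np.divide(signal, gross, out=np.zeros_like(signal), where=gross > 0)
    # Weights set at the close of day t earn the return of day t+1 (no look-ahead).
    next_day_returns = returns[lookback:]
    return (weights[:-1] * next_day_returns).sum(axis=1)
```

Nothing in this function can represent a partial fill or a position limit that depends on earlier fills, which is the limitation described above.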

Event-driven engines — Nautilus Trader, Lean (QuantConnect's production engine), Backtrader, and most proprietary systems at quant funds — process a chronological event stream: market data tick, signal, order, fill, position update, risk check. Each event mutates state, and the same code path runs in backtest and production. This is the only viable approach for intraday strategies, options market-making, futures roll logic, or anything where the strategy reacts to its own fills. The penalty is throughput: a tick-level event-driven backtest of a single equities strategy over one year can process 4-8 billion events and consume 90 minutes on a single core.
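The event chain described above (tick, signal, order, fill, position update, risk check) can be sketched in a few lines. This is a toy under stated assumptions: the `Tick` and `Engine` classes, the buy-toward-target strategy, and the fill rule (fill whatever size the tick shows) are hypothetical and far simpler than a production engine such as Nautilus Trader or Lean.

```python
from dataclasses import dataclass, field

@dataclass
class Tick:
    ts: int          # event time, e.g. nanoseconds since epoch
    price: float
    size: int        # shares available at this price

@dataclass
class Engine:
    target: int                      # shares the strategy wants to hold
    max_position: int                # stateful risk limit
    position: int = 0
    cash: float = 0.0
    fills: list = field(default_factory=list)

    def on_tick(self, tick: Tick) -> None:
        # Signal -> order: work toward the target, capped by the risk limit.
        wanted = min(self.target, self.max_position) - self.position
        if wanted <= 0:
            return
        # Order -> fill: partial fill when the tick shows less size than wanted.
        self.on_fill(tick.ts, tick.price, min(wanted, tick.size))

    def on_fill(self, ts: int, price: float, qty: int) -> None:
        # Fill -> position update: state mutates, so the next tick sees it.
        self.position += qty
        self.cash -= qty * price
        self.fills.append((ts, price, qty))

def run(engine: Engine, ticks) -> Engine:
    for tick in sorted(ticks, key=lambda t: t.ts):   # strict chronological order
        engine.on_tick(tick)
    return engine
```

Because each order depends on the position left by earlier fills, the loop cannot be collapsed into a single array expression without changing its behavior.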

Vectorized vs. Event-Driven Backtesting
Dimension | Vectorized | Event-Driven
Throughput (daily bars, 3k names, 10y) | 2-4 seconds | 4-12 minutes
Microstructure realism | Low: close-to-close fills | High: queue, latency, partial fills
Production code parity | Re-implement for live | Same engine in backtest and live
Suitable horizon | Daily/weekly factor strategies | Intraday, HFT, options MM
Typical tools | vectorbt, pandas/NumPy | Nautilus Trader, Lean, custom C++/Rust
State management | Stateless, functional | Stateful, message-passing

The pragmatic answer for most multi-strategy shops: run both. Use vectorized engines for factor screening and hyperparameter sweeps where 80% accuracy in 2 seconds beats 99% in 10 minutes. Promote surviving candidates into the event-driven engine for realistic execution simulation and final sign-off. The two engines must share the same data layer and the same instrument master — otherwise you will spend six months debugging why the vectorized backtest shows 1.8 Sharpe and the event-driven shows 0.6.

The Cloud HPC Substrate: Economics and Patterns

On-prem HPC for backtesting peaked around 2018. The economics no longer make sense for any fund below roughly $5B AUM running fewer than 8,000 cores continuously. A typical 1,024-core on-prem cluster with InfiniBand, parallel filesystem (GPFS or Lustre), and 24x7 ops carries fully-loaded costs of $2.8-3.6M per year. The same compute capacity rented as AWS Graviton3 spot instances (c7g.16xlarge at roughly $0.42/hour spot in us-east-1 as of Q1 2026) costs $1.5-1.9M annually at 70% utilization — and scales to 50,000 cores for a 2-hour burst when needed.

$0.0066: approximate per-vCPU-hour cost of AWS Graviton3 spot at the $0.42/hour c7g.16xlarge rate quoted above (64 vCPUs per instance), well below equivalent x86 on-prem TCO

The three architectural patterns that work in production:

Pattern 1: Embarrassingly parallel sweeps via Ray or Dask. The research workflow generates a Cartesian product of parameters (lookback windows, signal thresholds, universe slices) and dispatches each combination as an independent task. Anyscale's managed Ray, Coiled's managed Dask, and AWS Batch on Fargate Spot all handle this well. Two Sigma and Man AHL have spoken publicly about Ray-based research clusters with 10,000+ concurrent workers. The throughput math: 50,000 parameter combinations × 90 seconds each = 1,250 core-hours, or roughly $8 of Graviton spot at the $0.42/hour c7g.16xlarge rate (64 vCPUs, about $0.0066 per vCPU-hour). Run it over lunch.

Pattern 2: GPU-accelerated vectorized backtesting. For factor research on equity universes, the inner loop is matrix multiplication. NVIDIA RAPIDS (cuDF), CuPy, and JAX on A100/H100 deliver 80-300x speedup over CPU pandas for portfolio construction operations. WorldQuant's Brain platform and Qube Research's internal stack lean heavily on GPU acceleration. A typical pattern: keep 5-10 years of daily returns and factor exposures resident in GPU memory (80GB H100 holds the entire Russell 3000 history with room to spare), then sweep alpha combinations in seconds.

Pattern 3: Event-driven scaling via shard-by-instrument or shard-by-time. Event-driven engines cannot trivially parallelize because of state dependencies. The workable shardings: parallelize across instruments (works for single-name strategies, breaks for pair trades and portfolio strategies) or across non-overlapping time windows with state checkpoints at boundaries. Citadel and Jane Street have invested heavily in deterministic, replayable event engines specifically to make time-sharding work without violating causality.
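The parameter-sweep pattern and its throughput arithmetic can be sketched with the standard library alone. This example deliberately uses `concurrent.futures` as a local stand-in for a Ray or Dask cluster. The helper names (`parameter_grid`, `run_sweep`, `sweep_cost`) are hypothetical, and the per-core rate is an input you supply, not a quoted price.

```python
import itertools
from concurrent.futures import Executor, ThreadPoolExecutor

def parameter_grid(**axes):
    """Cartesian product of named parameter lists, as a list of dicts."""
    names = list(axes)
    return [dict(zip(names, combo)) for combo in itertools.product(*axes.values())]

def run_sweep(backtest, grid, executor: Executor):
    """Dispatch each combination as an independent task; input order is preserved."""
    return list(executor.map(lambda params: (params, backtest(**params)), grid))

def sweep_cost(n_tasks: int, seconds_per_task: float, usd_per_core_hour: float):
    """Return (core_hours, usd) for a sweep of single-core tasks."""
    core_hours = n_tasks * seconds_per_task / 3600.0
    return core_hours, core_hours * usd_per_core_hour

if __name__ == "__main__":
    grid = parameter_grid(lookback=[10, 20, 60], threshold=[0.5, 1.0])
    with ThreadPoolExecutor(max_workers=4) as pool:
        # Placeholder "backtest": a real one would return a Sharpe ratio or PnL.
        results = run_sweep(lambda lookback, threshold: lookback * threshold, grid, pool)
    # Assumed rate: $0.42/hour for a 64-vCPU instance.
    print(sweep_cost(50_000, 90, 0.42 / 64))
```

Because the tasks share no state, the same shape maps directly onto a distributed task scheduler. The dollar figure depends entirely on the per-core rate you plug in.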

⚠️Spot Instance Discipline
Spot interruptions average 5-8% per day for c7g instances in busy regions. Your backtest framework must checkpoint state every 60-300 seconds and resume on a new instance within 90 seconds, or you will burn 20-30% of your compute on restarts. Build the checkpoint/resume layer before you scale up — retrofitting it onto a working backtester typically requires a 6-10 week rewrite.
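A minimal checkpoint/resume sketch follows. It assumes the backtest state fits in a small JSON-serializable dict and that the checkpoint lives on a filesystem path, where a real deployment would more likely write to durable object storage. For simplicity it checkpoints every N events rather than every 60-300 seconds. All function names are hypothetical.

```python
import json
import os
import tempfile

def save_checkpoint(path: str, state: dict) -> None:
    """Write atomically so an interruption mid-write cannot corrupt the file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, path)           # atomic rename

def load_checkpoint(path: str, default: dict) -> dict:
    if not os.path.exists(path):
        return dict(default)
    with open(path) as f:
        return json.load(f)

def run_backtest(events, path: str, checkpoint_every: int = 1000, stop_after=None):
    """Replay `events` (PnL increments in this toy), resuming from `path`."""
    state = load_checkpoint(path, {"next_index": 0, "pnl": 0.0})
    for i in range(state["next_index"], len(events)):
        if stop_after is not None and i >= stop_after:
            break                   # simulates a spot interruption
        state["pnl"] += events[i]
        state["next_index"] = i + 1
        if state["next_index"] % checkpoint_every == 0:
            save_checkpoint(path, state)
    return state
```

Work done since the last checkpoint is lost on interruption, so the checkpoint interval directly bounds the compute wasted per restart.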

The Data Layer: Point-in-Time or It Didn't Happen

Backtest infrastructure without point-in-time (PIT) data discipline is a Sharpe-ratio fabrication machine. Three sins account for roughly 80% of the spurious alpha we see in due-diligence reviews:

Restated fundamentals. Compustat, FactSet, and Refinitiv all restate financials when companies file corrections. A 2024 study by S&P Global Market Intelligence found that 23% of quarterly earnings figures are restated at least once within 18 months. Backtesting on the latest restated values instead of PIT-vintaged, as-first-reported values inflates value-factor backtests by 60-110 bps annualized. Vendors that solve this properly: S&P Compustat Point-in-Time, FactSet Fundamentals PIT, Wharton WRDS PIT snapshots.

Survivorship bias in the security master. Backtests run on the current Russell 3000 implicitly exclude every delisted, bankrupt, and acquired company. The 2000-2002 and 2008-2009 returns of such backtests overstate live performance by 200-400 bps. CRSP and Norgate provide delisting-inclusive databases; any internal security master must carry effective dates and corporate action histories.

Look-ahead in alternative data. Credit card panels, satellite parking-lot counts, and web-scraped data are typically delivered with 2-21 day lags but often timestamped at the underlying event date. A naive join produces a backtest that 'knew' Q3 consumer spending three weeks before it was actually available. The fix — discussed in detail in our alternative data pipelines article — is dual-timestamping every record with event_time and arrival_time, and joining only on arrival_time ≤ as_of.
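The dual-timestamp rule can be sketched as a small as-of lookup. The field names `event_time` and `arrival_time` come from the text. The `Record` and `PointInTimeSeries` classes and the integer day-number timestamps are illustrative assumptions.

```python
from bisect import bisect_right
from dataclasses import dataclass

@dataclass(frozen=True)
class Record:
    event_time: int      # when the underlying activity happened
    arrival_time: int    # when the vendor actually delivered the record
    value: float

class PointInTimeSeries:
    """Answers 'what was known as of t?' using arrival_time only."""

    def __init__(self, records):
        self._records = sorted(records, key=lambda r: r.arrival_time)
        self._arrivals = [r.arrival_time for r in self._records]

    def as_of(self, as_of: int):
        """Most recent event among records with arrival_time <= as_of, else None."""
        visible = self._records[: bisect_right(self._arrivals, as_of)]
        return max(visible, key=lambda r: r.event_time, default=None)
```

For example, a record with `event_time=1` and `arrival_time=15` is invisible to a backtest running as of day 10, even though the event it describes had already happened. A naive join on `event_time` would have leaked it.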

💡Did You Know?
Renaissance Technologies reportedly maintains a separate 'data physics' team whose sole job is to verify the temporal integrity of every dataset before it is admitted to the research environment. New datasets typically take 4-9 months to clear admission.

On the storage side, the practical choices for tick and bar data: kdb+ remains dominant at the largest quant shops (Citadel, Millennium, DRW) with column-oriented storage and q-language analytics, despite license costs of $25,000-75,000 per core. ArcticDB (Man AHL's open-source columnar store on S3) has gained substantial adoption since its 2023 release — it delivers kdb-like query latencies on commodity object storage at roughly 5% of the cost. Parquet on S3 with DuckDB or Polars query engines covers the long tail. The benchmark to beat: read 100 million ticks for a single name across a year in under 800ms.

Statistical Rigor: Walk-Forward, Purged CV, and the Multiple Testing Problem

Compute scale is dangerous without statistical discipline. If you run 10,000 independent backtests of strategies whose true Sharpe is zero, roughly 500 will clear a one-sided 5% significance test by pure chance, and with only a few years of data the best of them can show annualized Sharpe ratios around 2. Marcos López de Prado has published extensively on this; his deflated Sharpe ratio and probability of backtest overfitting (PBO) metrics should be table stakes.

Deflated Sharpe Ratio
DSR = Z[(SR - SR₀) × √(T-1) / √(1 - γ₃·SR + (γ₄-1)/4·SR²)]
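Z is the standard normal CDF, and SR is the observed Sharpe ratio expressed per return period (not annualized) so that it is consistent with T. SR₀ is the expected maximum Sharpe under the null hypothesis given N trials, approximated as √(2·ln(N))·σ̂_SR, where σ̂_SR is the standard deviation of the Sharpe estimates across trials. T is sample length, γ₃ and γ₄ are skewness and kurtosis of returns. A DSR above 0.95 means the observed Sharpe is statistically distinguishable from the best of N random strategies.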
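The formula above translates directly into code. The sketch below uses the √(2·ln N) approximation for SR₀ exactly as stated in the text, not a more refined estimator. It assumes per-period (not annualized) Sharpe ratios and raw kurtosis (3.0 for normal returns). The function names are illustrative.

```python
import math
from statistics import NormalDist

_STD_NORMAL = NormalDist()

def expected_max_sharpe(n_trials: int, sr_std: float) -> float:
    """SR0 via the sqrt(2 ln N) approximation; sr_std is the std of SR across trials."""
    return math.sqrt(2.0 * math.log(n_trials)) * sr_std

def deflated_sharpe(sr: float, sr0: float, t: int, skew: float, kurt: float) -> float:
    """DSR = Z[(SR - SR0) * sqrt(T-1) / sqrt(1 - skew*SR + (kurt-1)/4 * SR^2)].

    sr, sr0: per-period Sharpe ratios; t: number of return observations;
    skew, kurt: skewness and raw kurtosis of the strategy's returns.
    """
    denom = math.sqrt(1.0 - skew * sr + (kurt - 1.0) / 4.0 * sr * sr)
    return _STD_NORMAL.cdf((sr - sr0) * math.sqrt(t - 1) / denom)

if __name__ == "__main__":
    # Hypothetical inputs: daily SR of 0.10, 1,250 days, 1,000 trials.
    sr0 = expected_max_sharpe(n_trials=1000, sr_std=0.02)
    print(round(deflated_sharpe(0.10, sr0, 1250, skew=0.0, kurt=3.0), 3))
```

With these hypothetical inputs, a daily Sharpe of 0.10 (about 1.6 annualized) lands below the 0.95 threshold once 1,000 trials are accounted for. The same strategy evaluated as a single trial would pass comfortably.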

The three validation protocols that actually work at scale:

Walk-forward optimization. Train on a rolling 3-5 year window, test on the next 6-12 months, advance, repeat. The out-of-sample concatenated equity curve is your performance estimate. Walk-forward kills 60-70% of in-sample-fit strategies. It is computationally expensive — a 15-year walk-forward with monthly re-fits is 180 backtests, not one — which is exactly why cloud HPC matters.

Combinatorial Purged Cross-Validation (CPCV). López de Prado's method splits the timeline into N blocks, holds out k of them as test, and purges training observations whose labels overlap with the test window. For N=10, k=2, this yields C(10,2) = 45 train/test splits, which recombine into 9 full backtest paths instead of 1, and produces a distribution of Sharpe ratios rather than a point estimate. Strategies that look great in standard backtests but show 40%+ Sharpe variance across CPCV paths are almost certainly overfit.

Synthetic data and bootstrap validation. Generate 1,000 alternative price paths via block bootstrap or generative models (Wasserstein GANs trained on returns), run the strategy on each, and ask whether real-data performance falls in the top 5%. JPMorgan's Quant Research group and AQR have published on this. The compute is significant — 1,000× your normal backtest cost — but it is the only way to test against regime shifts the historical sample does not contain.
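Of the three protocols above, CPCV is the least intuitive to implement, so here is a minimal split generator. It assumes equal-sized contiguous blocks and a fixed label horizon `label_horizon` (in observations). It purges on both sides of each test block, which is a simplification of the purge-plus-embargo scheme. The function names are illustrative.

```python
from itertools import combinations
from math import comb

def cpcv_splits(n_obs: int, n_blocks: int, k_test: int, label_horizon: int = 0):
    """Yield (train_idx, test_idx) for combinatorial purged cross-validation.

    Observations 0..n_obs-1 are cut into n_blocks contiguous blocks and each
    split holds out k_test of them. Training observations within
    label_horizon of any test block are purged to avoid label overlap.
    """
    bounds = [(b * n_obs // n_blocks, (b + 1) * n_obs // n_blocks - 1)
              for b in range(n_blocks)]
    for test_blocks in combinations(range(n_blocks), k_test):
        spans = [bounds[b] for b in test_blocks]
        test_idx = [i for lo, hi in spans for i in range(lo, hi + 1)]
        train_idx = [
            i for i in range(n_obs)
            if not any(lo - label_horizon <= i <= hi + label_horizon for lo, hi in spans)
        ]
        yield train_idx, test_idx

def cpcv_counts(n_blocks: int, k_test: int):
    """(number of train/test splits, number of full backtest paths)."""
    n_splits = comb(n_blocks, k_test)
    return n_splits, n_splits * k_test // n_blocks
```

For 10 blocks with 2 held out, `cpcv_counts(10, 2)` gives 45 splits and 9 paths: each block appears in the test set of 9 splits, so the per-block test results can be stitched into 9 complete out-of-sample histories.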

The Research-to-Production Gap

The most expensive failure mode in systematic investing is the strategy that backtests beautifully and bleeds in production. Post-mortems on these failures rarely identify alpha decay — they almost always find a discrepancy between the simulation and the live environment. Three discrepancies dominate:

Execution model mismatch. The backtest assumed fills at the arrival price. Production routes through a smart order router (discussed in our SOR and TCA 2.0 article) that experiences 8-15 bps of implementation shortfall on mid-cap names. A strategy with 25 bps gross alpha per trade and 40x annual turnover trades $4B of notional per $100M deployed, so at 10 bps of shortfall it gives up roughly $4M per $100M per year.

Latency asymmetry. The backtest processed signal-to-order in zero time. Production has 200μs to 2ms of decision latency, 500μs to 5ms of network latency, and exchange matching engine delays. For strategies with holding periods under 30 minutes, this routinely halves the realized Sharpe.

Universe and constraint drift. The backtest universe was the academic CRSP universe. Production has internal restricted lists, ESG exclusions, prime broker borrow constraints, and counterparty limits. We have seen funds discover that 18% of their backtest's PnL came from names they could not actually trade.
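The execution-cost arithmetic behind the first discrepancy is simple enough to check by hand. The helper below is a hypothetical sketch that treats turnover as a multiple of deployed capital and applies a flat shortfall to all traded notional.

```python
def annual_shortfall_usd(capital_usd: float, annual_turnover: float,
                         shortfall_bps: float) -> float:
    """Dollars lost per year to implementation shortfall.

    annual_turnover is a multiple of capital (40.0 means 40x per year);
    shortfall_bps is the average slippage versus the backtest's fill price.
    """
    traded_notional = capital_usd * annual_turnover
    return traded_notional * shortfall_bps / 10_000.0

if __name__ == "__main__":
    # $100M deployed, 40x turnover, 10 bps shortfall (illustrative inputs).
    print(annual_shortfall_usd(100e6, 40.0, 10.0))
    # The same book at 0.4x turnover loses far less.
    print(annual_shortfall_usd(100e6, 0.4, 10.0))
```

The drag scales linearly with turnover, which is why a shortfall that barely matters for a slow factor book dominates the economics of a high-turnover strategy.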

"If your backtest engine and your production engine are different codebases, the difference between them is your unhedged research risk. Make them the same code or accept that your Sharpe will degrade 30-50% in live trading."

Head of Quant Engineering, top-five systematic fund

The solution is unified runtime: one engine, two modes. Nautilus Trader and QuantConnect's Lean were architected this way from inception. Many internal systems achieve the same outcome via a shared C++ core with separate market data adapters for historical replay and live feeds. The deterministic replay test — feed yesterday's market data through the production engine in simulation mode and confirm it produces the exact same orders the live system sent — should be part of nightly CI.
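The deterministic replay test reduces to an exact, position-by-position comparison of two order streams. The sketch below assumes both the live log and the replay can be loaded into a common `Order` record. The record's fields and the `replay_diff` helper are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Order:
    ts: int
    symbol: str
    side: str        # "BUY" or "SELL"
    qty: int
    limit_px: float

def replay_diff(live_orders, replayed_orders):
    """Return (index, live, replayed) for every position where the streams differ.

    An empty list means the replay reproduced the live order stream exactly.
    A missing order on either side is reported as None.
    """
    diffs = []
    for i in range(max(len(live_orders), len(replayed_orders))):
        live = live_orders[i] if i < len(live_orders) else None
        sim = replayed_orders[i] if i < len(replayed_orders) else None
        if live != sim:
            diffs.append((i, live, sim))
    return diffs
```

In a nightly CI job, a non-empty result would fail the build. The first differing index is usually the most useful clue, because later differences are often just consequences of it.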

Vendor Landscape and Build-vs-Buy

The build-vs-buy decision for backtesting infrastructure depends primarily on strategy type and AUM:

Backtesting Platform Options

QuantConnect / Lean: Open-source event-driven engine with managed cloud option. Strong for equities, options, futures, crypto. ~$50-500/month per researcher for managed tier. Used by 250+ funds.
Nautilus Trader: Rust-core event-driven platform with Python bindings. Free open-source. Microsecond-precision event handling. Best for funds with strong engineering teams.
MathWorks MATLAB / Financial Toolbox: Legacy choice at older systematic funds. Strong for factor research. $5-15k per seat annually. Production deployment is painful.
Custom C++/Rust on Ray/Dask: What the top-tier quants build. 18-36 month build-out, $4-15M initial cost, but no ceiling on capability. Justified above ~$3B systematic AUM.
kdb+/q + custom backtest layer: Tick-data dominant choice for HFT and market-making funds. License cost is real but query performance is unmatched for tick analytics.
WRDS Cloud + AWS Batch: Academic-style infrastructure with vendor-managed PIT data. Good fit for fundamental quant shops and emerging managers under $500M AUM.

Our heuristic: funds under $500M AUM should buy (QuantConnect, WRDS, or similar) and direct engineering capacity toward alpha research. Funds between $500M and $3B should adopt an open-source core (Lean or Nautilus) and customize the data and execution layers. Funds above $3B with multiple strategy pods almost always need to build, because the platform itself becomes a competitive moat — see Two Sigma's Beacon, Citadel's research stack, and Man Group's ArcticDB ecosystem as proof points.

Operational Pattern: A Reference Architecture

Research Backtest Lifecycle

1. Hypothesis & Data Pull. Researcher specifies universe, factors, dates. Platform validates PIT integrity and returns a frozen data snapshot with a content hash. Reproducibility starts here.

2. Vectorized Screening. Parameter sweep across 1,000-50,000 combinations runs on GPU-accelerated vectorized engine. Output: top 50-200 candidate strategies by deflated Sharpe. Wall-clock: 5-30 minutes.

3. Event-Driven Validation. Top candidates re-run through event-driven engine with realistic transaction costs, borrow, and latency. CPCV with 45+ train/test splits. Wall-clock: 2-12 hours on 500-2000 cores.

4. Capacity & Stress Testing. Market impact model estimates capacity. Strategy stressed against 2008, 2020, 2022 regimes and 1000 synthetic paths. Survivors get a memo.

5. Paper Trading. Strategy runs in production engine against live data with no order submission. 30-90 days of paper performance compared against simulated performance on the same dates. Tracking error must be under 15% of strategy volatility.

6. Capital Allocation. Risk committee allocates initial capital, typically 10-20% of target. Scaled up over 60-180 days based on live performance vs. expectation.
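The frozen-snapshot content hash from the first step of the lifecycle can be sketched as follows. It assumes the snapshot is small enough to iterate as sortable row tuples and that the request spec is a JSON-serializable dict. The function name and the choice of SHA-256 are illustrative assumptions.

```python
import hashlib
import json

def snapshot_hash(rows, spec: dict) -> str:
    """SHA-256 over the request spec and the data rows, in a canonical order.

    rows: iterable of tuples such as (symbol, date, value); sorted before
    hashing so that storage or query order does not change the hash.
    """
    h = hashlib.sha256()
    h.update(json.dumps(spec, sort_keys=True).encode())
    for row in sorted(rows):
        h.update(json.dumps(row).encode())
    return h.hexdigest()
```

Storing this hash alongside every backtest result lets a reviewer confirm months later that a re-run used byte-for-byte the same inputs. A single changed price produces a different hash.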

The end-to-end time from idea to first capital, in a well-instrumented research platform, is 6-12 weeks. In a poorly-instrumented one, it is 6-12 months — and the bottleneck is almost always infrastructure, not insight. This connects directly to the ML platform discussion later in this guide, since ML strategies impose additional requirements around feature stores, model versioning, and explainability that the backtest infrastructure must support.

🎯The CTO's Diagnostic
If your firm cannot answer these four questions in under 60 seconds, your backtest platform is constraining your alpha: (1) How many backtests did the firm run last week? (2) What was the median wall-clock time? (3) What fraction of strategies that passed backtest validation are profitable in live trading after 12 months? (4) What is the marginal cost of one additional backtest at today's scale? At well-run quant shops these numbers are dashboards, not research projects.

What to Build in the Next 12 Months

For CTOs and heads of quantitative engineering planning 2026 investment, the priority order we recommend, based on roughly 30 implementations over the past five years:

First, unify the data layer. PIT-vintaged storage with arrival timestamps, a shared instrument master, and corporate-action-aware adjustments. Without this, every downstream investment is built on sand. Budget: $400K-1.2M, 4-9 months.

Second, deploy a managed elastic compute layer. Ray on Anyscale, Dask on Coiled, or AWS Batch with Graviton spot. Target 5,000-20,000 burst cores at a blended cost near the spot rate, on the order of $0.005-0.01 per vCPU-hour at the c7g.16xlarge price quoted earlier. Budget: $200-500K plus ongoing compute spend.

Third, enforce statistical discipline in code. Deflated Sharpe, CPCV, and PBO computed automatically on every backtest. Strategies cannot be promoted without passing thresholds. This is a 2-4 month engineering project that prevents 60%+ of overfit strategies from reaching production.

Fourth, close the research-to-production gap. Unified runtime, deterministic replay, daily simulation-vs-live reconciliation. This is the hardest piece, typically 9-18 months, but it is where realized Sharpe lives or dies.

The funds that have made these four investments — and roughly 40-50 globally have — are operating in a different competitive regime. They test 100x more hypotheses, kill bad strategies 5x faster, and ship live with 30-50% less Sharpe degradation than peers. Backtest infrastructure is no longer a back-office concern; it is the manufacturing capability of a systematic fund.

Frequently Asked Questions

How much should a $1B systematic hedge fund budget annually for backtesting infrastructure?

A reasonable range is $1.5-3.5M all-in, covering cloud compute ($600-1,200K), data vendors with PIT coverage ($500-1,500K), platform licenses or engineering ($300-800K), and ops headcount. Funds spending less are usually compromising on data integrity; funds spending more typically have not yet migrated off on-prem infrastructure.

Should we use vectorized or event-driven backtesting?

Both, in sequence. Use vectorized engines (vectorbt, custom NumPy/JAX) for parameter sweeps and initial screening where you need to evaluate thousands of variants quickly. Promote top candidates to event-driven engines (Nautilus, Lean, or custom) for realistic execution simulation. Single-engine shops invariably either waste compute or ship overfit strategies.

What is the most underrated source of backtest overfitting?

Selection bias in the research process itself. Even with perfect PIT data and CPCV, if 50 researchers each test 200 ideas and you keep the best 10, you have implicitly run 10,000 trials. The deflated Sharpe ratio must be computed against the firm-wide trial count, not the individual researcher's. Few funds track this.

How do we know our backtest infrastructure is good enough?

Run the same strategy through your backtest engine and your live production engine over the past 30 days using identical market data. If the resulting orders differ by more than 2% in count or 5 bps in cumulative PnL, your engines are not aligned and your live performance will diverge from simulated performance unpredictably. This reconciliation should run nightly on a rolling 30-day window and act as a hard gating control.
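The gating rule above can be expressed as a small pass/fail check. The 2% and 5 bps thresholds come from the text. Measuring the PnL gap in basis points of deployed capital is an assumption made here, since the text does not fix the denominator, and the function name is hypothetical.

```python
def reconciliation_passes(live_order_count: int, sim_order_count: int,
                          live_pnl_usd: float, sim_pnl_usd: float,
                          capital_usd: float,
                          max_count_diff: float = 0.02,
                          max_pnl_diff_bps: float = 5.0) -> bool:
    """True when simulated and live runs agree within both tolerances.

    Order-count difference is relative to the live count; the PnL difference
    is expressed in bps of deployed capital (an assumed convention).
    """
    count_diff = abs(sim_order_count - live_order_count) / max(live_order_count, 1)
    pnl_diff_bps = abs(sim_pnl_usd - live_pnl_usd) / capital_usd * 10_000.0
    return count_diff <= max_count_diff and pnl_diff_bps <= max_pnl_diff_bps
```

Wiring this into the nightly job as a hard failure, not a dashboard warning, is what makes it a gating control.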

Is GPU acceleration worth it for backtesting?

For vectorized factor research on equity universes, yes — 80-300x speedups on portfolio construction operations are routine on H100s, and the economics work above roughly 500 backtests per day. For event-driven tick-level simulation, no — branching logic and state mutation do not vectorize well, and CPUs (especially Graviton3 ARM) deliver better price/performance.