A mid-sized systematic equity manager we worked with in 2024 was running roughly 40,000 strategy backtests per month on a 96-core on-prem cluster. Each full-universe US equity backtest over 15 years of minute-bar data took 6-8 hours. Their research throughput was the binding constraint on alpha generation — not ideas, not data, not talent. After re-architecting onto a Ray-based event-driven engine running on AWS Graviton3 spot fleets, the same backtest completed in 11 minutes at roughly 38% of the previous all-in cost. The research team went from testing 3 hypotheses per analyst per week to 25-30. Within nine months, two of the new strategies were in production with combined Sharpe above 1.4.
This is the operational reality of backtesting at modern hedge funds: it is not a Jupyter notebook exercise, it is an HPC discipline. The firms producing durable systematic alpha — Renaissance, Two Sigma, AQR, Man AHL, Citadel's quant pods — treat their simulation infrastructure with the same rigor that semiconductor companies treat EDA toolchains. The infrastructure choices determine what research is even thinkable. This article is the third in our Systematic Alpha guide and assumes you have read the architecture and alternative data pieces.
Event-Driven vs. Vectorized: Choosing the Simulation Paradigm
There are two dominant simulation architectures, and conflating them is the single most common mistake we see in research platform design. Vectorized backtesters (pandas/NumPy-based tools such as vectorbt, or array-style work in QuantConnect's research mode) operate on rectangular arrays of prices and signals. Zipline is sometimes grouped here because of its pandas-based API, but its simulation loop is event-driven. Vectorized engines are 50-200x faster than event-driven systems for simple long-short equity strategies because they replace per-event logic with whole-array operations executed in compiled, BLAS-backed kernels. A 10-year daily backtest on 3,000 names runs in 2-4 seconds. The cost: they cannot model order book dynamics, partial fills, queue position, or stateful risk limits without breaking the vectorization.
Event-driven engines — Nautilus Trader, Lean (QuantConnect's production engine), Backtrader, and most proprietary systems at quant funds — process a chronological event stream: market data tick, signal, order, fill, position update, risk check. Each event mutates state, and the same code path runs in backtest and production. This is the only viable approach for intraday strategies, options market-making, futures roll logic, or anything where the strategy reacts to its own fills. The penalty is throughput: a tick-level event-driven backtest of a single equities strategy over one year can process 4-8 billion events and consume 90 minutes on a single core.
| Dimension | Vectorized | Event-Driven |
|---|---|---|
| Throughput (daily bars, 3k names, 10y) | 2-4 seconds | 4-12 minutes |
| Microstructure realism | Low — close-to-close fills | High — queue, latency, partial fills |
| Production code parity | Re-implement for live | Same engine in backtest and live |
| Suitable horizon | Daily/weekly factor strategies | Intraday, HFT, options MM |
| Typical tools | vectorbt, pandas/NumPy | Nautilus Trader, Lean, Zipline, custom C++/Rust |
| State management | Stateless, functional | Stateful, message-passing |
The pragmatic answer for most multi-strategy shops: run both. Use vectorized engines for factor screening and hyperparameter sweeps where 80% accuracy in 2 seconds beats 99% in 10 minutes. Promote surviving candidates into the event-driven engine for realistic execution simulation and final sign-off. The two engines must share the same data layer and the same instrument master — otherwise you will spend six months debugging why the vectorized backtest shows 1.8 Sharpe and the event-driven shows 0.6.
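The difference between the two paradigms is easiest to see on a toy strategy. The following is a minimal sketch, assuming NumPy and signals expressed as target positions in shares; the function names and the `max_loss` kill-switch are illustrative and not taken from any engine named above. With the kill-switch disabled the two functions return the same PnL. Once the path-dependent rule is enabled, only the event loop can express it without restructuring the computation.

```python
import numpy as np


def vectorized_pnl(prices: np.ndarray, signals: np.ndarray) -> float:
    """Close-to-close PnL: the position chosen at bar t earns the t -> t+1 price change."""
    price_changes = np.diff(prices, axis=0)            # shape (T-1, N)
    return float((signals[:-1] * price_changes).sum())


def event_driven_pnl(prices: np.ndarray, signals: np.ndarray,
                     max_loss: float = float("inf")) -> float:
    """Same strategy as a bar-by-bar loop with a stateful kill-switch.

    Once cumulative PnL falls below -max_loss the strategy flattens and stays
    flat. That rule depends on the strategy's own history, which is what a
    plain array expression cannot capture.
    """
    n_bars, n_names = prices.shape
    position = np.zeros(n_names)
    pnl = 0.0
    halted = False
    for t in range(n_bars):
        if t > 0:
            # Mark the position held over the previous bar to market.
            pnl += float(position @ (prices[t] - prices[t - 1]))
        if pnl < -max_loss:
            halted = True
        # Assumption: orders fill completely at the bar close.
        position = np.zeros(n_names) if halted else signals[t].astype(float)
    return pnl


if __name__ == "__main__":
    prices = np.array([[100.0], [98.0], [97.0], [105.0]])
    signals = np.ones((4, 1))
    print(vectorized_pnl(prices, signals))
    print(event_driven_pnl(prices, signals))
    print(event_driven_pnl(prices, signals, max_loss=2.5))
```

Running both functions against the same data with the stateful rule disabled is also a cheap regression test that the two engines agree before any realism is layered on.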
The Cloud HPC Substrate: Economics and Patterns
On-prem HPC for backtesting peaked around 2018. The economics no longer make sense for any fund below roughly $5B AUM running fewer than 8,000 cores continuously. A typical 1,024-core on-prem cluster with InfiniBand, parallel filesystem (GPFS or Lustre), and 24x7 ops carries fully-loaded costs of $2.8-3.6M per year. The same compute capacity rented as AWS Graviton3 spot instances (c7g.16xlarge at roughly $0.42/hour spot in us-east-1 as of Q1 2026) costs $1.5-1.9M annually at 70% utilization — and scales to 50,000 cores for a 2-hour burst when needed.
The three architectural patterns that work in production:
Pattern 1: Embarrassingly parallel sweeps via Ray or Dask. The research workflow generates a Cartesian product of parameters (lookback windows, signal thresholds, universe slices) and dispatches each combination as an independent task. Anyscale's managed Ray, Coiled's managed Dask, and AWS Batch on Fargate Spot all handle this well. Two Sigma and Man AHL have spoken publicly about Ray-based research clusters with 10,000+ concurrent workers. The throughput math: 50,000 parameter combinations × 90 seconds each = 1,250 core-hours. At the c7g.16xlarge spot price quoted above ($0.42/hour for 64 vCPUs, or about $0.0066 per vCPU-hour), that is roughly $8 of Graviton spot. Run it over lunch.
Pattern 2: GPU-accelerated vectorized backtesting. For factor research on equity universes, the inner loop is matrix multiplication. NVIDIA RAPIDS (cuDF), CuPy, and JAX on A100/H100 deliver 80-300x speedup over CPU pandas for portfolio construction operations. WorldQuant's Brain platform and Qube Research's internal stack lean heavily on GPU acceleration. A typical pattern: keep 5-10 years of daily returns and factor exposures resident in GPU memory (80GB H100 holds the entire Russell 3000 history with room to spare), then sweep alpha combinations in seconds.
Pattern 3: Event-driven scaling via shard-by-instrument or shard-by-time. Event-driven engines cannot trivially parallelize because of state dependencies. The workable shardings: parallelize across instruments (works for single-name strategies, breaks for pair trades and portfolio strategies) or across non-overlapping time windows with state checkpoints at boundaries. Citadel and Jane Street have invested heavily in deterministic, replayable event engines specifically to make time-sharding work without violating causality.
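Pattern 1 can be sketched in a few lines. This is an illustration under stated assumptions, not a cluster deployment: a standard-library thread pool stands in for Ray or Dask workers, `run_backtest` is a hypothetical placeholder that returns a toy score, and `sweep_cost` simply converts task-seconds into core-hours and dollars for a given instance price and vCPU count.

```python
import itertools
from concurrent.futures import ThreadPoolExecutor


def run_backtest(lookback: int, threshold: float, universe: str) -> dict:
    """Hypothetical stand-in for one independent backtest task."""
    score = threshold * 10 - abs(lookback - 60) / 100   # toy objective only
    return {"lookback": lookback, "threshold": threshold,
            "universe": universe, "score": score}


def sweep(grid: dict, max_workers: int = 8) -> list:
    """Expand the parameter grid into its Cartesian product and run every combination."""
    names = list(grid)
    combos = [dict(zip(names, values))
              for values in itertools.product(*grid.values())]
    # On a real cluster each call would be submitted as a remote task instead.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(lambda kwargs: run_backtest(**kwargs), combos))


def sweep_cost(n_tasks: int, seconds_per_task: float,
               instance_price_per_hour: float, vcpus_per_instance: int) -> tuple:
    """Return (core_hours, dollars), assuming one task occupies one vCPU."""
    core_hours = n_tasks * seconds_per_task / 3600
    dollars = core_hours * instance_price_per_hour / vcpus_per_instance
    return core_hours, dollars


if __name__ == "__main__":
    grid = {"lookback": [20, 60, 120],
            "threshold": [0.5, 1.0],
            "universe": ["large", "small"]}
    best = max(sweep(grid), key=lambda r: r["score"])
    print(best)
    print(sweep_cost(50_000, 90, instance_price_per_hour=0.42, vcpus_per_instance=64))
```

Because every task is independent, the only scaling decisions are how many workers to request and how to collect results; spot interruptions cost at most one re-run of a 90-second task.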
The Data Layer: Point-in-Time or It Didn't Happen
Backtest infrastructure without point-in-time (PIT) data discipline is a Sharpe-ratio fabrication machine. Three sins account for roughly 80% of the spurious alpha we see in due-diligence reviews:
Restated fundamentals. Compustat, FactSet, and Refinitiv all restate financials when companies file corrections. A 2024 study by S&P Global Market Intelligence found that 23% of quarterly earnings figures are restated at least once within 18 months. Using the latest restated values in place of the figures as first reported, without PIT vintaging, inflates value-factor backtests by 60-110 bps annualized. Vendors that solve this properly: S&P Compustat Point-in-Time, FactSet Fundamentals PIT, Wharton WRDS PIT snapshots.
Survivorship bias in the security master. Backtests run on the current Russell 3000 implicitly exclude every delisted, bankrupt, and acquired company. The 2000-2002 and 2008-2009 returns of such backtests overstate live performance by 200-400 bps. CRSP and Norgate provide delisting-inclusive databases; any internal security master must carry effective dates and corporate action histories.
Look-ahead in alternative data. Credit card panels, satellite parking-lot counts, and web-scraped data are typically delivered with 2-21 day lags but often timestamped at the underlying event date. A naive join produces a backtest that 'knew' Q3 consumer spending three weeks before it was actually available. The fix — discussed in detail in our alternative data pipelines article — is dual-timestamping every record with event_time and arrival_time, and joining only on arrival_time ≤ as_of.
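The dual-timestamp rule is simple enough to show directly. The sketch below uses only the standard library; the `Observation` record and the `latest_known` helper are hypothetical names chosen for illustration. A restatement is modeled as a second record with the same `event_time` and a later `arrival_time`, so the same filter handles both restated fundamentals and delayed alternative data.

```python
from dataclasses import dataclass
from datetime import date
from typing import Iterable, Optional


@dataclass(frozen=True)
class Observation:
    entity: str
    event_time: date      # when the underlying activity happened
    arrival_time: date    # when the record was actually delivered
    value: float


def latest_known(observations: Iterable[Observation], entity: str,
                 as_of: date) -> Optional[Observation]:
    """Most recent observation for `entity` that had arrived by `as_of`.

    Filtering on arrival_time (never event_time) is what prevents look-ahead.
    Ties on event_time are broken by the latest vintage that had arrived.
    """
    visible = [o for o in observations
               if o.entity == entity and o.arrival_time <= as_of]
    if not visible:
        return None
    return max(visible, key=lambda o: (o.event_time, o.arrival_time))


if __name__ == "__main__":
    history = [
        Observation("ACME", date(2024, 6, 30), date(2024, 7, 18), 0.8),
        Observation("ACME", date(2024, 6, 30), date(2024, 11, 5), 0.7),   # restatement
        Observation("ACME", date(2024, 9, 30), date(2024, 10, 21), 1.0),
    ]
    # On 1 October the Q3 record exists by event date but has not arrived yet.
    print(latest_known(history, "ACME", date(2024, 10, 1)))
```

In a columnar store the same rule becomes an as-of join keyed on `arrival_time`; the important design choice is that `event_time` is never used as the join key.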
On the storage side, the practical choices for tick and bar data: kdb+ remains dominant at the largest quant shops (Citadel, Millennium, DRW) with column-oriented storage and q-language analytics, despite license costs of $25,000-75,000 per core. ArcticDB (Man AHL's open-source columnar store on S3) has gained substantial adoption since its 2023 release — it delivers kdb-like query latencies on commodity object storage at roughly 5% of the cost. Parquet on S3 with DuckDB or Polars query engines covers the long tail. The benchmark to beat: read 100 million ticks for a single name across a year in under 800ms.
Statistical Rigor: Walk-Forward, Purged CV, and the Multiple Testing Problem
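Compute scale is dangerous without statistical discipline. If you run 10,000 backtests, roughly 500 will clear a one-sided 5% significance test by pure chance even when the true Sharpe is zero, and the best of them will look spectacular. How high a Sharpe that noise produces depends on the sample length: under the usual approximation that the standard error of an annualized Sharpe estimate is about 1/√T for T years of data, the 5% threshold is roughly 1.6 on a one-year sample and 0.7 on a five-year sample. Marcos López de Prado has published extensively on this; his deflated Sharpe ratio and probability of backtest overfitting (PBO) metrics should be table stakes.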
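To make the multiple-testing penalty concrete, here is a minimal sketch of the expected maximum Sharpe under the null and of the deflated Sharpe ratio, following the formulas published by Bailey and López de Prado as we understand them. It uses only the standard library. The assumptions are ours: trials are treated as independent, `sharpe` and `sharpe_std` are per-period (not annualized) values, and `sharpe_std` is the dispersion of Sharpe estimates across trials. With 10,000 trials and a dispersion of 1.0, the expected best Sharpe from pure noise comes out close to 3.9.

```python
import math
from statistics import NormalDist

EULER_GAMMA = 0.5772156649015329
_NORMAL = NormalDist()


def expected_max_sharpe(n_trials: int, sharpe_std: float) -> float:
    """Expected best Sharpe among n_trials (>= 2) independent zero-skill strategies."""
    a = _NORMAL.inv_cdf(1 - 1 / n_trials)
    b = _NORMAL.inv_cdf(1 - 1 / (n_trials * math.e))
    return sharpe_std * ((1 - EULER_GAMMA) * a + EULER_GAMMA * b)


def deflated_sharpe(sharpe: float, n_obs: int, n_trials: int, sharpe_std: float,
                    skew: float = 0.0, kurt: float = 3.0) -> float:
    """Probability that the true Sharpe exceeds the chance-level benchmark.

    sharpe     : observed per-period Sharpe of the selected strategy
    n_obs      : number of return observations behind that estimate
    sharpe_std : standard deviation of Sharpe estimates across all trials
    skew, kurt : skewness and (non-excess) kurtosis of the strategy returns
    """
    benchmark = expected_max_sharpe(n_trials, sharpe_std)
    # Variance of the Sharpe estimator, adjusted for non-normal returns.
    adjustment = math.sqrt(1 - skew * sharpe + (kurt - 1) / 4 * sharpe ** 2)
    z = (sharpe - benchmark) * math.sqrt(n_obs - 1) / adjustment
    return _NORMAL.cdf(z)


if __name__ == "__main__":
    print(expected_max_sharpe(10_000, 1.0))
    # Daily Sharpe of 0.10 over 5 years, picked from 1,000 trials.
    print(deflated_sharpe(0.10, n_obs=1260, n_trials=1_000, sharpe_std=0.03))
```

The practical consequence is that the number of trials has to be recorded by the platform, not self-reported by the researcher; without it the benchmark cannot be computed.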
The three validation protocols that actually work at scale:
Walk-forward optimization. Train on a rolling 3-5 year window, test on the next 6-12 months, advance, repeat. The out-of-sample concatenated equity curve is your performance estimate. Walk-forward kills 60-70% of in-sample-fit strategies. It is computationally expensive — a 15-year walk-forward with monthly re-fits is 180 backtests, not one — which is exactly why cloud HPC matters.
Combinatorial Purged Cross-Validation (CPCV). López de Prado's method splits the timeline into N blocks, holds out k of them as test, and purges training observations whose labels overlap with the test window. For N=10, k=2, this yields C(10,2) = 45 train/test splits, which recombine into k/N × C(N,k) = 9 complete backtest paths instead of 1, and produces a distribution of Sharpe ratios rather than a point estimate. Strategies that look great in standard backtests but show 40%+ Sharpe variance across CPCV paths are almost certainly overfit.
Synthetic data and bootstrap validation. Generate 1,000 alternative price paths via block bootstrap or generative models (Wasserstein GANs trained on returns), run the strategy on each, and ask whether real-data performance falls in the top 5%. JPMorgan's Quant Research group and AQR have published on this. The compute is significant — 1,000× your normal backtest cost — but it is the only way to test against regime shifts the historical sample does not contain.
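The split-generation step of CPCV is mechanical and worth seeing in code. This is a minimal sketch using only the standard library. The purge here is a deliberate simplification of ours: it drops a fixed number of observations on each side of every test block, whereas a full implementation purges by label overlap and adds an embargo after each test block. The function names are illustrative.

```python
from itertools import combinations
from math import comb


def cpcv_splits(n_obs: int, n_blocks: int, k_test: int, purge: int = 0):
    """Yield (train_idx, test_idx) for every choice of k_test blocks out of n_blocks."""
    bounds = [round(i * n_obs / n_blocks) for i in range(n_blocks + 1)]
    blocks = [range(bounds[i], bounds[i + 1]) for i in range(n_blocks)]
    for test_ids in combinations(range(n_blocks), k_test):
        test_idx = [i for b in test_ids for i in blocks[b]]
        excluded = set(test_idx)
        for b in test_ids:
            lo, hi = blocks[b].start, blocks[b].stop
            # Simplified purge: drop `purge` observations adjacent to the test block.
            excluded.update(range(max(0, lo - purge), lo))
            excluded.update(range(hi, min(n_obs, hi + purge)))
        train_idx = [i for i in range(n_obs) if i not in excluded]
        yield train_idx, test_idx


def n_backtest_paths(n_blocks: int, k_test: int) -> int:
    """Number of complete out-of-sample paths: k/N * C(N, k)."""
    return comb(n_blocks, k_test) * k_test // n_blocks


if __name__ == "__main__":
    splits = list(cpcv_splits(n_obs=2520, n_blocks=10, k_test=2, purge=5))
    print(len(splits), "splits,", n_backtest_paths(10, 2), "paths")
```

Each split is an independent fit-and-evaluate task, so the 45 splits map directly onto the parallel sweep infrastructure described earlier.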
The Research-to-Production Gap
The most expensive failure mode in systematic investing is the strategy that backtests beautifully and bleeds in production. Post-mortems on these failures rarely identify alpha decay — they almost always find a discrepancy between the simulation and the live environment. Three discrepancies dominate:
Execution model mismatch. The backtest assumed fills at the arrival price with no slippage. Production routes through a smart order router (discussed in our SOR and TCA 2.0 article) that experiences 8-15 bps of implementation shortfall on mid-cap names. A strategy with 25 bps gross alpha per trade and 40x annual turnover trades about $4B per $100M deployed, so that shortfall costs roughly $3-6M a year per $100M and consumes between a third and more than half of the gross alpha.
Latency asymmetry. The backtest processed signal-to-order in zero time. Production has 200μs to 2ms of decision latency, 500μs to 5ms of network latency, and exchange matching engine delays. For strategies with holding periods under 30 minutes, this routinely halves the realized Sharpe.
Universe and constraint drift. The backtest universe was the academic CRSP universe. Production has internal restricted lists, ESG exclusions, prime broker borrow constraints, and counterparty limits. We have seen funds discover that 18% of their backtest's PnL came from names they could not actually trade.
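The execution-cost arithmetic behind the first discrepancy is worth making explicit, because the result is very sensitive to how turnover is defined. The sketch below is a back-of-envelope calculation under our own assumption that turnover is expressed as a multiple of deployed capital traded per year; the function names are illustrative.

```python
def annual_execution_cost(capital: float, turnover_multiple: float,
                          shortfall_bps: float) -> float:
    """Dollars lost per year to implementation shortfall.

    turnover_multiple : traded notional per year divided by deployed capital
                        (40 means $4B traded per $100M deployed)
    shortfall_bps     : average implementation shortfall per dollar traded
    """
    traded_notional = capital * turnover_multiple
    return traded_notional * shortfall_bps / 10_000


def net_alpha_bps(gross_alpha_bps: float, shortfall_bps: float) -> float:
    """Alpha per trade left after execution costs."""
    return gross_alpha_bps - shortfall_bps


if __name__ == "__main__":
    for bps in (8, 10, 15):
        cost = annual_execution_cost(100e6, turnover_multiple=40, shortfall_bps=bps)
        print(bps, "bps shortfall:", cost, "per year;",
              net_alpha_bps(25, bps), "bps net alpha per trade")
```

At 10 bps of shortfall, a 40x turnover multiple gives $4M per $100M deployed, while a 0.4x multiple gives only $40K. Any cost figure quoted in a strategy memo should therefore state the turnover convention it uses.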
"If your backtest engine and your production engine are different codebases, the difference between them is your unhedged research risk. Make them the same code or accept that your Sharpe will degrade 30-50% in live trading."
— Head of Quant Engineering, top-five systematic fund
The solution is unified runtime: one engine, two modes. Nautilus Trader and QuantConnect's Lean were architected this way from inception. Many internal systems achieve the same outcome via a shared C++ core with separate market data adapters for historical replay and live feeds. The deterministic replay test — feed yesterday's market data through the production engine in simulation mode and confirm it produces the exact same orders the live system sent — should be part of nightly CI.
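A deterministic replay check can be sketched as follows. Everything here is a toy stand-in: `Engine`, `Tick`, `Order`, and the mean-reversion rule are hypothetical and exist only to show the shape of the test, which is to replay recorded market data through the same `on_tick` code path and diff the resulting orders against what the live system logged.

```python
from dataclasses import dataclass
from itertools import zip_longest


@dataclass(frozen=True)
class Tick:
    ts: int
    symbol: str
    price: float


@dataclass(frozen=True)
class Order:
    ts: int
    symbol: str
    side: str
    qty: int


class Engine:
    """Toy shared runtime: the same on_tick path serves replay and live modes."""

    def __init__(self, threshold: float):
        self.threshold = threshold
        self.last_price = {}

    def on_tick(self, tick: Tick) -> list:
        prev = self.last_price.get(tick.symbol)
        self.last_price[tick.symbol] = tick.price
        if prev is None:
            return []
        move = tick.price / prev - 1
        if move > self.threshold:          # fade a sharp move up
            return [Order(tick.ts, tick.symbol, "SELL", 100)]
        if move < -self.threshold:         # fade a sharp move down
            return [Order(tick.ts, tick.symbol, "BUY", 100)]
        return []


def replay(engine: Engine, ticks: list) -> list:
    """Feed recorded ticks through the engine in order and collect its orders."""
    orders = []
    for tick in ticks:
        orders.extend(engine.on_tick(tick))
    return orders


def replay_mismatches(replayed: list, live: list) -> list:
    """Return (index, replayed_order, live_order) wherever the two streams differ."""
    return [(i, r, l)
            for i, (r, l) in enumerate(zip_longest(replayed, live))
            if r != l]
```

In CI the assertion is simply that `replay_mismatches` returns an empty list for the previous day's data. Any non-empty result points at nondeterminism (wall-clock reads, unordered containers, unseeded randomness) or at a code path that differs between modes.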
Vendor Landscape and Build-vs-Buy
The build-vs-buy decision for backtesting infrastructure depends primarily on strategy type and AUM.
Our heuristic: funds under $500M AUM should buy (QuantConnect, WRDS, or similar) and direct engineering capacity toward alpha research. Funds between $500M and $3B should adopt an open-source core (Lean or Nautilus) and customize the data and execution layers. Funds above $3B with multiple strategy pods almost always need to build, because the platform itself becomes a competitive moat — see Two Sigma's Beacon, Citadel's research stack, and Man Group's ArcticDB ecosystem as proof points.
Operational Pattern: A Reference Architecture
Researcher specifies universe, factors, dates. Platform validates PIT integrity and returns a frozen data snapshot with a content hash. Reproducibility starts here.
Parameter sweep across 1,000-50,000 combinations runs on GPU-accelerated vectorized engine. Output: top 50-200 candidate strategies by deflated Sharpe. Wall-clock: 5-30 minutes.
Top candidates re-run through event-driven engine with realistic transaction costs, borrow, and latency. CPCV with 45+ train/test splits. Wall-clock: 2-12 hours on 500-2000 cores.
Market impact model estimates capacity. Strategy stressed against 2008, 2020, 2022 regimes and 1000 synthetic paths. Survivors get a memo.
Strategy runs in production engine against live data with no order submission. 30-90 days of paper performance compared against simulated performance on the same dates. Tracking error must be under 15% of strategy volatility.
Risk committee allocates initial capital, typically 10-20% of target. Scaled up over 60-180 days based on live performance vs. expectation.
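The frozen snapshot with a content hash in the first stage is what makes every later stage reproducible. A minimal sketch of the idea, using only the standard library: the request spec and the rows are serialized canonically and hashed with SHA-256. The function name and the JSON representation are our assumptions; a production platform would more likely hash the stored column files than re-serialize rows.

```python
import hashlib
import json


def snapshot_hash(spec: dict, rows: list) -> str:
    """Content hash of a research snapshot, independent of key and row order.

    spec : the researcher's request (universe, factors, dates, as-of date)
    rows : the data returned for that request, as JSON-serializable dicts
    """
    canonical_rows = sorted(json.dumps(row, sort_keys=True, separators=(",", ":"))
                            for row in rows)
    payload = json.dumps({"spec": spec, "rows": canonical_rows},
                         sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


if __name__ == "__main__":
    spec = {"universe": "US_LARGE", "factors": ["value", "momentum"],
            "start": "2010-01-01", "end": "2024-12-31", "as_of": "2025-01-15"}
    rows = [{"date": "2010-01-04", "id": "ABC", "value": 0.12},
            {"date": "2010-01-04", "id": "XYZ", "value": -0.05}]
    print(snapshot_hash(spec, rows))
```

Storing this hash alongside every backtest result means a strategy memo can be tied to the exact data it was built on, and a changed hash on re-run is an immediate signal that the underlying data was revised.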
The end-to-end time from idea to first capital, in a well-instrumented research platform, is 6-12 weeks. In a poorly-instrumented one, it is 6-12 months — and the bottleneck is almost always infrastructure, not insight. This connects directly to the ML platform discussion later in this guide, since ML strategies impose additional requirements around feature stores, model versioning, and explainability that the backtest infrastructure must support.
What to Build in the Next 12 Months
For CTOs and heads of quantitative engineering planning 2026 investment, the priority order we recommend, based on roughly 30 implementations over the past five years:
First, unify the data layer. PIT-vintaged storage with arrival timestamps, a shared instrument master, and corporate-action-aware adjustments. Without this, every downstream investment is built on sand. Budget: $400K-1.2M, 4-9 months.
Second, deploy a managed elastic compute layer. Ray on Anyscale, Dask on Coiled, or AWS Batch with Graviton spot. Target 5,000-20,000 burst cores at a blended cost close to the spot floor; the c7g.16xlarge spot price quoted earlier ($0.42/hour for 64 vCPUs) works out to about $0.0066 per vCPU-hour. Budget: $200-500K plus ongoing compute spend.
Third, enforce statistical discipline in code. Deflated Sharpe, CPCV, and PBO computed automatically on every backtest. Strategies cannot be promoted without passing thresholds. This is a 2-4 month engineering project that prevents 60%+ of overfit strategies from reaching production.
Fourth, close the research-to-production gap. Unified runtime, deterministic replay, daily simulation-vs-live reconciliation. This is the hardest piece, typically 9-18 months, but it is where realized Sharpe lives or dies.
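Enforcing statistical discipline in code, the third priority above, usually takes the form of a promotion gate that runs on every backtest result. The sketch below is illustrative only: the `BacktestStats` and `Gate` names are hypothetical, and the default thresholds are placeholders that each fund would need to calibrate for itself.

```python
from dataclasses import dataclass
from statistics import median


@dataclass(frozen=True)
class BacktestStats:
    deflated_sharpe: float   # probability in [0, 1]
    pbo: float               # probability of backtest overfitting, in [0, 1]
    cpcv_sharpes: tuple      # one Sharpe ratio per CPCV path


@dataclass(frozen=True)
class Gate:
    # Placeholder thresholds, not recommendations.
    min_deflated_sharpe: float = 0.95
    max_pbo: float = 0.20
    min_median_cpcv_sharpe: float = 0.50


def promotion_failures(stats: BacktestStats, gate: Gate = Gate()) -> list:
    """Return the list of failed checks; an empty list means the strategy may be promoted."""
    failures = []
    if stats.deflated_sharpe < gate.min_deflated_sharpe:
        failures.append("deflated Sharpe below threshold")
    if stats.pbo > gate.max_pbo:
        failures.append("PBO above threshold")
    if median(stats.cpcv_sharpes) < gate.min_median_cpcv_sharpe:
        failures.append("median CPCV Sharpe below threshold")
    return failures


if __name__ == "__main__":
    candidate = BacktestStats(deflated_sharpe=0.97, pbo=0.35,
                              cpcv_sharpes=(0.9, 1.1, 0.4, 0.8, 1.3))
    print(promotion_failures(candidate))
```

Returning the full list of failures, rather than a single pass/fail flag, gives the researcher something actionable and gives the platform an audit trail of why each candidate was stopped.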
The funds that have made these four investments — and roughly 40-50 globally have — are operating in a different competitive regime. They test 100x more hypotheses, kill bad strategies 5x faster, and ship live with 30-50% less Sharpe degradation than peers. Backtest infrastructure is no longer a back-office concern; it is the manufacturing capability of a systematic fund.