Hedge Funds — Article 1 of 12

From Monolith to Modular: Hedge Fund Technology Architecture

Hedge funds running on monolithic OMS/PMS stacks face 6-9 month strategy onboarding cycles and brittle vendor lock-in. The shift to event-driven, modular architectures built on Kafka, lakehouse storage, and composable microservices is reshaping how funds from $500M emerging managers to $50B multi-strategy platforms build, test, and deploy alpha.

A $12B multi-strategy hedge fund I worked with in 2024 measured the cost of its monolith precisely: 7.2 months to onboard a new systematic credit strategy, $4.1M in annual maintenance for a single vendor OMS, and 38 person-days per quarter spent reconciling positions between the front office book, the prime broker, and the fund accountant. The portfolio managers wanted to launch four new pods. The CTO told them the stack could absorb one, maybe two. That conversation — repeated across the industry — is why hedge fund architecture is being torn apart and rebuilt around event streams, lakehouse storage, and composable services.

The architectural model that carried hedge funds from 2005 to roughly 2018 — a single integrated OMS/PMS from Charles River, SS&C Eze, Bloomberg AIM, or Enfusion, bolted to a relational database and surrounded by Excel — is now actively constraining alpha. Citadel, Two Sigma, Millennium, and Point72 spent the last six years rewriting their stacks around event-driven cores. Mid-market funds in the $1-10B range are now following, but with a different playbook: they cannot afford 400-engineer platform teams, so they assemble modular architectures from cloud-native vendors, open-source frameworks, and a thin layer of proprietary code where the alpha actually lives.

Why the Monolith Broke

The classic hedge fund monolith fused six concerns into one codebase and one database: order management, position keeping, P&L, risk, compliance, and reporting. This worked when funds traded one or two asset classes at modest volumes. It fails at the modern multi-strategy fund for four measurable reasons.

First, schema rigidity. Adding a new instrument type — a total return swap with a non-standard reset, a crypto perpetual, a private credit tranche — typically requires a vendor change request taking 4-9 months and costing $150K-$800K. Second, batch processing windows. Legacy systems run end-of-day batch cycles for P&L and risk that take 90-180 minutes, which means intraday risk numbers are stale and stress tests cannot be re-run on demand. Third, single-tenant scaling. When a quant team wants to backtest 2,000 signal variants on 15 years of tick data, the production database cannot serve that load without degrading trading. Fourth, change velocity. A typical monolith release cycle is 6-12 weeks; a modern systematic strategy iterates models weekly.

7.2 months: median time to onboard a new asset class on a monolithic OMS at a $5-15B multi-strategy fund (Finantrix advisory benchmarks, 2024-2025, n=23)

The hidden cost is talent. A senior quant researcher earning $700K-$1.5M base plus carry will not stay at a fund where deploying a new factor model requires filing a ticket with a vendor support desk in Mumbai. The architectural decision has become a talent retention decision.

The Reference Architecture: Five Layers, One Event Bus

The modular hedge fund architecture that has emerged across firms like Man Group, Balyasny, and ExodusPoint shares a common shape regardless of fund size. It is organized as five horizontal layers stitched together by a vertical event-streaming backbone — usually Apache Kafka, Confluent Cloud, or Redpanda — through which every state change in the firm flows as an immutable event.

Monolithic vs Modular Hedge Fund Architecture

| Dimension | Monolithic Stack | Modular/Event-Driven Stack |
| --- | --- | --- |
| Order-to-fill latency (equities) | 80-250 ms | 0.8-12 ms |
| New asset class onboarding | 4-9 months | 3-10 weeks |
| Intraday risk recompute | End-of-day batch | Sub-second on event |
| Backtest throughput | 1-2 strategies in parallel | 200-2,000 strategies in parallel |
| Vendor lock-in cost (3-yr switch) | $8-25M | $1.5-4M per replaceable module |
| Release cadence | Quarterly | Daily or continuous |
| Annual TCO ($5B fund) | $18-32M | $11-19M after year 2 |

The five layers are: (1) the market connectivity and execution layer, where FIX engines, exchange gateways, and smart order routers live — increasingly built on open-source FIX libraries like QuickFIX/J or commercial gateways from Itiviti and Fidessa; (2) the trading services layer containing OMS, EMS, allocation, and compliance services as independent microservices; (3) the analytics layer holding pricing engines (Numerix, FINCAD, in-house), risk engines, and the backtester; (4) the data layer combining a time-series store (kdb+, ClickHouse, QuestDB, or Arctic) for tick data with a lakehouse (Databricks or Snowflake on S3/ADLS with Delta or Iceberg) for everything else; and (5) the reporting and investor layer feeding TWAI, Backstop, or in-house dashboards.

The event bus is what makes this work. Every order, fill, position change, market data tick, reference data update, and risk calculation publishes to a Kafka topic. Services subscribe to the topics they need. A real-time P&L service consumes fills and prices; a compliance service consumes orders and pre-trade rule sets; an investor reporting service consumes position snapshots. This is the architectural pattern explored in depth in Real-Time P&L, Greeks, and Exposure for Multi-Asset Portfolios.
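The publish/subscribe shape described above can be sketched without a broker at all. The following is a minimal in-memory stand-in, not a Kafka client: the `InMemoryBus` and `PnLService` classes, the topic names `fills`, `prices`, and `orders`, and the payload fields are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable, DefaultDict, Dict, List


@dataclass(frozen=True)
class Event:
    topic: str      # hypothetical topic name, e.g. "fills" or "prices"
    payload: dict


class InMemoryBus:
    """Toy stand-in for a Kafka-style broker: each topic fans out to its subscribers."""

    def __init__(self) -> None:
        self._subscribers: DefaultDict[str, List[Callable[[Event], None]]] = defaultdict(list)
        self.log: List[Event] = []  # append-only record of everything published

    def subscribe(self, topic: str, handler: Callable[[Event], None]) -> None:
        self._subscribers[topic].append(handler)

    def publish(self, event: Event) -> None:
        self.log.append(event)
        for handler in self._subscribers[event.topic]:
            handler(event)


class PnLService:
    """Consumes fills and prices; tracks position, cash, and mark-to-market P&L."""

    def __init__(self, bus: InMemoryBus) -> None:
        self.position: Dict[str, float] = defaultdict(float)
        self.cash: Dict[str, float] = defaultdict(float)
        self.last_price: Dict[str, float] = {}
        bus.subscribe("fills", self.on_fill)
        bus.subscribe("prices", self.on_price)

    def on_fill(self, event: Event) -> None:
        p = event.payload
        self.position[p["symbol"]] += p["qty"]           # signed quantity
        self.cash[p["symbol"]] -= p["qty"] * p["price"]  # buys spend cash

    def on_price(self, event: Event) -> None:
        self.last_price[event.payload["symbol"]] = event.payload["price"]

    def pnl(self, symbol: str) -> float:
        return self.cash[symbol] + self.position[symbol] * self.last_price[symbol]


bus = InMemoryBus()
pnl_service = PnLService(bus)
bus.publish(Event("fills", {"symbol": "ABC", "qty": 100, "price": 50.0}))
bus.publish(Event("prices", {"symbol": "ABC", "price": 51.5}))
bus.publish(Event("orders", {"symbol": "ABC", "qty": 10}))  # no subscriber yet, still logged
print(pnl_service.pnl("ABC"))  # 150.0
```

The P&L service never calls the OMS directly; it only reacts to topics. A compliance or reporting service could be added later by subscribing to the same bus without touching existing publishers, which is the decoupling the pattern is meant to buy.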

🔍 The 'event sourcing' dividend
When every state change is an immutable event on Kafka with infinite retention (or tiered to S3), you get four capabilities for free: full audit trail for SEC Rule 17a-4 and MiFID II record-keeping, time-travel debugging of any trading decision, replay-based testing of new services against historical event streams, and disaster recovery by replaying events into a cold standby. Funds that retrofitted event sourcing post-MiFID II reported reducing regulatory inquiry response time from 6-8 weeks to 2-4 days.
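As a rough illustration of the replay and time-travel capabilities, the sketch below folds an ordered event log into positions, optionally stopping at a past offset. The `FillEvent` type, its fields, and the sample data are assumptions made for this example; a real log would carry far richer events.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Optional


@dataclass(frozen=True)
class FillEvent:
    seq: int     # monotonically increasing offset in the log
    symbol: str
    qty: int     # signed: positive buys, negative sells


def replay_positions(log: Iterable[FillEvent], as_of_seq: Optional[int] = None) -> Dict[str, int]:
    """Fold the log into positions; assumes events arrive ordered by seq."""
    positions: Dict[str, int] = {}
    for event in log:
        if as_of_seq is not None and event.seq > as_of_seq:
            break  # stop at the requested point in history
        positions[event.symbol] = positions.get(event.symbol, 0) + event.qty
    return positions


log: List[FillEvent] = [
    FillEvent(1, "ABC", 100),
    FillEvent(2, "XYZ", -50),
    FillEvent(3, "ABC", -40),
]

current = replay_positions(log)                # state after every event
as_of_2 = replay_positions(log, as_of_seq=2)   # state as it stood after event 2
print(current)   # {'ABC': 60, 'XYZ': -50}
print(as_of_2)   # {'ABC': 100, 'XYZ': -50}
```

Because state is derived rather than stored, the same function serves audit queries, debugging, and rebuilding a standby: only the stopping point changes.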

Microservices, Modular Monoliths, and the Granularity Question

Not every fund should decompose into 200 microservices. The Netflix-style microservices pattern that dominated 2017-2021 architectural thinking has been tempered by hard experience: distributed systems introduce latency, debugging complexity, and on-call burden that small platform teams cannot sustain. The current consensus among practitioners at funds in the $1-20B range is a modular monolith for the trading core — a single deployable binary internally organized into bounded contexts — paired with separate services for analytics, research, and reporting. This trade-off is dissected in Microservices vs Modular Core Architecture.

The reason is latency. A microservices OMS where order validation, risk check, position update, and FIX serialization each cross a network boundary adds 2-8 ms per hop, or 8-32 ms across those four hops. For a stat-arb strategy with a 30 ms total budget, that architecture is dead on arrival: network overhead alone can consume the entire budget. A modular monolith keeps the hot path in-process, typically achieving 200-800 microseconds order-to-wire, while still allowing the risk module to be developed, tested, and reasoned about independently.
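The latency argument is simple arithmetic, sketched here using only the figures quoted in the paragraph above (four hops, 2-8 ms per hop, a 30 ms budget). The variable and function names are illustrative.

```python
# Figures taken from the text; names are illustrative only.
HOPS = ["order validation", "risk check", "position update", "FIX serialization"]
PER_HOP_MS = (2.0, 8.0)   # network cost per hop (low, high)
BUDGET_MS = 30.0          # total budget for the stat-arb strategy


def network_overhead_ms(hops, per_hop_range):
    """Return (low, high) total network overhead when every hop crosses the network."""
    low, high = per_hop_range
    return len(hops) * low, len(hops) * high


low, high = network_overhead_ms(HOPS, PER_HOP_MS)
share_low, share_high = low / BUDGET_MS, high / BUDGET_MS
print(f"{low:.0f}-{high:.0f} ms of network overhead, "
      f"{share_low:.0%}-{share_high:.0%} of a {BUDGET_MS:.0f} ms budget")
# 8-32 ms of network overhead, 27%-107% of a 30 ms budget
```

Even at the optimistic end, more than a quarter of the budget goes to the network before any strategy logic runs; at the pessimistic end the budget is already exceeded.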

"We kept the order path as one binary running on bare metal in Equinix NY4. Everything else (backtesting, TCA, investor reporting, reconciliation) runs as separate services on Kubernetes in AWS. The split is latency-driven, not ideology-driven."
Head of Engineering, $7B systematic equity fund

Where true microservices earn their keep is in the research and analytics layers. A backtesting service that can scale to 4,000 vCPUs on AWS Batch for a Friday afternoon factor sweep and scale to zero overnight is genuinely cheaper and faster than a fixed cluster. See Backtesting at Scale — Cloud HPC and Event-Driven Simulation for the operational economics.

The Data Substrate: Lakehouse Plus Time-Series

The data layer is where most modernization programs succeed or fail. The dominant pattern that has emerged is a two-store architecture. A specialized time-series database handles tick data, order book snapshots, and intraday market data — kdb+ remains the incumbent at Citadel, Goldman, and Susquehanna, but ClickHouse, QuestDB, and Arctic (open-sourced by Man AHL) are taking share at funds unwilling to pay $40K-$120K per core for kdb+ licenses. A lakehouse — Databricks on Delta Lake or Snowflake with Iceberg — handles everything else: reference data, fundamentals, alternative data, research outputs, and historical positions.

The lakehouse pattern won over the data warehouse for three reasons specific to hedge funds. First, schema evolution: alternative data vendors deliver inconsistent schemas, and Iceberg/Delta handle schema drift without breaking downstream consumers. Second, compute-storage separation lets the research team spin up a 500-node Spark cluster for a backtest without paying for it 24/7. Third, the same tables are readable by Python (Polars, DuckDB), SQL (Snowflake, Athena), and Spark — which matches how hybrid quant/fundamental teams actually work. The full pattern is covered in Data Lakehouse for Asset Managers.

💡 Did You Know?
Man Group open-sourced Arctic in 2015 specifically because kdb+ licensing costs were scaling faster than their AUM. Arctic now stores petabytes of tick data at funds including Man AHL itself and is the basis of ArcticDB, which AQR and several Tier-1 banks have adopted in production.

The pipeline pattern that connects these stores is consistent: raw data lands in S3 or ADLS via Airbyte, Fivetran, or vendor-specific connectors; dbt or Spark jobs transform into curated Iceberg/Delta tables; high-frequency time-series flows directly into kdb+/ClickHouse via Kafka Connect. The architectural details for ingesting credit card, geo-location, and satellite feeds are covered in Alternative Data Pipelines.

Build, Buy, or Compose: The Vendor Decision Matrix

Few hedge funds today build everything from scratch — even Renaissance and Two Sigma use vendor components for areas outside their alpha edge. The decision framework that works is: build where you have proprietary edge, buy commodity infrastructure, and compose the seams. For most funds under $20B, the OMS/EMS, accounting, and reporting layers are buy decisions. The research platform, backtester, and signal library are build decisions. The connective tissue — event bus, data lake, orchestration — is composed from open-source and managed cloud services.

Vendor Landscape by Layer (2025-2026)
OMS/EMS (cloud native): Enfusion (acquisition by Clearwater announced Jan 2025, $1.5B), Limina, Genesis, Hydrogen. Multi-tenant SaaS, REST/GraphQL APIs, sub-100ms for non-HFT use cases.
OMS/EMS (enterprise): Charles River (State Street), Bloomberg AIM, BlackRock Aladdin, SimCorp (acquired by Deutsche Börse 2023). Heavier footprint, deeper compliance, slower to integrate.
Risk & Pricing: Numerix Oneview, FINCAD, MSCI RiskMetrics, Bloomberg MARS, Imagine. Increasingly headless via API rather than wrapped in their own UIs.
Data Platform: Snowflake, Databricks, Starburst, Dremio for analytics; kdb+, ClickHouse, QuestDB, ArcticDB for time-series; Confluent, Redpanda, AWS MSK for streaming.
Fund Accounting & IBOR: Enfusion, Geneva (SS&C), Advent (SS&C), FundCount, Northern Trust Omnium. The IBOR-ABOR reconciliation is a persistent integration challenge.
Research Platform: Beacon Platform, WSL, Deephaven, JupyterHub on EKS, Hex (standalone or paired with Snowflake). Most large funds build proprietary research IDEs on top of these.

The acquisition wave of 2023-2025 reshaped this map. SS&C now owns Eze, Advent, Geneva, and Black Diamond. Clearwater announced its acquisition of Enfusion in January 2025 to combine investment accounting with the front office. Deutsche Börse bought SimCorp for €3.9B in 2023. This consolidation matters for architectural decisions because it concentrates roadmap risk: a fund standardizing on Eze today is effectively betting on SS&C's product priorities for the next decade.

⚠️ The integration tax is the real cost
License fees are the visible cost; integration is the invisible one. Connecting a vendor OMS, a vendor risk system, a vendor accounting system, and three market data feeds typically requires 3-5 FTE-years of engineering and produces 40-80 brittle ETL jobs. Funds that adopted a single event-streaming backbone with a canonical trade schema reported reducing integration code by 55-70% and cutting reconciliation breaks by 60-75%.
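To make the idea of a canonical trade schema concrete, here is a minimal sketch in which two differently shaped feeds are normalized into one record type. Both input formats (`from_oms` with a signed quantity and epoch milliseconds, `from_prime_broker` with a B/S flag and ISO-8601 timestamp) are hypothetical and do not describe any real vendor's layout.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from decimal import Decimal


@dataclass(frozen=True)
class CanonicalTrade:
    trade_id: str
    symbol: str
    side: str              # "BUY" or "SELL"
    quantity: Decimal      # always positive; direction lives in `side`
    price: Decimal
    executed_at: datetime  # always timezone-aware UTC
    source: str


def from_oms(row: dict) -> CanonicalTrade:
    """Hypothetical vendor OMS row: signed quantity, epoch milliseconds."""
    qty = Decimal(str(row["SignedQty"]))
    return CanonicalTrade(
        trade_id=f"oms-{row['OrderId']}",
        symbol=row["Ticker"].upper(),
        side="BUY" if qty > 0 else "SELL",
        quantity=abs(qty),
        price=Decimal(str(row["Px"])),
        executed_at=datetime.fromtimestamp(row["TsMillis"] / 1000, tz=timezone.utc),
        source="oms",
    )


def from_prime_broker(row: dict) -> CanonicalTrade:
    """Hypothetical prime broker row: B/S flag, ISO-8601 timestamp with offset."""
    return CanonicalTrade(
        trade_id=f"pb-{row['ref']}",
        symbol=row["sym"].upper(),
        side={"B": "BUY", "S": "SELL"}[row["bs"]],
        quantity=Decimal(row["qty"]),
        price=Decimal(row["price"]),
        executed_at=datetime.fromisoformat(row["time"]).astimezone(timezone.utc),
        source="prime_broker",
    )


def economics(trade: CanonicalTrade) -> tuple:
    """The fields that must agree across sources for the same execution."""
    return (trade.symbol, trade.side, trade.quantity, trade.price, trade.executed_at)


oms_trade = from_oms({"OrderId": 42, "Ticker": "abc", "SignedQty": -250,
                      "Px": 101.25, "TsMillis": 1_700_000_000_000})
pb_trade = from_prime_broker({"ref": "9001", "sym": "ABC", "bs": "S", "qty": "250",
                              "price": "101.25", "time": "2023-11-14T22:13:20+00:00"})
print(economics(oms_trade) == economics(pb_trade))  # True
```

Each adapter is the only place that knows a source's quirks. Downstream services compare and aggregate `CanonicalTrade` records directly, which is where the reduction in pairwise ETL jobs would come from.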

A 24-Month Migration Roadmap

No fund successfully replaces its core stack in a big-bang cutover. The pattern that works is the strangler-fig migration: stand up the new architecture alongside the monolith, route new asset classes or new strategies to the new stack first, and incrementally migrate existing flows as services prove out. Across a dozen implementations, the timeline below is roughly representative for a $3-10B fund with a 15-30 person technology team.

Phased Migration from Monolith to Modular
1. Months 0-3: Event Backbone & Canonical Schema

Deploy Kafka or Confluent Cloud. Define canonical trade, position, and market data schemas in Avro or Protobuf. Stand up the lakehouse (Snowflake or Databricks) and time-series store. No business processes change yet; this is plumbing.

2. Months 3-9: Shadow the Monolith

Tap the existing OMS to publish every order, fill, and position to Kafka. Build a real-time P&L service, a reconciliation service, and a research data feed off the event stream. The monolith remains source of truth; the new stack runs in shadow mode and is validated against it daily.

3. Months 9-15: First Strategy on New Stack

Onboard one new strategy (typically a new asset class or a new pod) entirely on the modular stack. Trading, risk, P&L, and reporting all flow through the new services. The monolith handles the rest. This is the proving ground.

4. Months 15-21: Migrate Existing Strategies

Move strategies one at a time, starting with the simplest. Run parallel for 4-6 weeks per strategy to validate fills, P&L, and risk match. Decommission monolith modules as they go dark.

5. Months 21-24: Cutover & Decommission

Final strategies migrated. Monolith retained read-only for historical queries and regulatory archive. Vendor contract renegotiated or terminated. Engineering team reorganized around services, not vendor modules.

The shadow phase is the most important and the most skipped. Funds that ran the new stack in shadow against the monolith for 90+ days caught an average of 140-220 schema, rounding, and corporate-action edge cases before they touched production trading. Funds that skipped it averaged 8-14 production incidents in the first quarter after cutover.
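A daily shadow validation can be as simple as comparing the monolith's numbers against the new stack's and listing the breaks. The sketch below assumes per-symbol P&L keyed by ticker and a one-cent tolerance; the `reconcile` function, the `Break` record, and the sample figures are all made up for illustration.

```python
from decimal import Decimal
from typing import Dict, List, NamedTuple


class Break(NamedTuple):
    key: str
    monolith: Decimal
    shadow: Decimal
    reason: str


def reconcile(monolith: Dict[str, Decimal], shadow: Dict[str, Decimal],
              tolerance: Decimal = Decimal("0.01")) -> List[Break]:
    """List every key where the two systems disagree beyond the tolerance."""
    breaks: List[Break] = []
    for key in sorted(set(monolith) | set(shadow)):
        if key not in shadow:
            breaks.append(Break(key, monolith[key], Decimal("0"), "missing in shadow"))
        elif key not in monolith:
            breaks.append(Break(key, Decimal("0"), shadow[key], "missing in monolith"))
        elif abs(monolith[key] - shadow[key]) > tolerance:
            breaks.append(Break(key, monolith[key], shadow[key], "value mismatch"))
    return breaks


# Illustrative end-of-day P&L from each system (fabricated sample values).
monolith_pnl = {"ABC": Decimal("1520.10"), "XYZ": Decimal("-310.00"), "DEF": Decimal("75.50")}
shadow_pnl = {"ABC": Decimal("1520.105"), "XYZ": Decimal("-312.40"), "GHI": Decimal("12.00")}

breaks = reconcile(monolith_pnl, shadow_pnl)
for b in breaks:
    print(b.key, b.reason)
# DEF missing in shadow
# GHI missing in monolith
# XYZ value mismatch
```

The half-cent difference on ABC falls inside the tolerance and is not reported, which is the kind of rounding behavior a shadow period is meant to surface and settle before cutover.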

Governance, Observability, and the Operating Model

Architecture is half the answer; the operating model is the other half. A modular stack with 30 services and three teams that don't communicate is worse than a monolith. The funds that have made this transition successfully built three governance disciplines: a platform team that owns the event bus, data contracts, and shared services as products with SLAs; a service ownership model where each microservice has a named owning team responsible for on-call; and a change advisory process for the trading-critical path that is faster than the legacy CAB but stricter than a typical SaaS deployment.

Observability deserves particular emphasis. In a monolith, you can attach a debugger to one process. In a modular stack, when a fill is mispriced, you need distributed tracing (OpenTelemetry is now standard) to follow the event through the OMS, the pricing service, the risk service, and the P&L service. Funds without this end up with 4-8 hour incident resolution times instead of 20-40 minutes.
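The core mechanism behind distributed tracing is context propagation: each service records a span and forwards the trace identifier on the outgoing event. The sketch below shows that mechanism in plain Python. It is not the OpenTelemetry API; the `Tracer` and `Span` classes and the header names `trace_id` and `span_id` are hypothetical.

```python
import uuid
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Span:
    trace_id: str             # shared by every hop that handles the same fill
    span_id: str              # unique to this hop
    parent_id: Optional[str]  # span_id of the upstream hop, if any
    service: str


@dataclass
class Tracer:
    spans: List[Span] = field(default_factory=list)

    def start_span(self, service: str, headers: Optional[Dict[str, str]] = None) -> Dict[str, str]:
        """Record a span and return the headers to attach to the outgoing event."""
        headers = headers or {}
        trace_id = headers.get("trace_id") or uuid.uuid4().hex
        span = Span(trace_id, uuid.uuid4().hex[:16], headers.get("span_id"), service)
        self.spans.append(span)
        return {"trace_id": span.trace_id, "span_id": span.span_id}

    def path(self, trace_id: str) -> List[str]:
        """Services a single trace passed through, in recorded order."""
        return [s.service for s in self.spans if s.trace_id == trace_id]


tracer = Tracer()
headers = tracer.start_span("oms")                 # the fill enters the system
for service in ("pricing", "risk", "pnl"):
    headers = tracer.start_span(service, headers)  # each hop forwards the context

unrelated = tracer.start_span("oms")               # a different fill gets its own trace
print(tracer.path(headers["trace_id"]))            # ['oms', 'pricing', 'risk', 'pnl']
```

With the trace identifier carried on every event, finding where a fill was mispriced becomes a lookup by one ID rather than a search across four services' logs.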

The architectural decision a hedge fund CTO makes today is not about technology — it is about how many new strategies the fund will be able to launch in the next five years. Monoliths cap that number. Modular architectures uncap it.

Finantrix advisory practice

What This Series Will Cover

The remaining eleven articles in this series go deep on the components that hang off the architecture described here. They cover alternative data pipelines, cloud-scale backtesting, multi-asset real-time risk, execution and TCA 2.0, expected shortfall and tail risk, prime brokerage reconciliation, locate management for hard-to-borrow names, no-code regulatory reporting, generative AI for investor letters, source-code-level cybersecurity, and the machine learning platform that ties research to production. Each of them assumes the architectural foundation in this article: event-driven, modular, with a lakehouse data substrate. Funds that get the foundation right find each subsequent capability 3-5x cheaper and faster to deploy. Funds that don't will spend the next decade fighting their own stack.

The competitive gap between funds with modern architectures and those without is widening measurably. In our 2025 benchmarking across 41 hedge funds, top-quartile firms by architectural maturity launched 2.3x more new strategies per year, recovered from production incidents 4.1x faster, and devoted a 28% smaller share of their technology budget to maintenance rather than innovation. That gap compounds. The CIOs and CTOs reading this guide are, in effect, deciding which side of that gap their firm sits on for the rest of the decade.

Frequently Asked Questions

Should a $1-2B emerging hedge fund attempt a modular architecture, or stick with a single vendor platform like Enfusion?

For most emerging managers under $2B, the right answer is a cloud-native vendor platform (Enfusion, Limina, or similar) with API-first integration points, not a full modular build. The pivot point is typically around $3-5B AUM or when the fund runs more than two distinct strategies. Below that, the engineering cost of a modular stack exceeds its benefits; above it, vendor constraints start to cap strategy launches.

How does Kafka compare to alternatives like Pulsar, Redpanda, or AWS Kinesis for hedge fund event streaming?

Kafka (via Confluent or self-managed) remains the default due to ecosystem maturity and Schema Registry integration. Redpanda offers 5-10x better latency at the cost of a younger ecosystem and is gaining traction at latency-sensitive funds. Pulsar's multi-tenancy is rarely needed at a single fund. Kinesis is fine for non-critical paths but lacks the schema governance and tooling that production trading requires.

What is the realistic budget for a 24-month monolith-to-modular migration at a $5B fund?

Total program cost typically runs $9-16M over 24 months: roughly $4-7M in incremental engineering (10-15 additional FTEs or equivalent contractors), $2-4M in cloud and software, $1-2M in vendor change fees and parallel-running costs, and $2-3M in implementation partner fees. Annual run-rate savings post-migration are typically $4-8M, so payback is typically 2-3 years, not including the strategic value of faster strategy onboarding.
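The budget ranges above can be checked with a few lines of arithmetic. This sketch uses only the figures quoted in the answer (in $M); the simple payback formula, total cost divided by annual savings, is an assumption that ignores discounting and ramp-up.

```python
# All figures in $M, taken from the ranges quoted above as (low, high).
components_musd = {
    "incremental engineering": (4, 7),
    "cloud and software": (2, 4),
    "vendor change fees and parallel running": (1, 2),
    "implementation partner fees": (2, 3),
}
annual_savings_musd = (4, 8)

total_low = sum(low for low, _ in components_musd.values())
total_high = sum(high for _, high in components_musd.values())

# Simple payback: total cost / annual savings (no discounting; an assumption).
best_case_years = total_low / annual_savings_musd[1]    # cheapest program, largest savings
worst_case_years = total_high / annual_savings_musd[0]  # costliest program, smallest savings
midpoint_years = ((total_low + total_high) / 2) / (sum(annual_savings_musd) / 2)

print(f"Total program cost: ${total_low}-{total_high}M")
print(f"Payback: {best_case_years:.1f} to {worst_case_years:.1f} years, "
      f"about {midpoint_years:.1f} at the midpoints")
# Total program cost: $9-16M
# Payback: 1.1 to 4.0 years, about 2.1 at the midpoints
```

The components sum to the stated $9-16M. The extremes of the ranges bracket the 2-3 year figure, which is best read as a central estimate rather than a bound.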

Does a modular architecture make the firm more or less exposed to cyber risk?

More attack surface, but better blast-radius containment if done correctly. A monolith is one breach away from total compromise; a modular stack with proper service boundaries, secrets management (Vault, AWS Secrets Manager), and zero-trust networking can isolate an incident to one service. The risk shifts from 'one big perimeter' to 'discipline across many small ones.' This is covered in detail in Article 11 of this series.

How do the SEC's 2023-2024 cybersecurity rules and Form PF amendments affect architectural decisions?

The amended Form PF (effective for fiscal quarters beginning 2024) requires reporting of certain trigger events within 72 hours, and the SEC's cybersecurity rules require disclosure of material incidents. Neither mandates a particular architecture, but both are far easier to satisfy with an event-sourced design whose immutable audit logs can produce defensible evidence quickly. Funds still on batch-oriented monoliths are finding it operationally hard to meet the 72-hour clock without extensive manual work.