Walk into the actuarial department of a mid-size U.S. life carrier and you will typically find three things: a Prophet or AXIS grid running on Azure or AWS, a shared drive containing 8-12 years of quarterly experience study spreadsheets, and at least one senior actuary who knows where the bodies are buried in the data feeds from the 1990s-era policy admin system. The model is sophisticated. The data pipeline feeding it usually is not.
For VM-20 principle-based reserves, GAAP LDTI (ASU 2018-12), Solvency II equivalents, and internal pricing, the assumption set — mortality, lapse, persistency, partial withdrawal, premium persistency, dynamic policyholder behavior — drives reserves and capital that can swing tens of millions of dollars on a single basis-point change. The data warehouse layer beneath those assumptions is now the binding constraint on model credibility, audit defensibility, and the speed at which actuarial teams can respond to emerging experience. This article covers what a modern actuarial data warehouse looks like, where carriers are landing on cloud platform choices, and the specific pipelines required to support mortality, lapse, and persistency studies at production grade.
Why the Old Architecture Is Failing
The legacy pattern at most carriers looks like this: the policy admin system (LifePRO, wmA, ALIP, Vantage, CyberLife, or a homegrown COBOL platform) produces a nightly or monthly extract. That extract is FTP'd to an actuarial server, transformed by SAS or Python scripts maintained by a single actuarial systems analyst, joined to a death claims feed from a separate claims module, and loaded into a model point file for the actuarial projection software. Each transformation is undocumented, version control is informal, and reconciliation to the general ledger is performed manually at quarter-end.
Three forces are breaking this model. First, VM-20 and LDTI require granular, auditable experience data with a clear lineage from raw policy records to assumption inputs — the NAIC Valuation Manual specifically calls for documentation of data quality reviews under VM-31. Second, machine learning approaches to lapse and dynamic policyholder behavior (gradient-boosted models, survival analysis with time-varying covariates) demand far richer feature sets than the 15-20 fields a typical model point file carries. Third, M&A activity in the closed-block space — Global Atlantic, Resolution Life, Venerable, Constellation — means actuarial teams now routinely inherit 4-7 source systems and need to integrate experience across them within 12 months of a deal close.
Reference Architecture for an Actuarial Data Warehouse
The architecture converging across carriers separates four layers: ingestion, a raw or bronze policy history store, a curated actuarial layer with conformed dimensions, and consumption-specific marts for experience studies, model point generation, and assumption monitoring. Snowflake and Databricks dominate the platform choice for greenfield builds; Azure Synapse appears at carriers already heavily committed to Microsoft, and a small number of large mutuals (Northwestern Mutual, MassMutual, New York Life) operate hybrid on-prem Teradata or Exadata environments alongside cloud lakes.
The ingestion layer typically uses Fivetran, Qlik Replicate, or Informatica IDMC to perform change data capture from the policy admin source. For mainframe sources, IBM InfoSphere CDC or Precisely Connect handle the EBCDIC-to-UTF8 conversion and copybook parsing. The raw layer stores every record version with effective-dated history — this is non-negotiable because experience studies must be able to reconstruct policy state as of any prior valuation date. A common rule of thumb: retain 25+ years of policy-level history for whole life and 15+ for term and annuity blocks, which aligns with the developed mortality study horizons used by the SOA Individual Life Experience Committee.
| Platform | Strengths | Common Use | Typical Annual Cost (mid-size carrier) |
|---|---|---|---|
| Snowflake | Separation of compute and storage, time travel for as-of reporting, strong governance via Horizon | Experience studies, assumption monitoring, Prophet/AXIS feeds | $1.2M-$3.5M |
| Databricks | Native ML for lapse models, Delta Lake versioning, MLflow integration | Predictive lapse, dynamic policyholder behavior, GenAI on policyholder text | $1.5M-$4M |
| Azure Synapse / Microsoft Fabric | Tight integration with Power BI, Purview lineage, Azure-native security | Carriers standardized on Microsoft stack | $800K-$2.5M |
| On-prem Teradata / Exadata | Predictable performance, existing skills, no data egress concerns | Large mutuals with established environments | $3M-$8M (including hardware refresh) |
The curated actuarial layer is where the discipline pays off. It contains conformed dimensions for policy, coverage, life insured, agent, product, and reinsurance treaty, plus fact tables for policy month-end status, transactions (premium, withdrawal, loan, dividend), and events (lapse, surrender, death, reinstatement). Effective-dated joins are standard. This layer should reconcile to the general ledger reserve roll-forward within a tolerance of 0.05% — anything wider and the assumption studies built on top will not survive an external audit under the Model Audit Rule.
Mortality Experience Studies: The Data Requirements
A defensible mortality study requires exposure measured in life-years (or amount-years for face-amount-weighted studies) and deaths attributed to the same exposure window. The data warehouse must produce, for every policy month, the in-force status, attained age, duration since issue, smoker status, underwriting class, face amount band, distribution channel, and any rider flags. For studies feeding VM-20 deterministic and stochastic reserves, the granularity needs to support the SOA 2015 VBT structure (relative risk 60-160% slopes) and the 2017 CSO valuation table mapping.
Where carriers get into trouble is death reporting lag and IBNR. Death claims arrive 30-180 days after the actual date of death; lapses initiated by the policyholder are sometimes later overturned as deaths when a claim arrives. A production-grade warehouse handles this with a 'completion factor' triangle calculated monthly, plus a reopener process that retroactively reclassifies prior lapses as deaths when matching claims arrive. The Society of Actuaries Individual Life Experience Committee documents typical reporting lags of 4-8 months for the 95th percentile of claims; experience studies that don't adjust for this systematically understate recent-duration mortality by 15-25%.
Lapse and Persistency: From Static Tables to Predictive Models
Lapse modeling has changed more in the past five years than in the prior thirty. Traditional approaches produced lapse rate tables by duration and product, sometimes with a shock at the end of the surrender charge period. For variable and indexed annuities, dynamic lapse formulas adjusted base lapses based on the moneyness of guaranteed living benefits (GLWB, GMWB, GMAB). These approaches are still the regulatory backbone — VM-21 for variable annuities and VM-22 (effective 1/1/2026) for fixed annuities specify the framework — but carriers are increasingly running parallel ML models for pricing, in-force management, and early-warning monitoring.
The feature set required for a competitive lapse model goes well beyond what a typical model point carries. At one large indexed annuity carrier I worked with, the production model uses 87 features including: contract age in months, attained age, surrender charge percentage remaining, account value relative to premium, rider charge as percent of account value, recent index credit relative to peer products, distribution channel (career, IMO, bank, wirehouse), agent tenure, agent's book-level lapse rate, ZIP-code-level demographic overlays, and 24-month rolling interaction history (statement views, call center contacts, address changes). The data warehouse feeds this through a Databricks feature store that maintains point-in-time correctness — critical to avoid leakage when training on historical data.
Persistency for life insurance follows a different pattern. Whole life and universal life lapses cluster in years 1-3 (early lapse, often agent-driven), then settle into a long shallow tail. Term lapses spike at the end of the level premium period (year 10, 15, 20, or 30) when premiums step up to annually-renewable rates — Milliman's published studies show post-level-term lapse rates of 60-90% for 20-year term, with anti-selective mortality on persisters running 200-400% of the underlying VBT. Modeling this requires the warehouse to carry the full premium schedule, current and renewal premium amounts, and the conversion option terms by product. Most policy admin systems store this in product configuration tables that need to be conformed across acquired blocks — a problem the team covered in the policy administration article in this guide.
Feeding the Models: Prophet, AXIS, MG-ALFA, and RAFM Integration
The actuarial projection systems — FIS Prophet, Moody's AXIS, Milliman MG-ALFA, and Aon's PathWise — each have their own model point formats and assumption table structures. A modern warehouse generates these as views or extracts rather than as standalone batch jobs. Prophet, the most widely deployed in U.S. life, accepts model points as fixed-width or CSV with column ordering matching the product workspace definition; AXIS uses its own structured input format with embedded validation. The warehouse should produce these atomically on a schedule tied to the close calendar, with row counts and reserve totals reconciled to the source-of-truth ledger before the file is released to the actuarial grid.
Two integration patterns have emerged. The first is the 'thick warehouse, thin model' pattern: the warehouse pre-aggregates policies into compressed model points (typically 50:1 to 200:1 compression depending on product), and the projection model runs on these aggregates. The second is the 'full seriatim' pattern, increasingly viable as cloud compute costs have dropped — every policy is projected individually. Seriatim runs for a 2 million policy block on Prophet Enterprise Cloud now complete in 4-8 hours on a 500-node grid, compared with 36-72 hours five years ago. Carriers running stochastic-on-stochastic for VM-21 variable annuity reserves still need aggregation to fit the runtime budget, but for deterministic GAAP and statutory work, seriatim is becoming the default at top-25 carriers.
Data Quality Controls Specific to Life Insurance
Generic data quality tools (Great Expectations, Soda, dbt tests, Monte Carlo, Anomalo) cover the basics: null checks, referential integrity, freshness. Actuarial data needs more specific controls. Policy-month exposure must equal the prior month plus new business minus terminations — any unexplained variance signals a feed problem. Death claim amounts must reconcile to face amount net of policy loans and unpaid premium loans. For universal life, account value roll-forward must tie to the sum of premiums, interest credited, COI deductions, expense charges, and partial withdrawals, with a tolerance typically set at 1 cent per policy per month.
Identity resolution deserves separate attention. The same life insured frequently holds multiple policies — sometimes across multiple acquired blocks with different policyholder IDs. For mortality studies, treating these as independent overstates exposure and biases A/E ratios. The warehouse should run an MDM process (Informatica MDM, Reltio, or a custom Spark-based matcher) that resolves life-insured identity across systems using deterministic SSN matching where available, augmented by probabilistic matching on name, DOB, and address. At one closed-block consolidator, this resolution increased measured mortality exposure deduplication by 11% and shifted the credibility-weighted A/E ratio on the ages 75+ cohort by 4 points.
The PBR and LDTI Overlay
VM-20 (effective for new business since 2017, fully phased in 2020), VM-21 (variable annuity reserves), and VM-22 (effective January 1, 2026 for fixed annuities) each impose specific data lineage requirements. The actuary must demonstrate that the data used to develop prudent estimate assumptions is consistent with the data used to value the reserves. In practice this means the warehouse must support 'as-of' reproduction of any prior assumption study, retain the source data, and document the assumption-setting process under the company's PBR governance policy.
LDTI (ASU 2018-12, effective for SEC filers in 2023 and private companies in 2025) introduced its own data pressures: cohort-level data with locked-in discount rate assumptions, quarterly remeasurement of liability for future policy benefits, and required disclosures of the rollforward by cohort. Most carriers I have seen meet these requirements by introducing an LDTI mart in the warehouse, joined to the actuarial projection outputs, that calculates the net premium ratio, cohort-level reserves, and required disclosures. The same mart feeds the NAIC and statutory reporting processes and supports CECL-style credit loss disclosures for the asset side.
Implementation Roadmap
A realistic build for a mid-size carrier (1-5 million policies, 2-4 admin systems) runs 18-30 months end-to-end, with experience studies running on the new platform by month 12 and full retirement of legacy extracts by month 24-30. The pattern below is what I've seen succeed; it fails when carriers try to migrate all source systems simultaneously or when the actuarial team is not embedded in the platform team from day one.
Platform selection (Snowflake/Databricks/Synapse), governance model, identity resolution strategy. Stand up dev/test/prod environments with role-based access. Define conformed dimensions for policy, life insured, product, agent.
Pick the largest admin system. Build CDC ingestion, raw bronze layer with 20+ years of history, curated actuarial layer with policy-month facts. Reconcile to GL within 0.05%. Reproduce one prior mortality study end-to-end as validation.
Migrate mortality, lapse, and persistency studies. Build assumption tables as versioned data. Generate Prophet/AXIS model points from warehouse views. Run parallel for two valuation cycles.
Onboard remaining admin systems including acquired blocks. Cross-system identity resolution. Reinsurance treaty integration covered in the reinsurance automation article.
Deploy ML lapse and dynamic behavior models to the feature store. Retire legacy extracts and shared drives. Establish quarterly assumption governance cycle running entirely on warehouse.
The economics generally work out as follows for a 2 million policy carrier. Platform and tooling: $2-4M annually run-rate. Implementation services: $8-15M over 24 months. Internal team: 12-20 FTE (data engineering, actuarial systems, governance). Offsetting benefits: 30-50% reduction in actuarial systems FTE devoted to data preparation, 4-8 week reduction in experience study cycles, 1,500-2,500 hours of audit and PBR documentation effort avoided annually, and — harder to quantify but real — faster detection of emerging experience deterioration in lapse and mortality. One client identified an unexpected 180-basis-point uptick in post-level-term lapse rates on a specific 20-year term cohort within 8 weeks of the actual experience emerging; the prior process would have detected it 7-9 months later.
Assumptions are now data products, not spreadsheet outputs. The carriers that treat them that way close their books faster and defend their reserves better.
— Based on observed practice at top-20 U.S. life carriers
What to Get Right
Three things separate the successful builds from the ones that stall. First, the actuarial team must own the curated layer's data contracts — not the data engineering team. The columns, granularities, and definitions in the policy-month fact and life-insured dimension are actuarial decisions, and treating them as IT deliverables produces a warehouse that doesn't quite fit any study. Second, reconciliation to the general ledger must be automated, daily, and visible. Manual quarter-end reconciliation is a leading indicator that the warehouse will eventually be bypassed by an actuary who needs an answer faster. Third, the same warehouse must serve operational use cases — the customer portal, agent compensation, the in-force management workbench discussed in the in-force management article — because shared infrastructure is what funds the ongoing investment in data quality.
Done well, the actuarial data warehouse stops being a project and becomes the platform on which pricing, valuation, ALM, in-force management, and ML-driven decisioning all run. The carriers that have made this transition spend their actuarial talent on judgment and analysis. The ones that haven't are still spending it on reconciling SAS extracts to Excel pivots — and losing 2-3 days of every study cycle to a problem that was solved technically a decade ago.