Life Insurance & Annuities — Article 10 of 12

Data Warehousing for Actuarial Modeling (Mortality, Lapse, Persistency)

Actuarial models are only as defensible as the data feeding them. Modern life and annuity carriers are replacing nightly extracts and Access databases with cloud warehouses that serve Prophet, AXIS, and MG-ALFA from a single governed source — cutting experience study cycle times from quarters to weeks.

12 min read
Life Insurance & Annuities

Walk into the actuarial department of a mid-size U.S. life carrier and you will typically find three things: a Prophet or AXIS grid running on Azure or AWS, a shared drive containing 8-12 years of quarterly experience study spreadsheets, and at least one senior actuary who knows where the bodies are buried in the data feeds from the 1990s-era policy admin system. The model is sophisticated. The data pipeline feeding it usually is not.

For VM-20 principle-based reserves, GAAP LDTI (ASU 2018-12), Solvency II equivalents, and internal pricing, the assumption set — mortality, lapse, persistency, partial withdrawal, premium persistency, dynamic policyholder behavior — drives reserves and capital that can swing tens of millions of dollars on a single basis-point change. The data warehouse layer beneath those assumptions is now the binding constraint on model credibility, audit defensibility, and the speed at which actuarial teams can respond to emerging experience. This article covers what a modern actuarial data warehouse looks like, where carriers are landing on cloud platform choices, and the specific pipelines required to support mortality, lapse, and persistency studies at production grade.

Why the Old Architecture Is Failing

The legacy pattern at most carriers looks like this: the policy admin system (LifePRO, wmA, ALIP, Vantage, CyberLife, or a homegrown COBOL platform) produces a nightly or monthly extract. That extract is FTP'd to an actuarial server, transformed by SAS or Python scripts maintained by a single actuarial systems analyst, joined to a death claims feed from a separate claims module, and loaded into a model point file for the actuarial projection software. Each transformation is undocumented, version control is informal, and reconciliation to the general ledger is performed manually at quarter-end.

Three forces are breaking this model. First, VM-20 and LDTI require granular, auditable experience data with a clear lineage from raw policy records to assumption inputs — the NAIC Valuation Manual specifically calls for documentation of data quality reviews under VM-31. Second, machine learning approaches to lapse and dynamic policyholder behavior (gradient-boosted models, survival analysis with time-varying covariates) demand far richer feature sets than the 15-20 fields a typical model point file carries. Third, M&A activity in the closed-block space — Global Atlantic, Resolution Life, Venerable, Constellation — means actuarial teams now routinely inherit 4-7 source systems and need to integrate experience across them within 12 months of a deal close.

60-75%Share of an experience study cycle typically consumed by data preparation, reconciliation, and exception handling rather than analysis, based on benchmarks across mid-size U.S. life carriers

Reference Architecture for an Actuarial Data Warehouse

The architecture converging across carriers separates four layers: ingestion, a raw or bronze policy history store, a curated actuarial layer with conformed dimensions, and consumption-specific marts for experience studies, model point generation, and assumption monitoring. Snowflake and Databricks dominate the platform choice for greenfield builds; Azure Synapse appears at carriers already heavily committed to Microsoft, and a small number of large mutuals (Northwestern Mutual, MassMutual, New York Life) operate hybrid on-prem Teradata or Exadata environments alongside cloud lakes.

The ingestion layer typically uses Fivetran, Qlik Replicate, or Informatica IDMC to perform change data capture from the policy admin source. For mainframe sources, IBM InfoSphere CDC or Precisely Connect handle the EBCDIC-to-UTF8 conversion and copybook parsing. The raw layer stores every record version with effective-dated history — this is non-negotiable because experience studies must be able to reconstruct policy state as of any prior valuation date. A common rule of thumb: retain 25+ years of policy-level history for whole life and 15+ for term and annuity blocks, which aligns with the developed mortality study horizons used by the SOA Individual Life Experience Committee.

Platform Choices for Actuarial Data Warehouses
PlatformStrengthsCommon UseTypical Annual Cost (mid-size carrier)
SnowflakeSeparation of compute and storage, time travel for as-of reporting, strong governance via HorizonExperience studies, assumption monitoring, Prophet/AXIS feeds$1.2M-$3.5M
DatabricksNative ML for lapse models, Delta Lake versioning, MLflow integrationPredictive lapse, dynamic policyholder behavior, GenAI on policyholder text$1.5M-$4M
Azure Synapse / Microsoft FabricTight integration with Power BI, Purview lineage, Azure-native securityCarriers standardized on Microsoft stack$800K-$2.5M
On-prem Teradata / ExadataPredictable performance, existing skills, no data egress concernsLarge mutuals with established environments$3M-$8M (including hardware refresh)

The curated actuarial layer is where the discipline pays off. It contains conformed dimensions for policy, coverage, life insured, agent, product, and reinsurance treaty, plus fact tables for policy month-end status, transactions (premium, withdrawal, loan, dividend), and events (lapse, surrender, death, reinstatement). Effective-dated joins are standard. This layer should reconcile to the general ledger reserve roll-forward within a tolerance of 0.05% — anything wider and the assumption studies built on top will not survive an external audit under the Model Audit Rule.

Mortality Experience Studies: The Data Requirements

A defensible mortality study requires exposure measured in life-years (or amount-years for face-amount-weighted studies) and deaths attributed to the same exposure window. The data warehouse must produce, for every policy month, the in-force status, attained age, duration since issue, smoker status, underwriting class, face amount band, distribution channel, and any rider flags. For studies feeding VM-20 deterministic and stochastic reserves, the granularity needs to support the SOA 2015 VBT structure (relative risk 60-160% slopes) and the 2017 CSO valuation table mapping.

Where carriers get into trouble is death reporting lag and IBNR. Death claims arrive 30-180 days after the actual date of death; lapses initiated by the policyholder are sometimes later overturned as deaths when a claim arrives. A production-grade warehouse handles this with a 'completion factor' triangle calculated monthly, plus a reopener process that retroactively reclassifies prior lapses as deaths when matching claims arrive. The Society of Actuaries Individual Life Experience Committee documents typical reporting lags of 4-8 months for the 95th percentile of claims; experience studies that don't adjust for this systematically understate recent-duration mortality by 15-25%.

⚠️The Social Security DMF Gap
Since the 2011 restriction on the Social Security Administration's Death Master File public file, carriers have lost 30-40% of the historical death matches they previously used to validate policyholder mortality on non-claimed policies (typically smaller paid-up policies and unclaimed property candidates). Modern warehouses now integrate LexisNexis Risk Solutions, Verisk's LifeLink, or the NAIC-supported state unclaimed property feeds to close this gap. Without this, lapse studies on small-face whole life systematically misclassify deaths as lapses and overstate true voluntary lapse rates by 50-200 basis points at older durations.

Lapse and Persistency: From Static Tables to Predictive Models

Lapse modeling has changed more in the past five years than in the prior thirty. Traditional approaches produced lapse rate tables by duration and product, sometimes with a shock at the end of the surrender charge period. For variable and indexed annuities, dynamic lapse formulas adjusted base lapses based on the moneyness of guaranteed living benefits (GLWB, GMWB, GMAB). These approaches are still the regulatory backbone — VM-21 for variable annuities and VM-22 (effective 1/1/2026) for fixed annuities specify the framework — but carriers are increasingly running parallel ML models for pricing, in-force management, and early-warning monitoring.

The feature set required for a competitive lapse model goes well beyond what a typical model point carries. At one large indexed annuity carrier I worked with, the production model uses 87 features including: contract age in months, attained age, surrender charge percentage remaining, account value relative to premium, rider charge as percent of account value, recent index credit relative to peer products, distribution channel (career, IMO, bank, wirehouse), agent tenure, agent's book-level lapse rate, ZIP-code-level demographic overlays, and 24-month rolling interaction history (statement views, call center contacts, address changes). The data warehouse feeds this through a Databricks feature store that maintains point-in-time correctness — critical to avoid leakage when training on historical data.

We cut our experience study cycle from 14 weeks to 5 weeks once we stopped rebuilding the data layer for every study. The actuarial work didn't get faster — the data prep stopped being redone from scratch.
VP of Modeling, top-20 U.S. life and annuity carrier

Persistency for life insurance follows a different pattern. Whole life and universal life lapses cluster in years 1-3 (early lapse, often agent-driven), then settle into a long shallow tail. Term lapses spike at the end of the level premium period (year 10, 15, 20, or 30) when premiums step up to annually-renewable rates — Milliman's published studies show post-level-term lapse rates of 60-90% for 20-year term, with anti-selective mortality on persisters running 200-400% of the underlying VBT. Modeling this requires the warehouse to carry the full premium schedule, current and renewal premium amounts, and the conversion option terms by product. Most policy admin systems store this in product configuration tables that need to be conformed across acquired blocks — a problem the team covered in the policy administration article in this guide.

Typical Lapse Rate Curves by Product (illustrative, per industry studies)

Feeding the Models: Prophet, AXIS, MG-ALFA, and RAFM Integration

The actuarial projection systems — FIS Prophet, Moody's AXIS, Milliman MG-ALFA, and Aon's PathWise — each have their own model point formats and assumption table structures. A modern warehouse generates these as views or extracts rather than as standalone batch jobs. Prophet, the most widely deployed in U.S. life, accepts model points as fixed-width or CSV with column ordering matching the product workspace definition; AXIS uses its own structured input format with embedded validation. The warehouse should produce these atomically on a schedule tied to the close calendar, with row counts and reserve totals reconciled to the source-of-truth ledger before the file is released to the actuarial grid.

Two integration patterns have emerged. The first is the 'thick warehouse, thin model' pattern: the warehouse pre-aggregates policies into compressed model points (typically 50:1 to 200:1 compression depending on product), and the projection model runs on these aggregates. The second is the 'full seriatim' pattern, increasingly viable as cloud compute costs have dropped — every policy is projected individually. Seriatim runs for a 2 million policy block on Prophet Enterprise Cloud now complete in 4-8 hours on a 500-node grid, compared with 36-72 hours five years ago. Carriers running stochastic-on-stochastic for VM-21 variable annuity reserves still need aggregation to fit the runtime budget, but for deterministic GAAP and statutory work, seriatim is becoming the default at top-25 carriers.

🔍Assumption Governance Lives in the Warehouse
Under ASOP 23 (Data Quality) and ASOP 41 (Actuarial Communications), the actuary must be able to trace every assumption back to the underlying experience data, document any adjustments, and disclose limitations. Storing assumptions as versioned tables in the warehouse — with effective dates, study source, approver, and reviewer captured as columns — turns a previously manual audit response into a SQL query. One Top-15 carrier I worked with reduced their VM-31 documentation effort by approximately 1,800 hours per year by treating assumptions as warehouse artifacts rather than spreadsheet outputs.

Data Quality Controls Specific to Life Insurance

Generic data quality tools (Great Expectations, Soda, dbt tests, Monte Carlo, Anomalo) cover the basics: null checks, referential integrity, freshness. Actuarial data needs more specific controls. Policy-month exposure must equal the prior month plus new business minus terminations — any unexplained variance signals a feed problem. Death claim amounts must reconcile to face amount net of policy loans and unpaid premium loans. For universal life, account value roll-forward must tie to the sum of premiums, interest credited, COI deductions, expense charges, and partial withdrawals, with a tolerance typically set at 1 cent per policy per month.

Actuarial Data Quality Controls That Belong in the Warehouse

Identity resolution deserves separate attention. The same life insured frequently holds multiple policies — sometimes across multiple acquired blocks with different policyholder IDs. For mortality studies, treating these as independent overstates exposure and biases A/E ratios. The warehouse should run an MDM process (Informatica MDM, Reltio, or a custom Spark-based matcher) that resolves life-insured identity across systems using deterministic SSN matching where available, augmented by probabilistic matching on name, DOB, and address. At one closed-block consolidator, this resolution increased measured mortality exposure deduplication by 11% and shifted the credibility-weighted A/E ratio on the ages 75+ cohort by 4 points.

The PBR and LDTI Overlay

VM-20 (effective for new business since 2017, fully phased in 2020), VM-21 (variable annuity reserves), and VM-22 (effective January 1, 2026 for fixed annuities) each impose specific data lineage requirements. The actuary must demonstrate that the data used to develop prudent estimate assumptions is consistent with the data used to value the reserves. In practice this means the warehouse must support 'as-of' reproduction of any prior assumption study, retain the source data, and document the assumption-setting process under the company's PBR governance policy.

LDTI (ASU 2018-12, effective for SEC filers in 2023 and private companies in 2025) introduced its own data pressures: cohort-level data with locked-in discount rate assumptions, quarterly remeasurement of liability for future policy benefits, and required disclosures of the rollforward by cohort. Most carriers I have seen meet these requirements by introducing an LDTI mart in the warehouse, joined to the actuarial projection outputs, that calculates the net premium ratio, cohort-level reserves, and required disclosures. The same mart feeds the NAIC and statutory reporting processes and supports CECL-style credit loss disclosures for the asset side.

💡Did You Know?
The SOA's 2015 Valuation Basic Table was developed from approximately 266 million life-years of exposure and 1.6 million death claims contributed by 51 companies. Carriers that contribute to the SOA Individual Life Experience Study get back industry benchmarks at the 26-attribute level — and the data infrastructure required to participate is essentially the same infrastructure needed for internal experience studies.

Implementation Roadmap

A realistic build for a mid-size carrier (1-5 million policies, 2-4 admin systems) runs 18-30 months end-to-end, with experience studies running on the new platform by month 12 and full retirement of legacy extracts by month 24-30. The pattern below is what I've seen succeed; it fails when carriers try to migrate all source systems simultaneously or when the actuarial team is not embedded in the platform team from day one.

Typical Actuarial Data Warehouse Build
1
Months 1-3: Foundation

Platform selection (Snowflake/Databricks/Synapse), governance model, identity resolution strategy. Stand up dev/test/prod environments with role-based access. Define conformed dimensions for policy, life insured, product, agent.

2
Months 4-9: First Source System

Pick the largest admin system. Build CDC ingestion, raw bronze layer with 20+ years of history, curated actuarial layer with policy-month facts. Reconcile to GL within 0.05%. Reproduce one prior mortality study end-to-end as validation.

3
Months 10-15: Experience Study Migration

Migrate mortality, lapse, and persistency studies. Build assumption tables as versioned data. Generate Prophet/AXIS model points from warehouse views. Run parallel for two valuation cycles.

4
Months 16-22: Additional Source Systems

Onboard remaining admin systems including acquired blocks. Cross-system identity resolution. Reinsurance treaty integration covered in the reinsurance automation article.

5
Months 23-30: Predictive Modeling and Retirement

Deploy ML lapse and dynamic behavior models to the feature store. Retire legacy extracts and shared drives. Establish quarterly assumption governance cycle running entirely on warehouse.

The economics generally work out as follows for a 2 million policy carrier. Platform and tooling: $2-4M annually run-rate. Implementation services: $8-15M over 24 months. Internal team: 12-20 FTE (data engineering, actuarial systems, governance). Offsetting benefits: 30-50% reduction in actuarial systems FTE devoted to data preparation, 4-8 week reduction in experience study cycles, 1,500-2,500 hours of audit and PBR documentation effort avoided annually, and — harder to quantify but real — faster detection of emerging experience deterioration in lapse and mortality. One client identified an unexpected 180-basis-point uptick in post-level-term lapse rates on a specific 20-year term cohort within 8 weeks of the actual experience emerging; the prior process would have detected it 7-9 months later.

Assumptions are now data products, not spreadsheet outputs. The carriers that treat them that way close their books faster and defend their reserves better.

Based on observed practice at top-20 U.S. life carriers

What to Get Right

Three things separate the successful builds from the ones that stall. First, the actuarial team must own the curated layer's data contracts — not the data engineering team. The columns, granularities, and definitions in the policy-month fact and life-insured dimension are actuarial decisions, and treating them as IT deliverables produces a warehouse that doesn't quite fit any study. Second, reconciliation to the general ledger must be automated, daily, and visible. Manual quarter-end reconciliation is a leading indicator that the warehouse will eventually be bypassed by an actuary who needs an answer faster. Third, the same warehouse must serve operational use cases — the customer portal, agent compensation, the in-force management workbench discussed in the in-force management article — because shared infrastructure is what funds the ongoing investment in data quality.

Done well, the actuarial data warehouse stops being a project and becomes the platform on which pricing, valuation, ALM, in-force management, and ML-driven decisioning all run. The carriers that have made this transition spend their actuarial talent on judgment and analysis. The ones that haven't are still spending it on reconciling SAS extracts to Excel pivots — and losing 2-3 days of every study cycle to a problem that was solved technically a decade ago.

Frequently Asked Questions

Should we choose Snowflake or Databricks for an actuarial data warehouse?

Both work. Snowflake tends to win when the primary workload is SQL-based experience studies, model point generation, and BI; its time travel and zero-copy cloning features simplify as-of reproduction for VM-31 documentation. Databricks tends to win when ML-driven lapse and dynamic policyholder behavior modeling is a first-class workload, because the feature store and MLflow are tightly integrated. Many top-25 carriers run both, with Databricks for ML and Snowflake or Synapse for governed actuarial reporting.

How do we handle the death reporting lag in experience studies?

Build a completion factor triangle by report month vs. incurral month, refreshed monthly, and apply it to the most recent 6-12 months of exposure when computing A/E ratios. Augment internal claims data with LexisNexis or Verisk DMF-equivalent matches to catch deaths on lapsed and paid-up policies where a claim was never filed. Without this, recent-duration mortality is systematically understated by 15-25% and lapse rates at older durations are overstated by 50-200 basis points.

Can we generate Prophet and AXIS model points directly from the warehouse?

Yes, and this is now standard practice at carriers that have modernized. The warehouse produces model points as views or extracts conforming to the product workspace's expected schema, with row counts and reserve totals reconciled to the source-of-truth ledger before release to the actuarial grid. Both fixed-width and CSV formats are supported by Prophet; AXIS has its own structured input format. Atomic generation tied to the close calendar replaces the brittle SAS scripts most carriers used for the prior 15 years.

How long should we retain policy-level history in the warehouse?

For whole life and universal life blocks, retain 25+ years of effective-dated policy history to support mortality studies aligned with SOA reporting horizons. For term and annuities, 15+ years is typically sufficient. Critically, retain every record version with effective dates rather than only the current state — experience studies must reconstruct policy status as of any prior valuation date, and as-of reproduction is required under VM-31 and ASOP 23.

What is the realistic cost and timeline for a mid-size carrier?

For a carrier with 1-5 million policies and 2-4 admin systems, expect 18-30 months end-to-end, with experience studies running on the new platform by month 12. Platform and tooling typically run $2-4M annually; implementation services $8-15M over 24 months; internal team 12-20 FTE. Offsetting benefits include 30-50% reduction in actuarial systems FTE devoted to data prep, 4-8 week shorter experience study cycles, and substantial reduction in PBR and audit documentation effort.