
Fraud & AML

What is a fraud model output calibration report?

A fraud model output calibration report measures how accurately a fraud detection model's predicted probabilities match actual fraud rates across different score ranges. It validates whether a model that predicts 20% fraud probability actually sees 20% fraud in practice.

Why It Matters

Poorly calibrated fraud models can inflate false positives by 40-60%, driving up customer friction and adding an estimated $2-5 in operational cost per declined legitimate transaction. Well-calibrated models reduce manual review queues by 25-35% while maintaining detection rates above 85%. Supervisory guidance on model risk management, such as the Federal Reserve's SR 11-7, requires documented model validation, making calibration reports essential for compliance audits and model governance.

How It Works in Practice

  1. Segment transaction volumes into 10-20 score buckets based on model output probabilities
  2. Calculate the actual fraud rate within each bucket over a rolling 90-day period
  3. Compare predicted versus observed fraud rates using statistical measures such as the Brier score
  4. Generate visual plots showing calibration curves and reliability diagrams
  5. Document buckets where predicted and actual rates deviate by more than a 5% threshold
  6. Recommend model retraining or recalibration when systematic bias is detected
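The steps above can be sketched in a short Python function. This is a minimal illustration, not a standard implementation: the function name, the 10-bucket default, and the 5% deviation threshold are assumptions taken from the workflow described.

```python
import numpy as np

def calibration_report(scores, labels, n_buckets=10, threshold=0.05):
    """Bucket model scores, compare predicted vs. observed fraud rates,
    and flag buckets whose absolute deviation exceeds the threshold.

    scores: model fraud probabilities in [0, 1]; labels: 0/1 fraud outcomes.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    edges = np.linspace(0.0, 1.0, n_buckets + 1)
    rows = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        # Last bucket is closed on the right so scores of exactly 1.0 are kept.
        mask = (scores >= lo) & ((scores < hi) if hi < 1.0 else (scores <= hi))
        if not mask.any():
            continue  # skip empty buckets rather than dividing by zero
        predicted = scores[mask].mean()   # mean model probability in bucket
        observed = labels[mask].mean()    # actual fraud rate in bucket
        rows.append({
            "bucket": (round(lo, 2), round(hi, 2)),
            "n": int(mask.sum()),
            "predicted": predicted,
            "observed": observed,
            "flagged": bool(abs(predicted - observed) > threshold),
        })
    return rows
```

Each flagged row corresponds to step 5's documentation requirement; a run of flagged buckets on one side of the diagonal is the systematic bias that step 6 treats as a recalibration trigger.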

Common Pitfalls

Using insufficient sample sizes in score buckets leads to unreliable calibration metrics and false confidence in model performance
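One way to quantify this pitfall is a normal-approximation margin of error on each bucket's observed fraud rate: if the margin is wide relative to the deviation threshold, the bucket cannot support a calibration conclusion. A rough sketch (the helper name and the 95% z-value are illustrative assumptions):

```python
import math

def bucket_rate_margin(frauds, total, z=1.96):
    """Approximate 95% margin of error for a bucket's observed fraud rate,
    using the normal approximation to the binomial. A margin wider than the
    report's deviation threshold means the bucket's sample is too small."""
    p = frauds / total
    return z * math.sqrt(p * (1 - p) / total)
```

For example, 5 frauds in 100 transactions gives a margin of roughly ±4.3 percentage points, which swamps a 5% deviation threshold; the same 5% rate over 10,000 transactions narrows the margin to about ±0.4 points.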

Failing to account for seasonal fraud patterns when measuring calibration can mask model degradation during high-risk periods

Regulatory examiners may reject calibration reports that don't include holdout test sets, violating model validation requirements under SR 11-7 guidance

Key Metrics

| Metric | Target | Formula |
| --- | --- | --- |
| Calibration Error | <5% | Average absolute difference between predicted and observed fraud rates across all score buckets |
| Brier Score | <0.15 | Sum of (predicted_probability - actual_outcome)² divided by total observations |
| Model Reliability | >90% | Percentage of score buckets where predicted and actual fraud rates differ by less than 3% |
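The three metrics above map directly to short computations: the Brier score works on raw (probability, outcome) pairs, while calibration error and reliability work on per-bucket rates. A minimal sketch, with illustrative function names:

```python
import numpy as np

def brier_score(probs, outcomes):
    """Sum of (predicted_probability - actual_outcome)^2 over total observations,
    i.e. the mean squared error against 0/1 fraud outcomes."""
    probs = np.asarray(probs, dtype=float)
    outcomes = np.asarray(outcomes, dtype=float)
    return float(np.mean((probs - outcomes) ** 2))

def calibration_error(pred_rates, obs_rates):
    """Average absolute difference between predicted and observed fraud
    rates across score buckets."""
    diffs = np.abs(np.asarray(pred_rates) - np.asarray(obs_rates))
    return float(np.mean(diffs))

def model_reliability(pred_rates, obs_rates, tol=0.03):
    """Share of buckets where predicted and observed rates differ by
    less than tol (3% per the table above)."""
    diffs = np.abs(np.asarray(pred_rates) - np.asarray(obs_rates))
    return float(np.mean(diffs < tol))
```

A perfectly sharp, perfectly correct model scores a Brier score of 0; a model that always outputs 0.5 scores 0.25 regardless of outcomes, which is why the <0.15 target implies both calibration and discrimination.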
