
Private Equity — Article 2 of 12

Commercial Due Diligence Automation — TAM/SAM, Unit Economics

Traditional commercial due diligence consumes 6-10 weeks and $400K-$1.2M per deal. Alternative data, NLP, and panel analytics now compress that to 2-3 weeks at 40-60% lower cost — while producing market sizing, cohort economics, and win/loss intelligence that legacy consulting decks rarely deliver.


A mid-market buyout shop running 80-120 deals through IC each year typically spends $25-40 million annually on commercial due diligence — Bain, LEK, OC&C, EY-Parthenon, and a long tail of specialist boutiques. The average engagement runs 6-10 weeks at $400K-$1.2M, and roughly 70% of deals that receive a green-light CDD never close. That's $15-25 million per year of effective burn on transactions the firm walks away from. The economics of CDD have not changed materially since the early 2000s, but the underlying data inputs have. Bottom-up market sizing, cohort retention curves, win/loss analysis, and pricing benchmarks can now be assembled in days from alternative data, transcript libraries, and panel sources that did not exist five years ago.

This article is the second in our PE Portfolio Operating Model series. Article 1 covered upstream screening at the funnel-top. Here we focus on the next stage: turning a shortlist of 5-15 active targets per quarter into IC-ready commercial theses using automation. We'll cover the data stack, the modeling techniques, the unit-economics workflows, and the specific failure modes deal teams hit when they over-rely on machine outputs.

Why traditional CDD breaks at mid-market velocity

The classic CDD deliverable — a 120-slide deck with a market map, growth drivers, customer references, and a competitive landscape — was designed for $500M+ enterprise value deals where a six-week sprint is proportionate to the check size. In sub-$200M deals, which now represent the majority of US sponsor activity, that timeline is incompatible with auction dynamics. Sellers routinely demand final bids within 4-5 weeks of CIM release. Sponsors that wait for a finished consulting report either submit on instinct or lose the deal.

The deeper problem is that consultant CDDs rely on three primary inputs: (1) 15-25 expert calls via GLG, Guidepoint, or Third Bridge, (2) syndicated reports from IBISWorld, Frost & Sullivan, or Gartner, and (3) a customer reference list provided by the seller. Each input is biased. Expert networks oversample former employees willing to talk for $400/hour. Syndicated reports recycle 18-month-old data. Seller-provided references are curated. The output reflects the inputs — a directionally correct but rarely surprising market view.

Traditional vs. automated CDD economics

Dimension	Traditional CDD (Bain/LEK/OC&C)	Automation-led CDD
Duration	6-10 weeks	2-3 weeks
Cost per deal	$400K-$1.2M	$60K-$250K (incl. data subscriptions)
Expert calls	15-25 (network-curated)	8-12 (targeted, post-data hypothesis)
Market sizing approach	Top-down from syndicated reports	Bottom-up from panel and transaction data
Customer voice	Seller-curated reference list	NPS, G2 reviews, churn proxies from panel
Cohort economics	Provided by management	Reconstructed from card-panel or app data
Updateability	Static deck	Live dashboard, refreshed weekly

The automated CDD data stack

A working automation stack has five layers. The first is transcript and document intelligence: AlphaSense, Tegus, and Stream by AlphaSense together index over 150,000 expert call transcripts, every public 10-K and earnings call, and a growing library of private-company expert calls. A deal team running a B2B SaaS thesis can query 'NetSuite displacement in mid-market' and surface 200+ transcripts in under a minute. Tegus alone added roughly 35,000 transcripts in 2024 across vertical software, healthcare services, and industrials.

The second layer is web and behavioral telemetry. SimilarWeb, Semrush, and Sensor Tower (for mobile) provide traffic, keyword rank, and download share data that proxies for customer acquisition momentum. Bombora and 6sense surface buyer-intent signals — which companies are researching the target's category. For a recent vertical-SaaS deal in field service management, a SimilarWeb pull showed the target was losing organic share to two competitors at roughly 200 bps per quarter, a trend not visible in the CIM's revenue growth chart because the company was offsetting with paid acquisition at rising CAC.

The third layer is transaction panel data — Earnest Analytics, Facteus, Bloomberg Second Measure, YipitData, and Consumer Edge. These providers aggregate de-identified card and bank-transaction data covering 3-10 million US consumers (panels vary by provider) and increasingly merchant-side data for B2B. For consumer and consumer-adjacent deals, panel data lets a deal team reconstruct cohort retention, average ticket, frequency, and share-of-wallet without any management cooperation. Placer.ai and SafeGraph add foot-traffic data for physical retail and services.

The fourth layer is review and employer signal data — G2, Capterra, TrustRadius, Gartner Peer Insights, Glassdoor, and Comparably. Review velocity, sentiment delta, and competitive review-stealing are leading indicators of NPS movement. The fifth layer is firmographic and contact data — ZoomInfo, Apollo, Crunchbase, PitchBook, and LinkedIn Sales Navigator — used both for bottom-up TAM construction and for sourcing outreach to non-curated customer references.

The five-layer CDD data stack

Transcript & document intelligence

AlphaSense, Tegus, Stream — expert calls, filings, earnings transcripts indexed with semantic search and LLM summarization.

Web & behavioral telemetry

SimilarWeb, Semrush, Sensor Tower, Bombora, 6sense — traffic share, keyword rank, app downloads, buyer intent.

Transaction panel data

Earnest, Facteus, Second Measure, YipitData, Consumer Edge — card-panel cohort analytics and merchant-level revenue proxies.

Review & employer signal

G2, Capterra, Gartner Peer Insights, Glassdoor, Comparably — sentiment, NPS proxies, attrition risk.

Firmographic & contact

ZoomInfo, Apollo, PitchBook, Crunchbase — TAM construction, off-list reference sourcing, competitor mapping.

Bottom-up TAM/SAM construction

The single most over-cited number in any CIM is TAM. Sellers default to top-down sizing — 'the global field service management market is $5.7 billion growing at 11%' — sourced from a syndicated report that aggregated assumptions from other syndicated reports. Bottom-up sizing flips this. The deal team starts with a unit-of-account definition (e.g., 'US HVAC contractors with 5-50 technicians'), pulls the universe from ZoomInfo or D&B Hoovers (typically 38,000-42,000 firms in this example), applies an addressable filter (uses dispatch software: roughly 55% based on Bombora and survey data), applies an ACV band ($4,800-$18,000 from competitor pricing pages and G2 reviews), and arrives at a defensible SAM of $180-310 million.

Bottom-up SAM

SAM = Σ (Addressable accounts in segment × Penetration probability × Realized ACV × Multi-product attach)

Each variable should be sourced independently — accounts from firmographic data, penetration from intent/install data, ACV from competitor pricing scrapes, attach from G2 module reviews. Triangulate against top-down only as a sanity check, never as the primary number.
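As a rough illustration, the SAM formula can be sketched in a few lines of Python. The `Segment` class and the three-segment split are hypothetical and exist only to show the mechanics; the 38,000-42,000 firm count, the 55% addressable filter, and the $4,800-$18,000 ACV band are taken from the HVAC example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    """One row of the bottom-up build (hypothetical structure)."""
    name: str
    accounts: int         # addressable accounts, e.g. from firmographic data
    penetration: float    # share expected to buy in-category, 0-1
    acv: float            # realized annual contract value, USD
    attach: float = 1.0   # multi-product attach multiplier (1.0 = one product)


def bottom_up_sam(segments: list[Segment]) -> float:
    """SAM = sum over segments of accounts x penetration x ACV x attach."""
    return sum(s.accounts * s.penetration * s.acv * s.attach for s in segments)


# Envelope from the HVAC inputs: the extremes of the firm count and ACV band,
# with the 55% addressable filter applied to both.
sam_floor = bottom_up_sam([Segment("all firms, low ACV", 38_000, 0.55, 4_800)])
sam_ceiling = bottom_up_sam([Segment("all firms, high ACV", 42_000, 0.55, 18_000)])

# Illustrative segment split. The counts, rates, and ACVs below are made up
# to show the mechanics; they are not figures from any deal.
segments = [
    Segment("5-15 technicians", 30_000, 0.50, 6_000),
    Segment("16-30 technicians", 9_000, 0.65, 11_000),
    Segment("31-50 technicians", 3_000, 0.75, 18_000, attach=1.2),
]
sam_estimate = bottom_up_sam(segments)

print(f"Envelope: ${sam_floor / 1e6:.0f}M to ${sam_ceiling / 1e6:.0f}M")
print(f"Segmented estimate: ${sam_estimate / 1e6:.0f}M")
```

Taking every input at its extreme gives an envelope of roughly $100M to $416M. The $180-310M range sits inside it, which is what one would expect once ACV is weighted by segment instead of applied at the band edges.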

The output is not just a number — it's a segment map. A bottom-up build reveals which 8,000 of the 42,000 accounts are concentrated in three states with high HVAC density, which 1,200 are already on a competitor (and therefore replacement opportunities rather than greenfield), and which 6,000 fall outside the target's typical deal motion. This level of segmentation rarely appears in a consultant CDD deck because it requires structured firmographic data the consultants don't license. It's also the input the post-close 100-day plan needs — see Article 5 on integration planning.

$180-310M: Defensible SAM range for a US mid-market HVAC dispatch software thesis, built bottom-up from 42,000 firmographic records vs. a $5.7B top-down figure cited in the CIM

Unit economics reconstruction without management cooperation

The hardest part of CDD is validating the unit economics the management team presents. CIMs typically include a 'cohort retention' chart, an LTV/CAC ratio, and a payback period. All three are easy to manipulate. Cohort charts can be smoothed by excluding the most recent (worst-retaining) cohorts under the rationale that they are 'immature.' LTV calculations conveniently assume gross-margin expansion. CAC excludes founder time and a portion of paid acquisition spend.

Automated CDD reconstructs these metrics from external data wherever possible. For consumer subscription businesses, Earnest Analytics or Second Measure panel data can produce 12-, 24-, and 36-month dollar retention curves at the merchant level. We ran this on a meal-kit deal in 2024 where management showed 78% 12-month dollar retention; the Earnest panel reconstruction showed 61%, and the cohort decay had accelerated in the prior six months. The deal repriced down by 1.4 turns of EBITDA.
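A minimal sketch of the cohort reconstruction follows, assuming the panel export arrives as simple (customer, date, amount) rows. Real provider schemas differ, and the function names here are our own. It groups customers by the month of their first purchase and expresses each later month's spend as a share of month-0 spend. Months a cohort has not yet reached are returned as `None` rather than zero, so immature cohorts are shown as incomplete instead of being dropped.

```python
from collections import defaultdict
from datetime import date


def months_between(start: date, end: date) -> int:
    """Whole calendar months from start to end (day of month ignored)."""
    return (end.year - start.year) * 12 + (end.month - start.month)


def dollar_retention(transactions, as_of: date, horizon: int = 12):
    """Build dollar-retention curves by acquisition-month cohort.

    transactions: list of (customer_id, purchase_date, amount) tuples.
    Returns {(year, month): [ratio at month 0..horizon]}, where each ratio is
    cohort spend in that month divided by cohort spend in month 0. Months the
    cohort has not yet reached by `as_of` are None rather than 0.
    """
    # First pass: find each customer's first purchase date.
    first_seen = {}
    for cust, day, _ in transactions:
        if cust not in first_seen or day < first_seen[cust]:
            first_seen[cust] = day

    # Second pass: bucket spend by cohort and by months since first purchase.
    spend = defaultdict(lambda: defaultdict(float))
    for cust, day, amount in transactions:
        start = first_seen[cust]
        age = months_between(start, day)
        if age <= horizon:
            spend[(start.year, start.month)][age] += amount

    curves = {}
    for cohort, by_age in spend.items():
        base = by_age.get(0, 0.0)
        if base <= 0:
            continue  # cannot normalize a cohort with no month-0 spend
        observed = months_between(date(cohort[0], cohort[1], 1), as_of)
        curves[cohort] = [
            by_age.get(m, 0.0) / base if m <= observed else None
            for m in range(horizon + 1)
        ]
    return curves
```

Comparing the month-12 value of each cohort against the management chart is the step that surfaces a gap like 78% versus 61%. The most recent calendar month is usually partial, so in practice it should be excluded or flagged before any comparison.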

For B2B SaaS, transcript libraries and LinkedIn departure data substitute for direct cohort access. Tracking customer logos from case studies and press releases against current customer lists (often visible on a target's website or scrapeable from G2) lets you compute logo retention. A clean rule of thumb: if 30%+ of named case-study customers from 24+ months ago no longer appear in current marketing materials, gross logo retention is likely below 85%. Combine this with LinkedIn data on departures of buyer-side contacts at named customer accounts, which serves as a leading indicator of churn risk.
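The case-study rule of thumb reduces to a set comparison once company names are normalized. The sketch below assumes two hand-collected lists, one from case studies 24+ months old and one from current marketing materials. The suffix list and function names are illustrative, and the 30% threshold is the one stated in the rule of thumb.

```python
import re

# Legal suffixes to ignore when matching names (illustrative, not exhaustive).
_SUFFIXES = {"inc", "llc", "ltd", "corp", "co", "company"}


def normalize(name: str) -> str:
    """Lowercase, drop punctuation and common legal suffixes."""
    tokens = re.sub(r"[^a-z0-9 ]", " ", name.lower()).split()
    return " ".join(t for t in tokens if t not in _SUFFIXES)


def case_study_attrition(historical: list[str], current: list[str],
                         threshold: float = 0.30) -> dict:
    """Share of historical case-study logos absent from current materials."""
    past = {normalize(n) for n in historical}
    now = {normalize(n) for n in current}
    if not past:
        raise ValueError("historical list is empty")
    missing = sorted(past - now)
    share = len(missing) / len(past)
    return {"missing": missing, "missing_share": share, "flag": share >= threshold}


# Hypothetical company names, for illustration only.
result = case_study_attrition(
    historical=["Acme HVAC, Inc.", "Polar Air LLC", "Summit Mechanical",
                "Blue Ridge Heating Co."],
    current=["ACME HVAC", "Summit Mechanical Inc", "Delta Cooling"],
)
print(result)
```

A missing logo is a prompt for follow-up, not proof of churn: customers also disappear from marketing pages after rebrands, acquisitions, or expired logo permissions, so flagged names are best routed into the off-list outreach queue.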

⚠️ Where panel data lies

Card-panel coverage skews toward US consumers aged 25-55 with credit-card-heavy spend. For deals targeting cash-heavy demographics, international markets, or B2B-only revenue, panel-based revenue reconstruction can be off by 30-50%. Always validate panel coverage ratio against reported revenue for at least four trailing quarters before using panel curves to challenge management cohorts. If the panel covers <8% of reported revenue, treat directionally only.
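The coverage check is simple enough to encode as a gate before any cohort work starts. In this sketch the 8% floor and the four-quarter minimum come from the guidance above. The stability tolerance (`max_relative_spread`) is our own assumption, since no numeric definition of stable coverage is given.

```python
from statistics import mean


def panel_coverage(panel_revenue, reported_revenue,
                   min_coverage: float = 0.08,
                   max_relative_spread: float = 0.25) -> dict:
    """Compare panel-observed revenue to reported revenue, quarter by quarter.

    Both inputs are equal-length sequences covering the same trailing quarters.
    max_relative_spread is an assumed tolerance: (max - min) / mean of the
    quarterly coverage ratios.
    """
    if len(panel_revenue) != len(reported_revenue):
        raise ValueError("inputs must cover the same quarters")
    if len(panel_revenue) < 4:
        raise ValueError("need at least four trailing quarters")

    ratios = [p / r for p, r in zip(panel_revenue, reported_revenue)]
    avg = mean(ratios)
    spread = (max(ratios) - min(ratios)) / avg if avg else 0.0

    meets_floor = min(ratios) >= min_coverage
    stable = spread <= max_relative_spread
    return {
        "ratios": ratios,
        "meets_floor": meets_floor,
        "stable": stable,
        "verdict": "usable" if meets_floor and stable else "directional only",
    }
```

Requiring every quarter to clear the floor, rather than the average, is a deliberately conservative choice: one thin quarter is enough to distort a cohort curve.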

Voice of customer at scale

Traditional CDD includes 8-15 customer reference calls, all curated by the seller. Automated workflows expand this in two directions. First, structured review mining: pulling every G2, Capterra, and TrustRadius review for the target and its top five competitors, then running sentiment and topic extraction. A typical mid-market SaaS target has 200-800 reviews; the top three competitors combined will have 2,000-6,000. LLM-based topic clustering surfaces the specific feature gaps, pricing complaints, and switching triggers that drive churn — at a level of granularity 12 reference calls cannot produce.
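To make the review-mining step concrete, here is a deliberately simple sketch in which keyword matching stands in for the LLM-based topic clustering. The topic dictionary is hypothetical and would be replaced by model-derived clusters in a production workflow. The point is the output shape: the share of reviews touching each topic, computed the same way for the target and for each competitor so the two can be compared.

```python
import re
from collections import Counter

# Hypothetical topic dictionary; a real workflow would derive topics from the
# review corpus rather than hard-code them.
TOPIC_KEYWORDS = {
    "pricing": {"price", "pricing", "expensive", "cost", "renewal"},
    "support": {"support", "ticket", "response", "onboarding"},
    "mobile": {"mobile", "app", "offline"},
    "integrations": {"integration", "integrations", "api", "sync"},
}


def tag_topics(review: str) -> set[str]:
    """Return every topic whose keyword set overlaps the review's words."""
    tokens = set(re.findall(r"[a-z]+", review.lower()))
    return {topic for topic, words in TOPIC_KEYWORDS.items() if tokens & words}


def topic_share(reviews: list[str]) -> dict[str, float]:
    """Fraction of reviews that mention each topic at least once."""
    if not reviews:
        raise ValueError("no reviews supplied")
    counts = Counter(topic for r in reviews for topic in tag_topics(r))
    return {topic: counts[topic] / len(reviews) for topic in TOPIC_KEYWORDS}


# Made-up review snippets, for illustration only.
target_reviews = [
    "Pricing jumped at renewal",
    "Mobile app crashes offline",
    "Great support",
]
print(topic_share(target_reviews))
```

Running the same function over competitor reviews and differencing the shares gives a first-pass view of where the target is over- or under-indexed on complaints.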

Second, off-list outreach: using LinkedIn Sales Navigator and Apollo to identify former customers (people who held buyer-side roles at named customers but have since changed jobs) and current non-customers in the ICP. A two-week outreach campaign typically yields 15-25 unscripted conversations, half of which are with churned customers or evaluators who chose a competitor. The win/loss intelligence from these is materially more useful than seller-curated love letters.

“We re-priced a vertical SaaS deal by 22% after the off-list outreach surfaced that three of the four largest 'reference' customers were in active RFP processes to replace the platform. None of that was in the CDD the seller's banker had commissioned.”

— Operating Partner, mid-market growth equity fund

Competitive intelligence and pricing benchmarks

Competitive maps in consulting CDDs are usually 2x2 grids with vendor logos placed based on management's view. Automated competitive intelligence is built from public artifacts: pricing pages (scraped weekly, with diff tracking), product release notes, hiring patterns from LinkedIn, podcast and conference appearances, and patent filings. A weekly diff of competitor pricing pages over a 12-month period frequently reveals strategic shifts — a competitor moving from per-seat to consumption pricing, dropping a free tier, or introducing an enterprise SKU — that signal where the market is heading.
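The diff-tracking step can be sketched with Python's standard `difflib`. This assumes the scraper has already reduced each weekly snapshot to plain text, one pricing element per line. The function name and sample snapshots are hypothetical.

```python
import difflib


def _clean(snapshot: str) -> list[str]:
    """Strip whitespace and drop blank lines so layout noise is ignored."""
    return [line.strip() for line in snapshot.splitlines() if line.strip()]


def pricing_changes(previous: str, current: str) -> list[dict]:
    """List the blocks of lines that differ between two page snapshots."""
    prev, cur = _clean(previous), _clean(current)
    matcher = difflib.SequenceMatcher(a=prev, b=cur, autojunk=False)
    changes = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":  # tag is 'replace', 'delete', or 'insert'
            changes.append({"change": tag, "before": prev[i1:i2], "after": cur[j1:j2]})
    return changes


# Made-up snapshots showing a move from per-seat to consumption pricing.
last_week = """
Starter
$49 per seat / month
Free tier: up to 3 users
"""
this_week = """
Starter
$0.02 per job dispatched
Enterprise: contact sales
"""
for change in pricing_changes(last_week, this_week):
    print(change)
```

Storing each week's change list alongside the snapshot date yields the 12-month history in which shifts such as a dropped free tier become visible.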

Hiring data is particularly underused. A competitor opening 12 enterprise AE roles in the Northeast while the target's hiring is flat is a leading indicator of market-share movement that won't show in revenue for 6-9 months. LinkedIn Talent Insights, Revelio Labs, and Live Data Technologies provide structured hiring and attrition feeds that PE deal teams can correlate to revenue trajectories.

💡 Did You Know?

Revelio Labs aggregates the public LinkedIn profiles of roughly 1 billion individuals into a structured workforce database. For a recent CDD on a logistics software target, headcount-weighted attrition at the top three competitors had jumped from 14% to 22% over four quarters — a signal the target's organic growth tailwind was largely a function of competitor instability, not product superiority.
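Headcount weighting matters because a simple average lets a small competitor's attrition count as much as a large one's. A minimal sketch, assuming each competitor is described by an average headcount and an annual attrition rate taken from a workforce data feed (the input figures below are invented):

```python
def headcount_weighted_attrition(companies: list[tuple[float, float]]) -> float:
    """companies: (average_headcount, annual_attrition_rate) per competitor."""
    total_headcount = sum(h for h, _ in companies)
    if total_headcount <= 0:
        raise ValueError("total headcount must be positive")
    return sum(h * rate for h, rate in companies) / total_headcount


# Invented inputs for three competitors of different sizes.
competitors = [(2000, 0.25), (1000, 0.20), (1000, 0.18)]
weighted = headcount_weighted_attrition(competitors)
simple = sum(rate for _, rate in competitors) / len(competitors)
print(f"weighted {weighted:.1%} vs simple average {simple:.1%}")
```

Here the largest competitor pulls the weighted figure to 22.0%, a point above the 21.0% simple average.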

The end-to-end workflow: two weeks instead of eight

14-day automated CDD sprint

Days 1-2: Hypothesis structuring

Deal team translates investment thesis into 8-12 testable claims. Each claim is mapped to specific data sources and falsification criteria before any work begins.

Days 3-5: Bottom-up TAM/SAM build

Firmographic universe pulled, segmented, intent-filtered. Competitor pricing pages scraped, ACV bands triangulated. Output: defensible SAM with segment map.

Days 4-8: Unit economics reconstruction

Panel data licensed for target and 3-5 competitors. Cohort curves, ticket size, frequency reconstructed and compared to management figures. Variance flagged.

Days 6-10: Voice of customer

Review mining across G2, Capterra, TrustRadius, Gartner Peer Insights. Off-list outreach launched via Apollo to former buyers and churned customers.

Days 9-12: Targeted expert calls

8-12 expert calls via Tegus/GLG focused on hypotheses not resolvable from data alone. Each call is hypothesis-led, not exploratory.

Days 11-14: Synthesis and IC memo

Findings synthesized into IC memo with live dashboards. Each claim links to source data. Outputs handed off to QoE workstream and value-creation planning.

The sequencing matters. Teams that start with expert calls before data work end up with directional anchors that bias the data interpretation. Teams that start with data and use expert calls only to resolve specific unresolved questions get sharper outputs in less time. The 8-12 hypothesis-led expert calls in days 9-12 are typically more productive than the 15-25 exploratory calls in a traditional CDD because each call has a defined question.
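One way to enforce that sequencing is a small claim register, filled in on days 1-2 and consulted again before expert calls are booked. The structure below is a hypothetical sketch, not a description of any firm's tooling, and the sample claims are invented.

```python
from dataclasses import dataclass
from enum import Enum


class Status(Enum):
    OPEN = "open"              # not yet tested against data
    SUPPORTED = "supported"    # data is consistent with the claim
    FALSIFIED = "falsified"    # the falsification criterion was met
    UNRESOLVED = "unresolved"  # data was pulled but proved inconclusive


@dataclass
class Claim:
    text: str
    data_sources: list[str]
    falsified_if: str          # written down before any data is pulled
    status: Status = Status.OPEN


def expert_call_agenda(claims: list[Claim], max_calls: int = 12) -> list[str]:
    """Questions for the days 9-12 calls: only claims the data left unresolved."""
    unresolved = [c for c in claims if c.status is Status.UNRESOLVED]
    return [c.text for c in unresolved[:max_calls]]


# Invented claims, for illustration only.
claims = [
    Claim("SAM exceeds $150M", ["firmographic universe", "pricing scrape"],
          "bottom-up build lands below $150M", Status.SUPPORTED),
    Claim("Churn is driven by pricing, not product gaps", ["review mining"],
          "product-gap topics outnumber pricing topics among churned reviewers",
          Status.UNRESOLVED),
    Claim("Dealer channel is defensible", ["expert transcripts"],
          "two or more dealers report active competitor pilots",
          Status.UNRESOLVED),
]
print(expert_call_agenda(claims))
```

Claims still marked open at day 9 are excluded on purpose: a call on an untested claim is an exploratory call under another name.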

Cost and ROI of an in-house CDD platform

A PE firm running 60-100 active diligence processes per year can justify an internal CDD platform on cost alone. Annual data subscriptions for a full stack (AlphaSense, Tegus, SimilarWeb, ZoomInfo, G2 data, one panel provider, one workforce data provider) run $800K-$1.4M. A team of 4-6 people (a data lead, two analysts, a data engineer, and a workflow owner) costs $1.5-2.2M loaded. Total annual cost: $2.3-3.6M.

Against this, a firm previously spending $25-40M on external CDD that shifts 60-70% of work in-house saves $12-20M per year net of platform costs. Equally important, the diligence outputs become reusable — the TAM build for the deal you walked away from becomes the starting point for the next thesis in the same vertical, and a feed into add-on identification for existing portfolio companies.
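The arithmetic behind these ranges can be checked with a short calculation. The sketch below makes a simplifying assumption that is ours, not the article's: work shifted in-house costs nothing beyond the platform itself. All inputs are in $M and come from the figures above.

```python
Range = tuple[float, float]  # (low, high), in $M


def add_ranges(a: Range, b: Range) -> Range:
    """Add two ranges endpoint by endpoint."""
    return (a[0] + b[0], a[1] + b[1])


def net_savings(external_spend: Range, shifted_share: Range, platform: Range) -> Range:
    """Envelope of savings: worst case pairs low spend with high platform cost."""
    low = external_spend[0] * shifted_share[0] - platform[1]
    high = external_spend[1] * shifted_share[1] - platform[0]
    return (low, high)


subscriptions: Range = (0.8, 1.4)
team: Range = (1.5, 2.2)
platform = add_ranges(subscriptions, team)

savings = net_savings(
    external_spend=(25.0, 40.0),
    shifted_share=(0.60, 0.70),
    platform=platform,
)
print(f"Platform cost: ${platform[0]:.1f}M to ${platform[1]:.1f}M")
print(f"Net savings envelope: ${savings[0]:.1f}M to ${savings[1]:.1f}M")
```

The platform total reproduces the $2.3-3.6M figure. The naive savings envelope comes out at roughly $11.4M to $25.7M, so the $12-20M range quoted above sits inside it and is the more conservative estimate at the top end.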

[Figure: Annual CDD spend, traditional vs. hybrid in-house model ($M, 80 deals/yr)]

Where automation still fails

Automation is weakest on three dimensions. First, qualitative regulatory and reimbursement nuance — particularly in healthcare services, where a 10-minute call with a former CMS policy lead can surface risks no data source captures. Second, channel and distribution complexity — multi-tier industrial distribution, GPO dynamics in healthcare, or dealer-network economics in automotive aftermarket rarely have clean external data signatures. Third, technology depth — assessing whether a target's tech stack will scale or collapse under integration load is a separate workstream covered in Article 4 on technology due diligence.

When to keep using an external consultant

Deal size above $500M EV where time pressure is lower and a branded report supports LP confidence

Regulated industries (healthcare services, financial services, regulated industrials) where policy expertise is the binding constraint

International deals where panel and firmographic data coverage is thin (most of LATAM, parts of APAC, MENA)

First deals in a new vertical where the firm lacks pattern recognition and needs to build a mental model

Situations where a strategic investor or LP co-invest requires third-party validation

“The point of automation isn't to eliminate consultants. It's to stop paying consultants to do bottom-up TAM builds and cohort math that data engineers can do in three days for 10% of the cost.”
— Head of Portfolio Operations, $18B AUM PE firm

Building the operating model

Firms that have made this transition successfully — Vista Equity Partners, Thoma Bravo, TA Associates, Bain Capital Tech, Hg, and several large credit-platform sponsors — share a common structure. A central data and insights team of 6-15 people sits between deal teams and external vendors. Deal teams brief the central team at IOI stage with a structured hypothesis template. The central team executes the data workstreams in parallel with the deal team's commercial work. External consultants are engaged selectively for the 15-25% of analysis that requires deep sector or regulatory expertise.

The cultural shift is harder than the technical one. Deal partners trained on Bain decks instinctively trust a 120-slide consulting output more than a 30-page memo with live dashboards, even when the memo contains better evidence. Firms that have made this work invest deliberately in IC training, require deal teams to defend every claim against source data, and increasingly tie a portion of partner compensation to post-close revenue and margin variance against CDD projections. Over a 24-month window, that variance is the only honest scorecard for CDD quality — automated or otherwise.

Frequently Asked Questions

How much of commercial due diligence can realistically be automated today?

About 60-75% of a typical mid-market CDD workstream — bottom-up TAM/SAM, cohort reconstruction, competitive pricing, review mining, and win/loss analysis — can be automated or heavily data-led. The remaining 25-40%, particularly regulatory nuance, channel complexity, and category creation theses, still benefits from human expert input.

What is the minimum deal volume to justify an internal CDD platform?

Roughly 40-50 active diligence processes per year. Below that, the $2.3-3.6M annual cost of subscriptions, data engineering, and a dedicated team is hard to amortize. Firms under that threshold typically get most of the benefit by subscribing to AlphaSense, Tegus, and SimilarWeb and partnering with specialist data-led CDD providers rather than building internally.

Which alternative data sources are most useful for B2B SaaS diligence specifically?

G2, Capterra, and Gartner Peer Insights for review and sentiment; Bombora and 6sense for buyer intent; LinkedIn Sales Navigator and Revelio Labs for hiring and attrition signals at customers and competitors; Tegus and AlphaSense for transcripts. Card-panel data is rarely useful for B2B SaaS — the panels don't have meaningful coverage of corporate purchasing.

How do we validate that panel-based cohort reconstructions are reliable?

Compute the panel's coverage ratio — total panel-observed revenue divided by reported revenue — for at least four trailing quarters. If coverage is stable at 8%+ and the panel-observed growth rate tracks reported growth within 200-300 bps, the cohort curves are usable. If coverage is below 8% or unstable, use panel data directionally and rely on management cohorts with independent triangulation.

Does automated CDD reduce the need for expert calls?

It reduces the volume but increases the value of each call. Traditional CDDs run 15-25 exploratory calls; automated workflows typically run 8-12 hypothesis-led calls in the back half of the sprint, where each call is designed to resolve a specific question the data couldn't answer. Total spend on expert networks falls 40-60% but yield per call improves materially.