A mid-market buyout shop running 80-120 deals through IC each year typically spends $25-40 million annually on commercial due diligence — Bain, LEK, OC&C, EY-Parthenon, and a long tail of specialist boutiques. The average engagement runs 6-10 weeks at $400K-$1.2M, and roughly 70% of deals that receive a green-light CDD never close. That's $15-25 million per year of effective burn on transactions the firm walks away from. The economics of CDD have not changed materially since the early 2000s, but the underlying data inputs have. Bottom-up market sizing, cohort retention curves, win/loss analysis, and pricing benchmarks can now be assembled in days from alternative data, transcript libraries, and panel sources that did not exist five years ago.
This article is the second in our PE Portfolio Operating Model series. Article 1 covered upstream screening at the top of the funnel. Here we focus on the next stage: using automation to turn a shortlist of 5-15 active targets per quarter into IC-ready commercial theses. We'll cover the data stack, the modeling techniques, the unit-economics workflows, and the specific failure modes deal teams hit when they over-rely on machine outputs.
Why traditional CDD breaks at mid-market velocity
The classic CDD deliverable — a 120-slide deck with a market map, growth drivers, customer references, and a competitive landscape — was designed for $500M+ enterprise value deals where a six-week sprint is proportionate to the check size. In sub-$200M deals, which now represent the majority of US sponsor activity, that timeline is incompatible with auction dynamics. Sellers routinely demand final bids within 4-5 weeks of CIM release. Sponsors that wait for a finished consulting report either submit on instinct or lose the deal.
The deeper problem is that consultant CDDs rely on three primary inputs: (1) 15-25 expert calls via GLG, Guidepoint, or Third Bridge, (2) syndicated reports from IBISWorld, Frost & Sullivan, or Gartner, and (3) a customer reference list provided by the seller. Each input is biased. Expert networks oversample former employees willing to talk for $400/hour. Syndicated reports recycle 18-month-old data. Seller-provided references are curated. The output reflects the inputs — a directionally correct but rarely surprising market view.
| Dimension | Traditional CDD (Bain/LEK/OC&C) | Automation-led CDD |
|---|---|---|
| Duration | 6-10 weeks | 2-3 weeks |
| Cost per deal | $400K-$1.2M | $60K-$250K (incl. data subscriptions) |
| Expert calls | 15-25 (network-curated) | 8-12 (targeted, post-data hypothesis) |
| Market sizing approach | Top-down from syndicated reports | Bottom-up from panel and transaction data |
| Customer voice | Seller-curated reference list | NPS, G2 reviews, churn proxies from panel |
| Cohort economics | Provided by management | Reconstructed from card-panel or app data |
| Updateability | Static deck | Live dashboard, refreshed weekly |
The automated CDD data stack
A working automation stack has five layers. The first is transcript and document intelligence: AlphaSense (which absorbed Sentieo and runs the Stream expert-transcript library) and Tegus together index over 150,000 expert call transcripts, every public 10-K and earnings call, and a growing library of private-company expert calls. A deal team running a B2B SaaS thesis can query 'NetSuite displacement in mid-market' and surface 200+ transcripts in under a minute. Tegus alone added roughly 35,000 transcripts in 2024 across vertical software, healthcare services, and industrials.
The second layer is web and behavioral telemetry. SimilarWeb, Semrush, and Sensor Tower (for mobile) provide traffic, keyword rank, and download share data that proxies for customer acquisition momentum. Bombora and 6sense surface buyer-intent signals — which companies are researching the target's category. For a recent vertical-SaaS deal in field service management, a SimilarWeb pull showed the target was losing organic share to two competitors at roughly 200 bps per quarter, a trend not visible in the CIM's revenue growth chart because the company was offsetting with paid acquisition at rising CAC.
The third layer is transaction panel data — Earnest Analytics, Facteus, Bloomberg Second Measure, YipitData, and Consumer Edge. These providers aggregate de-identified card and bank-transaction data covering 3-10 million US consumers (panels vary by provider) and increasingly merchant-side data for B2B. For consumer and consumer-adjacent deals, panel data lets a deal team reconstruct cohort retention, average ticket, frequency, and share-of-wallet without any management cooperation. Placer.ai and SafeGraph add foot-traffic data for physical retail and services.
The fourth layer is review and employer signal data — G2, Capterra, TrustRadius, Gartner Peer Insights, Glassdoor, and Comparably. Review velocity, sentiment delta, and competitive review-stealing are leading indicators of NPS movement. The fifth layer is firmographic and contact data — ZoomInfo, Apollo, Crunchbase, PitchBook, and LinkedIn Sales Navigator — used both for bottom-up TAM construction and for sourcing outreach to non-curated customer references.
Bottom-up TAM/SAM construction
The single most over-cited number in any CIM is TAM. Sellers default to top-down sizing — 'the global field service management market is $5.7 billion growing at 11%' — sourced from a syndicated report that aggregated assumptions from other syndicated reports. Bottom-up sizing flips this. The deal team starts with a unit-of-account definition (e.g., 'US HVAC contractors with 5-50 technicians'), pulls the universe from ZoomInfo or D&B Hoovers (typically 38,000-42,000 firms in this example), applies an addressable filter (uses dispatch software: roughly 55% based on Bombora and survey data), applies an ACV band ($4,800-$18,000 from competitor pricing pages and G2 reviews), and arrives at a defensible SAM of $180-310 million.
The output is not just a number — it's a segment map. A bottom-up build reveals which 8,000 of the 42,000 accounts are concentrated in three states with high HVAC density, which 1,200 are already on a competitor (and therefore replacement opportunities rather than greenfield), and which 6,000 fall outside the target's typical deal motion. This level of segmentation rarely appears in a consultant CDD deck because it requires structured firmographic data the consultants don't license. It's also the input the post-close 100-day plan needs — see Article 5 on integration planning.
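The funnel arithmetic behind a bottom-up build is simple enough to script. The sketch below uses the HVAC figures quoted above; the `Range` helper and all variable names are illustrative assumptions, not part of any vendor's tooling. Multiplying the extremes of each input gives a wide outer envelope of roughly $100M to $416M, so the narrower $180-310 million SAM presumably reflects a blended ACV inside the band rather than its endpoints. The last lines back out that implied ACV.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Range:
    """A low/high pair carried through the sizing funnel."""
    low: float
    high: float

    def scale(self, factor: float) -> "Range":
        return Range(self.low * factor, self.high * factor)

    def times(self, other: "Range") -> "Range":
        return Range(self.low * other.low, self.high * other.high)


# Inputs taken from the HVAC contractor example in the text.
universe = Range(38_000, 42_000)   # firms in the firmographic pull
addressable_share = 0.55           # share already using dispatch software
acv_band = Range(4_800, 18_000)    # annual contract value, USD

addressable_accounts = universe.scale(addressable_share)
sam_envelope = addressable_accounts.times(acv_band)

# Back out the blended ACV implied by the $180-310M SAM quoted in the text.
quoted_sam = Range(180e6, 310e6)
implied_acv = Range(
    quoted_sam.low / addressable_accounts.low,
    quoted_sam.high / addressable_accounts.high,
)

print(f"Addressable accounts: {addressable_accounts.low:,.0f} to {addressable_accounts.high:,.0f}")
print(f"Outer SAM envelope:   ${sam_envelope.low / 1e6:,.0f}M to ${sam_envelope.high / 1e6:,.0f}M")
print(f"Implied blended ACV:  ${implied_acv.low:,.0f} to ${implied_acv.high:,.0f}")
```

In practice the same structure would run per segment (state, technician count, incumbent vendor), which is what produces the segment map rather than a single number.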
Unit economics reconstruction without management cooperation
The hardest part of CDD is validating the unit economics the management team presents. CIMs typically include a 'cohort retention' chart, an LTV/CAC ratio, and a payback period. All three are easy to manipulate. Cohort charts can be smoothed by excluding the most recent (worst-retaining) cohorts under the rationale that they are 'immature.' LTV calculations conveniently assume gross-margin expansion. CAC excludes founder time and a portion of paid acquisition spend.
Automated CDD reconstructs these metrics from external data wherever possible. For consumer subscription businesses, Earnest Analytics or Second Measure panel data can produce 12-, 24-, and 36-month dollar retention curves at the SKU level. We ran this on a meal-kit deal in 2024 where management showed 78% 12-month dollar retention; the Earnest panel reconstruction showed 61%, and the cohort decay had accelerated in the prior six months. The bid was repriced down by 1.4 turns of EBITDA.
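As a minimal sketch of what a panel-based reconstruction involves, the code below computes dollar retention by acquisition cohort from raw transaction rows. The row layout `(customer_id, month_index, spend_usd)` and the sample data are hypothetical; real panel feeds have their own schemas and need reweighting for panel bias, which this sketch ignores.

```python
from collections import defaultdict

# Hypothetical de-identified panel rows: (customer_id, month_index, spend_usd).
transactions = [
    ("a", 0, 60.0), ("a", 1, 60.0), ("a", 12, 60.0),
    ("b", 0, 80.0), ("b", 1, 40.0),
    ("c", 1, 50.0), ("c", 13, 25.0),
]


def dollar_retention(rows, horizon_months):
    """Return {cohort_month: spend at the horizon / spend in the first month}."""
    # A customer's cohort is the first month in which they appear in the panel.
    first_month = {}
    for customer, month, _ in rows:
        first_month[customer] = min(month, first_month.get(customer, month))

    base = defaultdict(float)      # cohort spend in the acquisition month
    retained = defaultdict(float)  # cohort spend exactly `horizon_months` later
    for customer, month, spend in rows:
        cohort = first_month[customer]
        age = month - cohort
        if age == 0:
            base[cohort] += spend
        elif age == horizon_months:
            retained[cohort] += spend

    return {cohort: retained[cohort] / base[cohort] for cohort in sorted(base)}


retention_12m = dollar_retention(transactions, horizon_months=12)
for cohort, ratio in retention_12m.items():
    print(f"cohort {cohort}: {ratio:.0%} 12-month dollar retention")
```

Running the same function at 24 and 36 months, and comparing recent cohorts against older ones, is what exposes the kind of accelerating decay described in the meal-kit example.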
For B2B SaaS, transcript libraries and LinkedIn departure data substitute for direct cohort access. Tracking customer logos from case studies and press releases against current customer lists (often visible on a target's website or scrapeable from G2) lets you compute logo retention. A clean rule of thumb: if 30%+ of named case-study customers from 24+ months ago no longer appear in current marketing materials, annualized gross logo retention is likely below 85%. Combine this with employee LinkedIn departures from named customer accounts as a leading indicator of churn risk.
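The rule of thumb follows from annualizing the logo survival rate: losing 30% of logos over two years corresponds to roughly 84% retention per year. The sketch below makes that step explicit. It assumes a constant annual churn rate and treats disappearance from marketing materials as churn, both of which are simplifications; the function name and sample logo lists are hypothetical.

```python
def annualized_logo_retention(historical_logos, current_logos, years_elapsed):
    """Annualize the share of historical logos still visible today."""
    historical = {name.strip().lower() for name in historical_logos}
    current = {name.strip().lower() for name in current_logos}
    if not historical:
        raise ValueError("need at least one historical logo")
    surviving_share = len(historical & current) / len(historical)
    # Constant-churn assumption: surviving_share = annual_retention ** years.
    return surviving_share ** (1 / years_elapsed)


# Hypothetical example: 10 case-study logos from two years ago, 7 still listed.
logos_two_years_ago = [f"Customer {i}" for i in range(10)]
logos_today = logos_two_years_ago[:7] + ["New Logo"]

retention = annualized_logo_retention(logos_two_years_ago, logos_today, years_elapsed=2)
print(f"Annualized gross logo retention: {retention:.1%}")  # about 84%
```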
Voice of customer at scale
Traditional CDD includes 8-15 customer reference calls, all curated by the seller. Automated workflows expand this in two directions. First, structured review mining: pulling every G2, Capterra, and TrustRadius review for the target and its top five competitors, then running sentiment and topic extraction. A typical mid-market SaaS target has 200-800 reviews; the top three competitors combined will have 2,000-6,000. LLM-based topic clustering surfaces the specific feature gaps, pricing complaints, and switching triggers that drive churn — at a level of granularity 12 reference calls cannot produce.
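To illustrate the shape of the review-mining output, the sketch below tags reviews with topics and compares topic share across vendors. It deliberately swaps the LLM-based clustering described above for a plain keyword lexicon so that it runs without any external service; the `TOPIC_KEYWORDS` lexicon and the sample reviews are hypothetical.

```python
import re
from collections import Counter

# Hypothetical keyword lexicon standing in for LLM-derived topic clusters.
TOPIC_KEYWORDS = {
    "pricing": {"price", "pricing", "expensive", "renewal"},
    "integration": {"integration", "integrations", "api", "sync"},
    "support": {"support", "ticket", "tickets"},
}


def tag_topics(review_text):
    """Return the set of topics whose keywords appear as whole words."""
    tokens = set(re.findall(r"[a-z]+", review_text.lower()))
    return {topic for topic, words in TOPIC_KEYWORDS.items() if tokens & words}


def topic_share_by_vendor(reviews):
    """reviews: iterable of (vendor, text). Returns {vendor: {topic: share}}."""
    totals = Counter()
    counts = {}
    for vendor, text in reviews:
        totals[vendor] += 1
        counts.setdefault(vendor, Counter()).update(tag_topics(text))
    return {
        vendor: {topic: n / totals[vendor] for topic, n in topic_counts.items()}
        for vendor, topic_counts in counts.items()
    }


sample_reviews = [
    ("Target", "Pricing jumped at renewal and support was slow"),
    ("Target", "Great scheduling, but the QuickBooks integration breaks weekly"),
    ("Rival", "Too expensive for a five-truck shop"),
]
shares = topic_share_by_vendor(sample_reviews)
print(shares)
```

The useful comparison is relative: a topic that appears in a much larger share of the target's reviews than in competitors' reviews is a candidate churn driver to test in off-list calls.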
Second, off-list outreach: using LinkedIn Sales Navigator and Apollo to identify former customers (people who held buyer-side roles at named customers but have since changed jobs) and current non-customers in the ICP. A two-week outreach campaign typically yields 15-25 unscripted conversations, half of which are with churned customers or evaluators who chose a competitor. The win/loss intelligence from these is materially more useful than seller-curated love letters.
Competitive intelligence and pricing benchmarks
Competitive maps in consulting CDDs are usually 2x2 grids with vendor logos placed based on management's view. Automated competitive intelligence is built from public artifacts: pricing pages (scraped weekly, with diff tracking), product release notes, hiring patterns from LinkedIn, podcast and conference appearances, and patent filings. A weekly diff of competitor pricing pages over a 12-month period frequently reveals strategic shifts — a competitor moving from per-seat to consumption pricing, dropping a free tier, or introducing an enterprise SKU — that signal where the market is heading.
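A minimal sketch of the diff-tracking step, assuming the scraper has already reduced each weekly pricing page to plain text. It uses Python's standard `difflib.ndiff`; the snapshot contents and the `pricing_diff` helper name are hypothetical.

```python
import difflib


def pricing_diff(previous_snapshot, current_snapshot):
    """Return lines added to and removed from a pricing page between snapshots."""
    changes = {"added": [], "removed": []}
    delta = difflib.ndiff(previous_snapshot.splitlines(), current_snapshot.splitlines())
    for line in delta:
        if line.startswith("+ "):
            changes["added"].append(line[2:])
        elif line.startswith("- "):
            changes["removed"].append(line[2:])
        # Lines starting with "  " (unchanged) or "? " (hints) are ignored.
    return changes


# Hypothetical snapshots of one competitor's pricing page, one week apart.
last_week = "\n".join([
    "Starter: $49 per seat / month",
    "Free tier: up to 3 users",
    "Pro: $99 per seat / month",
])
this_week = "\n".join([
    "Starter: $49 per seat / month",
    "Pro: $0.02 per job dispatched",
    "Enterprise: contact sales",
])

changes = pricing_diff(last_week, this_week)
print("Removed:", changes["removed"])
print("Added:  ", changes["added"])
```

Storing each week's `changes` alongside the snapshot date gives the 12-month change log from which shifts such as a dropped free tier or a move to consumption pricing can be read off.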
Hiring data is particularly underused. A competitor opening 12 enterprise AE roles in the Northeast while the target's hiring is flat is a leading indicator of market-share movement that won't show in revenue for 6-9 months. LinkedIn Talent Insights, Revelio Labs, and Live Data Technologies (now part of Datos) provide structured hiring and attrition feeds that PE deal teams can correlate to revenue trajectories.
The end-to-end workflow: two weeks instead of eight
1. Deal team translates investment thesis into 8-12 testable claims. Each claim is mapped to specific data sources and falsification criteria before any work begins.
2. Firmographic universe pulled, segmented, intent-filtered. Competitor pricing pages scraped, ACV bands triangulated. Output: defensible SAM with segment map.
3. Panel data licensed for target and 3-5 competitors. Cohort curves, ticket size, frequency reconstructed and compared to management figures. Variance flagged.
4. Review mining across G2, Capterra, TrustRadius, Gartner Peer Insights. Off-list outreach launched via Apollo to former buyers and churned customers.
5. 8-12 expert calls via Tegus/GLG focused on hypotheses not resolvable from data alone. Each call is hypothesis-led, not exploratory.
6. Findings synthesized into IC memo with live dashboards. Each claim links to source data. Outputs handed off to <a href="/in-focus/value-creation-through-technology-pe-portfolio-operating-model/financial-due-diligence-automated-quality-of-earnings-analysis">QoE workstream</a> and value-creation planning.
The sequencing matters. Teams that start with expert calls before data work end up with directional anchors that bias the data interpretation. Teams that start with data and use expert calls only to resolve specific unresolved questions get sharper outputs in less time. The 8-12 hypothesis-led expert calls in days 9-12 are typically more productive than the 20-25 exploratory calls in a traditional CDD because each call has a defined question.
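The first step of the workflow, mapping each claim to data sources and a falsification criterion before any work begins, can be captured in a small claims register. The sketch below is one possible structure, not a description of any firm's actual template; the `Claim` fields and the 70% threshold are hypothetical, while the 78% and 61% retention figures come from the meal-kit example earlier in this article.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Claim:
    """One testable claim from the investment thesis."""
    statement: str
    sources: List[str]
    management_value: float
    falsified_if_below: float          # threshold agreed before data work starts
    observed_value: Optional[float] = None

    def status(self) -> str:
        if self.observed_value is None:
            return "open"
        if self.observed_value < self.falsified_if_below:
            return "falsified"
        return "supported"

    def variance(self) -> Optional[float]:
        """Observed minus management figure; None until data is in."""
        if self.observed_value is None:
            return None
        return self.observed_value - self.management_value


retention_claim = Claim(
    statement="12-month dollar retention is at least 70%",
    sources=["card panel", "management cohort chart"],
    management_value=0.78,
    falsified_if_below=0.70,   # hypothetical threshold for illustration
)
print(retention_claim.status())          # open

retention_claim.observed_value = 0.61    # panel reconstruction
print(retention_claim.status(), f"{retention_claim.variance():+.2f}")
```

Claims still marked "open" after the data workstreams are the natural agenda for the hypothesis-led expert calls.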
Cost and ROI of an in-house CDD platform
A PE firm running 60-100 active diligence processes per year can justify an internal CDD platform on cost alone. Annual data subscriptions for a full stack (AlphaSense, Tegus, SimilarWeb, ZoomInfo, G2 data, one panel provider, one workforce data provider) run $800K-$1.4M. A team of 4-6 people (a data lead, two analysts, a data engineer, and a workflow owner) costs $1.5-2.2M loaded. Total annual cost: $2.3-3.6M.
Against this, a firm previously spending $25-40M on external CDD that shifts 60-70% of work in-house saves $12-20M per year net of platform costs. Equally important, the diligence outputs become reusable — the TAM build for the deal you walked away from becomes the starting point for the next thesis in the same vertical, and a feed into add-on identification for existing portfolio companies.
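The arithmetic can be checked with a few lines. This is a simple sketch that assumes every dollar of work shifted in-house removes a dollar of external spend, which the article does not state; variable names are illustrative. Under that assumption the net saving spans roughly $11M to $26M, a range that brackets the $12-20M figure, so the narrower number presumably reflects residual costs not itemized here.

```python
# All figures in USD millions, taken from the two paragraphs above.
subscriptions = (0.8, 1.4)     # annual data subscriptions, low/high
team = (1.5, 2.2)              # loaded cost of a 4-6 person team, low/high
external_spend = (25.0, 40.0)  # prior annual external CDD spend, low/high
share_in_house = (0.60, 0.70)  # share of work shifted in-house, low/high

platform_cost = (subscriptions[0] + team[0], subscriptions[1] + team[1])

# Conservative case: least spend displaced, most expensive platform.
net_saving_low = external_spend[0] * share_in_house[0] - platform_cost[1]
# Optimistic case: most spend displaced, cheapest platform.
net_saving_high = external_spend[1] * share_in_house[1] - platform_cost[0]

print(f"Platform cost: ${platform_cost[0]:.1f}M to ${platform_cost[1]:.1f}M")
print(f"Net saving:    ${net_saving_low:.1f}M to ${net_saving_high:.1f}M")
```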
Where automation still fails
Automation is weakest on three dimensions. First, qualitative regulatory and reimbursement nuance — particularly in healthcare services, where a 10-minute call with a former CMS policy lead can surface risks no data source captures. Second, channel and distribution complexity — multi-tier industrial distribution, GPO dynamics in healthcare, or dealer-network economics in automotive aftermarket rarely have clean external data signatures. Third, technology depth — assessing whether a target's tech stack will scale or collapse under integration load is a separate workstream covered in Article 4 on technology due diligence.
"The point of automation isn't to eliminate consultants. It's to stop paying consultants to do bottom-up TAM builds and cohort math that data engineers can do in three days for 10% of the cost."
— Head of Portfolio Operations, $18B AUM PE firm
Building the operating model
Firms that have made this transition successfully — Vista Equity Partners, Thoma Bravo, TA Associates, Bain Capital Tech, Hg, and several large credit-platform sponsors — share a common structure. A central data and insights team of 6-15 people sits between deal teams and external vendors. Deal teams brief the central team at IOI stage with a structured hypothesis template. The central team executes the data workstreams in parallel with the deal team's commercial work. External consultants are engaged selectively for the 15-25% of analysis that requires deep sector or regulatory expertise.
The cultural shift is harder than the technical one. Deal partners trained on Bain decks instinctively trust a 120-slide consulting output more than a 30-page memo with live dashboards, even when the memo contains better evidence. Firms that have made this work invest deliberately in IC training, require deal teams to defend every claim against source data, and increasingly tie a portion of partner compensation to post-close revenue and margin variance against CDD projections. Over a 24-month window, that variance is the only honest scorecard for CDD quality — automated or otherwise.