When a single Tuesday print of a credit card panel can move a $40B consumer name 3-5% intraday, the question for a CIO is no longer whether alternative data belongs in the stack — it is whether the pipeline can ingest, normalize, and signal-score the feed before the rest of the street catches up. Hedge fund spend on alternative data reached roughly $4.5B globally in 2025 and is on track to cross $5B in 2026, according to AlternativeData.org and Neudata tracking. The marginal dollar is no longer buying the data itself; it is buying the engineering needed to make the data usable inside a backtest, a live portfolio, and a Form PF disclosure.
This article walks through the four dominant alt data families — transaction (credit/debit card), geo-location and foot traffic, satellite and geospatial imagery, and text-based sentiment — and the reference architecture that converts them into tradable signals. It assumes the modular foundation described in From Monolith to Modular is in place, and it sets up the simulation and validation infrastructure covered in Backtesting at Scale.
The Four Pillars and What They Actually Predict
Each alt data family has a distinct latency profile, coverage bias, and decay curve. Treating them as interchangeable inputs to a generic feature store is the most common architectural mistake we see in implementations. A credit card panel from Facteus or YipitData arrives T+2 to T+5 with merchant-level resolution but skews toward debit and prepaid cards. Second Measure (now Bloomberg Second Measure) leans more heavily on credit. Earnest Analytics weights toward higher-income consumers. The same ticker — say Chipotle (CMG) — can show a 200-300 bps difference in same-store sales nowcast across panels in a given quarter, and the spread itself is signal.
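The cross-panel spread can be tracked as a feature in its own right. The following is a minimal sketch, assuming each panel produces a same-store-sales growth nowcast in basis points for one ticker-quarter. The function name, panel labels, and figures are hypothetical illustrations, not vendor data.

```python
from statistics import mean

def panel_spread_bps(nowcasts_bps: dict[str, float]) -> dict[str, float]:
    """Summarize agreement across card panels for one ticker-quarter."""
    if len(nowcasts_bps) < 2:
        raise ValueError("need at least two panels to measure a spread")
    values = list(nowcasts_bps.values())
    return {
        "consensus_bps": mean(values),              # simple average across panels
        "spread_bps": max(values) - min(values),    # widest disagreement
    }

# Hypothetical YoY same-store-sales nowcasts, in basis points.
cmg = {"panel_a": 620.0, "panel_b": 410.0, "panel_c": 500.0}
summary = panel_spread_bps(cmg)
```

A wide spread relative to its own history may flag quarters where panel bias, rather than fundamentals, is driving the nowcast.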
Geo-location data from SafeGraph, Advan, Veraset, and Placer.ai uses mobile SDK ping data to estimate foot traffic at brand-store-day resolution. Post-Apple ATT (April 2021) and the deprecation of mobile ad IDs, panel sizes have compressed roughly 30-40%, and panel stability has become a first-order concern. A traffic 'decline' at a retailer can be a panel artifact rather than a fundamental change. Survivor-bias correction and panel re-weighting now consume more pipeline engineering time than the raw feature extraction.
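One way to separate a panel artifact from a real traffic change is to express visits per active panel device rather than as raw counts. This is a simplified sketch under that assumption; production re-weighting would also control for geography and demographics, and all numbers below are made up.

```python
def normalized_visit_rate(raw_visits: int, active_devices: int) -> float:
    """Visits per 1,000 active panel devices."""
    if active_devices <= 0:
        raise ValueError("active_devices must be positive")
    return 1000.0 * raw_visits / active_devices

# Hypothetical case: the panel shrinks 35% and raw visits fall 35% with it.
before = normalized_visit_rate(raw_visits=2000, active_devices=1_000_000)
after = normalized_visit_rate(raw_visits=1300, active_devices=650_000)

raw_change = 1300 / 2000 - 1             # looks like a 35% traffic decline
normalized_change = after / before - 1   # flat once panel size is controlled for
```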
Satellite imagery from Planet Labs (PlanetScope, ~3 m resolution, daily revisit of the full landmass), Maxar (~30 cm resolution for tasked collects), and Orbital Insight or RS Metrics for derived products yields counts of cars in parking lots, oil volumes in floating-roof tanks, container throughput at ports, and crop yield estimates. The economics have shifted: a year of derived parking-lot counts for the top 50 US retailers runs $150K-$400K, versus $1.5M-$3M to build the computer vision pipeline in-house from raw tiles.
Sentiment and text data (RavenPack, Bloomberg, Refinitiv MarketPsych, Accern, AlphaSense) overlap heavily with the techniques covered in NLP on Earnings Calls and 10-Ks. The distinction here is breadth: a sentiment pipeline at the alt data layer pulls news wires, social media (including Reddit), regulatory filings, Glassdoor reviews, and patent filings into a unified entity-mapped feed, typically with 50-200 ms latency from publication to scored signal.
| Data Family | Typical Latency | Annual Cost (Mid-Tier Fund) | Primary Decay Risk |
|---|---|---|---|
| Credit/Debit Card Panels | T+2 to T+7 | $300K-$1.5M per provider | Panel composition drift, issuer loss |
| Geo-Location / Foot Traffic | T+1 to T+3 | $150K-$600K | SDK deprecation, ATT/GDPR shrinkage |
| Satellite Imagery (derived) | T+0 to T+2 (weather-dependent) | $200K-$800K | Cloud cover, commoditization |
| Sentiment / News / Social | 50 ms to 5 s | $100K-$500K | Model overfit, narrative crowding |
Reference Architecture: From Vendor Drop to Live Signal
A production alt data pipeline at a mid-to-large hedge fund typically organizes into five layers: ingestion, normalization, entity resolution, feature engineering, and signal delivery. The reference stack we deploy uses S3 or GCS as the landing zone, Apache Airflow or Dagster for orchestration, a Delta Lake or Apache Iceberg table format for versioning, Spark or Polars for transformation, and Snowflake or Databricks SQL for analyst-facing access. The bill of materials matters less than the discipline around three things: schema contracts with vendors, point-in-time correctness, and entity mapping.
Ingestion fails more often than people admit. Roughly 15-25% of vendor deliveries in any given month arrive late, malformed, or with silent schema drift — a column renamed, a currency unit changed, a date format flipped from ISO to US. Pipelines that catch this at ingest (via Great Expectations, Soda, or Monte Carlo data observability) avoid the worst failure mode in alt data: a quietly broken feature that contaminates a live signal for three weeks before anyone notices the Sharpe collapse.
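The ingest-time check can be illustrated without any particular observability tool. This sketch assumes a hypothetical three-column schema contract (`merchant_id`, `txn_date`, `spend_usd`). It flags renamed columns and non-ISO dates, which are two of the drift types described above. It is not the API of Great Expectations, Soda, or Monte Carlo.

```python
from datetime import date

# Hypothetical schema contract: column name -> parser that raises on bad input.
CONTRACT = {
    "merchant_id": str,
    "txn_date": date.fromisoformat,   # rejects US-style dates such as 03/14/2025
    "spend_usd": float,
}

def validate_row(row: dict[str, str]) -> list[str]:
    """Return a list of contract violations for one delivered row."""
    problems = []
    missing = CONTRACT.keys() - row.keys()
    extra = row.keys() - CONTRACT.keys()
    problems += [f"missing column: {c}" for c in sorted(missing)]
    problems += [f"unexpected column: {c}" for c in sorted(extra)]
    for col, parse in CONTRACT.items():
        if col in row:
            try:
                parse(row[col])
            except ValueError:
                problems.append(f"bad value in {col}: {row[col]!r}")
    return problems

good = {"merchant_id": "M1", "txn_date": "2025-03-14", "spend_usd": "12.50"}
renamed = {"merchant_id": "M1", "txn_dt": "2025-03-14", "spend_usd": "12.50"}
us_date = {"merchant_id": "M1", "txn_date": "03/14/2025", "spend_usd": "12.50"}
```

Quarantining a delivery whenever `validate_row` returns a non-empty list keeps a renamed column from silently becoming a null feature downstream.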
Entity resolution is where most in-house pipelines stall. A credit card transaction at 'CMG #2847 DENVER' must map to Chipotle Mexican Grill (CMG US) at the security level, to its parent entity, to its sector, and to its geographic exposure bucket. Identifier services like FactSet Concordance, OpenFIGI, and PermID handle the security side; the mapping from raw merchant descriptor to listed company is the hard part and is often vendor-provided but worth auditing. We have seen mapping error rates of 3-8% on long-tail merchants in raw card panels, which translates directly into noise in same-store-sales nowcasts.
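A toy version of the merchant-mapping step, together with the audit metric, might look like the following. The alias table, the prefix-matching rule, and the sample descriptors are all hypothetical. Real mappings need fuzzy matching, parent/subsidiary hierarchies, and point-in-time ticker history.

```python
import re
from typing import Optional

# Hypothetical alias table: leading descriptor token -> (ticker, company name).
ALIASES = {
    "CMG": ("CMG US", "Chipotle Mexican Grill"),
    "CHIPOTLE": ("CMG US", "Chipotle Mexican Grill"),
}

def resolve_merchant(descriptor: str) -> Optional[tuple[str, str]]:
    """Map a raw card descriptor to a security, or None if unmapped."""
    # Drop store numbers such as '#2847', then split on whitespace.
    cleaned = re.sub(r"#\d+", " ", descriptor.upper())
    tokens = cleaned.split()
    if not tokens:
        return None
    return ALIASES.get(tokens[0])

def unmapped_rate(descriptors: list[str]) -> float:
    """Share of descriptors that fail to resolve; a simple audit metric."""
    if not descriptors:
        return 0.0
    misses = sum(1 for d in descriptors if resolve_merchant(d) is None)
    return misses / len(descriptors)

sample = ["CMG #2847 DENVER", "CHIPOTLE 0412 AUSTIN", "JOES TACOS 12", "cmg #19 nyc"]
```

Tracking `unmapped_rate` per delivery gives an early warning when a vendor changes descriptor formatting.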
Feature engineering for alt data is dominated by panel normalization. The raw daily spend at a retailer is nearly useless; what matters is the year-over-year growth rate of same-panel, same-store spend, adjusted for panel composition changes, seasonality, calendar effects (Easter shift, leap years, fiscal calendar misalignment), and known one-offs (hurricanes, store closures, promotional events). A well-engineered nowcast typically explains 60-75% of variance in reported quarterly same-store sales for consumer discretionary names — enough to make the residual the actual tradable signal.
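The same-panel part of that normalization can be sketched in a few lines. This assumes spend is keyed by a hypothetical cardholder ID. It deliberately omits the same-store filter, seasonality, and calendar adjustments described above.

```python
def same_panel_yoy(spend_now: dict[str, float], spend_prior: dict[str, float]) -> float:
    """YoY spend growth using only cardholders present in both periods."""
    common = spend_now.keys() & spend_prior.keys()
    prior_total = sum(spend_prior[c] for c in common)
    if prior_total <= 0:
        raise ValueError("no overlapping panel spend in the prior period")
    now_total = sum(spend_now[c] for c in common)
    return now_total / prior_total - 1.0

# Hypothetical panel: cardholder 'c' joined this year and inflates the naive figure.
prior = {"a": 100.0, "b": 100.0}
now = {"a": 110.0, "b": 110.0, "c": 80.0}

naive_growth = sum(now.values()) / sum(prior.values()) - 1.0   # 50%: panel growth, not sales
panel_growth = same_panel_yoy(now, prior)                      # 10%: like-for-like
```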
The Vendor Economics and Build-vs-Buy Math
A typical mid-sized systematic fund running $2-5B in equity strategies will spend $3M-$8M annually on alt data subscriptions across 15-30 vendors, plus another $2M-$4M on the engineering team (5-12 FTEs covering data engineering, quant research support, and vendor management). The ratio of data spend to engineering spend is informative: over 3:1 usually means the fund is buying data it cannot operationalize and leaving signal on the table by under-investing in the pipeline; under 1:1 usually means the engineering build-out has outrun the data it has to work with.
On build-vs-buy, the rule of thumb that has held up across implementations: buy the raw or lightly-processed feed, build the proprietary feature transformations. Paying RavenPack for entity-tagged news at $200K-$400K per year is dramatically cheaper than building NLP infrastructure to scrape, dedupe, and tag global wires. But buying a pre-built 'retail same-store-sales signal' for $500K is paying for a feature that 40 other funds also bought — the alpha has been arbitraged before you sign the contract.
Buy the raw feed. Build the feature. Pay for exclusivity only when you can prove the alpha survives the cost.
— Engagement principle, alt data infrastructure builds
Exclusivity deals — where a fund pays $1M-$5M for a 6-12 month exclusive window on a new dataset — have grown roughly 3x in volume between 2022 and 2025 based on BattleFin and Eagle Alpha discovery platform data. The economics work only if the fund can deploy the data in production within 4-8 weeks of signing. We have audited exclusivity contracts where the fund had not finished onboarding the data when the exclusivity window expired. The contract value in that case is zero.
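The arithmetic behind that warning is simple enough to write down. A rough sketch, assuming the contract value is spread evenly over the weeks in which the data is actually live; the function names and dollar figures are illustrative.

```python
def usable_exclusive_weeks(window_weeks: float, onboarding_weeks: float) -> float:
    """Weeks of exclusivity left once the data is in production."""
    return max(0.0, window_weeks - onboarding_weeks)

def cost_per_usable_week(contract_usd: float, window_weeks: float,
                         onboarding_weeks: float) -> float:
    """Effective weekly price of the exclusive window; infinite if none remains."""
    usable = usable_exclusive_weeks(window_weeks, onboarding_weeks)
    return float("inf") if usable == 0 else contract_usd / usable

# Hypothetical $2M deal with a 26-week exclusive window.
fast = cost_per_usable_week(2_000_000, window_weeks=26, onboarding_weeks=6)
slow = cost_per_usable_week(2_000_000, window_weeks=26, onboarding_weeks=30)
```

Here `fast` works out to $100,000 per usable week, while `slow` is infinite because onboarding outlasts the window.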
MNPI, Web Scraping, and the Regulatory Perimeter
The SEC's 2021 settlement with App Annie (now data.ai) for $10M established that alt data vendors can themselves face securities fraud charges, and that funds consuming the data bear diligence responsibility. The 2024 SEC enforcement priorities, restated in the 2025 examination letter, list alternative data sourcing under 'information barriers and MNPI controls.' For European funds, the MAR (Market Abuse Regulation) Article 7 definition of inside information applies, and GDPR Article 6 requires a lawful basis for processing personal data, which is relevant when geo-location feeds derive from individual device pings.
The hiQ ruling (hiQ Labs v. LinkedIn, ultimately settled in 2022) left web scraping in a gray zone where public data is scrapable but ToS violations can support breach-of-contract claims. Funds that scrape directly are increasingly rare; the legal risk has shifted to vendors who scrape on behalf of clients. The vendor contract should contain explicit reps that data was collected lawfully, with consent where required, and that the fund will not be a joint tortfeasor in any future claim. Compliance teams should map every dataset to a risk tier and an information barrier policy, the same way they handle expert network calls.
Beyond MNPI, the operational compliance load includes Form PF Section 2 (large hedge fund advisers must report on data sources used in investment decisions in qualitative terms), AIFMD Annex IV (for EU funds, risk reporting that increasingly references data inputs), and the CFTC's evolving stance on alt data in commodity strategies. These tie into the broader regulatory automation discussed in Regulatory Reporting.
Signal Validation and Alpha Decay
The alt data graveyard is full of backtests that worked beautifully out-of-sample on paper and then delivered 30 bps of negative alpha live. Three failure modes dominate. First, look-ahead bias from non-point-in-time data, which bitemporal storage (recording both the period a value describes and the date it became known) fixes. Second, survivorship bias in the panel itself: the merchants in a card panel today are not the merchants of three years ago, and naive analysis treats the historical panel as if it had the current composition. Third, crowding: when a signal becomes widely known, the alpha decays into transaction costs.
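The point-in-time idea behind bitemporal storage fits in a short example. This sketch uses an in-memory list in place of a versioned table; the `Observation` type, field names, and values are hypothetical.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

@dataclass(frozen=True)
class Observation:
    event_date: date   # period the value describes
    known_date: date   # date this version was delivered by the vendor
    value: float

def as_of(history: list[Observation], event_date: date,
          as_of_date: date) -> Optional[float]:
    """Latest version of event_date's value that was known on as_of_date."""
    visible = [o for o in history
               if o.event_date == event_date and o.known_date <= as_of_date]
    if not visible:
        return None
    return max(visible, key=lambda o: o.known_date).value

# Hypothetical: December spend first delivered on Jan 5, restated on Feb 10.
dec = date(2024, 12, 31)
history = [
    Observation(dec, date(2025, 1, 5), 100.0),
    Observation(dec, date(2025, 2, 10), 112.0),
]
```

A backtest simulating a trade on January 20 should call `as_of(history, dec, date(2025, 1, 20))` and see the original figure, not the restated one.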
Half-life measurement should be standard practice. A signal with a 24-month rolling Sharpe that drops from 1.8 to 0.6 over 18 months is not 'underperforming' — it is decaying, and the question is whether to retire it or transform it (combine with other signals, change the holding horizon, restrict to a less crowded universe). Funds that monitor alpha decay quantitatively, with automated alerts when rolling Sharpe drops below 50% of the in-sample peak, retire roughly 25-35% of alt data signals each year and replace them with new combinations.
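A decay alert of the kind described can be sketched as follows, assuming monthly returns, a 24-month window, and the 50% threshold from the text. The function names are illustrative, and the Sharpe calculation ignores the risk-free rate for brevity.

```python
from math import sqrt
from statistics import mean, stdev

def annualized_sharpe(returns: list[float], periods_per_year: int = 12) -> float:
    """Annualized Sharpe ratio of periodic returns (risk-free rate ignored)."""
    sd = stdev(returns)
    if sd == 0:
        raise ValueError("returns have zero variance")
    return mean(returns) / sd * sqrt(periods_per_year)

def rolling_sharpe(returns: list[float], window: int = 24,
                   periods_per_year: int = 12) -> list[float]:
    """Sharpe over each trailing window, oldest window first."""
    return [annualized_sharpe(returns[i - window:i], periods_per_year)
            for i in range(window, len(returns) + 1)]

def decay_alert(current: float, in_sample_peak: float,
                threshold: float = 0.5) -> bool:
    """True when the rolling Sharpe has fallen below threshold * peak."""
    return current < threshold * in_sample_peak

# Hypothetical 30 months of returns, giving 7 overlapping 24-month windows.
monthly = [0.01, 0.02] * 15
windows = rolling_sharpe(monthly)
```

With the figures from the paragraph above, `decay_alert(0.6, 1.8)` fires because 0.6 is below half of 1.8.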
Combining signals across families typically produces longer-lived alpha than any single source. A composite that blends card spend nowcasts with foot traffic, weighted by panel reliability, and overlaid with sentiment-derived 'narrative risk' for the same name, has empirically shown 1.4-1.8x the half-life of any component signal. The infrastructure to do this — feature stores like Tecton or Feast, with versioned features and lineage — has become standard at funds running more than $1B in systematic strategies.
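One plausible reading of that composite, offered as an assumption rather than a documented method: blend standardized component signals by a reliability weight, then shrink the result by a narrative-risk score between 0 and 1. All names and inputs below are hypothetical.

```python
def composite_signal(signals: dict[str, float], reliability: dict[str, float],
                     narrative_risk: float) -> float:
    """Reliability-weighted blend of z-scored signals, shrunk by narrative risk."""
    if not 0.0 <= narrative_risk <= 1.0:
        raise ValueError("narrative_risk must be in [0, 1]")
    total_weight = sum(reliability[name] for name in signals)
    if total_weight <= 0:
        raise ValueError("reliability weights must sum to a positive number")
    blend = sum(signals[name] * reliability[name] for name in signals) / total_weight
    return blend * (1.0 - narrative_risk)

# Hypothetical z-scores and panel reliability weights for one name.
score = composite_signal(
    signals={"card_spend": 1.0, "foot_traffic": 0.5},
    reliability={"card_spend": 0.6, "foot_traffic": 0.4},
    narrative_risk=0.25,
)
```

The blend here is 0.8 before the overlay and 0.6 after it. A multiplicative overlay is only one design choice; a fund could equally gate the position or widen its risk budget instead.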
Operating Model and Team Structure
The team structures we see succeed share a common pattern: a dedicated alt data lead reports to the CIO or head of research (not to IT), with three pods underneath. Vendor management and sourcing (1-2 FTEs) handles discovery, contracts, and the BattleFin/Eagle Alpha/Neudata pipeline. Data engineering (3-6 FTEs) owns ingestion, normalization, entity resolution, and the feature store. Quant research support (2-4 FTEs) sits adjacent to research and translates new datasets into research-ready feature sets within a 2-4 week SLA.
The cultural failure mode is the 'data hoarding' anti-pattern, where the alt data team's KPI is volume of data onboarded rather than alpha generated. We have audited shops with 60-100 active subscriptions where fewer than 15 were tied to any live strategy. The corrective is a quarterly review where every dataset is mapped to either (a) a live strategy with attributed P&L, (b) an active research project with a deadline, or (c) a deprecation timeline. Datasets that fall into none of the three are cut at renewal.
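The quarterly review rule reduces to a small classification. A sketch under the assumption that each subscription is tracked with three optional fields; the `Dataset` record and status labels are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Dataset:
    name: str
    live_strategy: Optional[str] = None       # strategy with attributed P&L
    research_deadline: Optional[str] = None   # ISO date of the research deadline
    deprecation_date: Optional[str] = None    # ISO date of planned deprecation

def review_status(ds: Dataset) -> str:
    """Classify a dataset into one of the three buckets, or flag it for cutting."""
    if ds.live_strategy:
        return "live"
    if ds.research_deadline:
        return "research"
    if ds.deprecation_date:
        return "deprecating"
    return "cut_at_renewal"

# Hypothetical subscription inventory.
inventory = [
    Dataset("card_panel_a", live_strategy="consumer_ls"),
    Dataset("foot_traffic_b", research_deadline="2026-03-31"),
    Dataset("legacy_feed_c"),
]
to_cut = [ds.name for ds in inventory if review_status(ds) == "cut_at_renewal"]
```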
The build-out sequences into four phases:
1. Cloud landing zone, orchestration (Airflow/Dagster), bitemporal table format (Iceberg/Delta), initial 2-3 vendor onboardings as reference implementations.
2. Expand to 10-15 vendors, build entity resolution layer, deploy data observability (Monte Carlo or equivalent), formalize point-in-time discipline.
3. Stand up feature store (Tecton/Feast), integrate with backtesting platform, deploy 3-5 production signals with attributed P&L.
4. Implement signal decay monitoring, quarterly vendor review process, exclusivity deal evaluation framework, MNPI compliance integration.
Where the Next 24 Months Are Heading
Three shifts will define alt data infrastructure investment through 2027. First, the cost-per-feature is falling as foundation models reduce the engineering burden for unstructured data — earnings call transcripts, satellite tiles, and social text are increasingly processed through fine-tuned LLMs rather than bespoke NLP or CV pipelines. We are seeing 50-70% reductions in feature engineering time for text-based signals at funds that have integrated open-weight models (Llama 3.1/4, Mistral Large) into their pipelines. Second, alt data is moving toward private datasets — point-of-sale feeds direct from retailers, payments data direct from networks under aggregation agreements — which carries higher cost and higher MNPI scrutiny but lower crowding.
Third, the line between alt data and core market data is dissolving. Bloomberg's acquisition of Second Measure (2020), FactSet's continued build-out of Open:FactSet, and S&P's Visible Alpha integration mean that what was 'alternative' in 2018 is becoming 'standard' by 2027. Funds whose competitive moat depended on having access to consumer transaction data will need to move up the value chain: to private data, to proprietary feature engineering, or to faster signal cycling. The infrastructure choices made in 2026 will determine which side of that line a fund ends up on.
The alt data pipeline is no longer a side project for a quant team or a discretionary fund's data science group. It is a P&L-bearing operational system that must meet the same uptime, observability, and compliance standards as the order management system or the risk engine. Funds that treat it that way generate 50-150 bps of attributable alpha from alt data inputs in normal market regimes. Funds that treat it as a research toy generate marketing materials.