In Focus/Systematic Alpha: Technology Stack for the Modern Hedge Fund

Hedge Funds — Article 2 of 12

Alternative Data Pipelines (Credit Card, Geo-Location, Satellite, Sentiment)

Alternative data has moved from edge experiment to core infrastructure at hedge funds, with global spend approaching $5B in 2026. This deep dive examines the engineering, vendor economics, and compliance architecture required to convert credit card panels, mobile geo-location, satellite imagery, and sentiment feeds into production alpha.

When a single Tuesday print of a credit card panel can move a $40B consumer name 3-5% intraday, the question for a CIO is no longer whether alternative data belongs in the stack — it is whether the pipeline can ingest, normalize, and signal-score the feed before the rest of the street catches up. Hedge fund spend on alternative data reached roughly $4.5B globally in 2025 and is on track to cross $5B in 2026, according to AlternativeData.org and Neudata tracking. The marginal dollar is no longer buying the data itself; it is buying the engineering needed to make the data usable inside a backtest, a live portfolio, and a Form PF disclosure.

This article walks through the four dominant alt data families — transaction (credit/debit card), geo-location and foot traffic, satellite and geospatial imagery, and text-based sentiment — and the reference architecture that converts them into tradable signals. It assumes the modular foundation described in From Monolith to Modular is in place, and it sets up the simulation and validation infrastructure covered in Backtesting at Scale.

$4.5B: Estimated global hedge fund and asset manager spend on alternative data in 2025 (Neudata, AlternativeData.org)

The Four Pillars and What They Actually Predict

Each alt data family has a distinct latency profile, coverage bias, and decay curve. Treating them as interchangeable inputs to a generic feature store is the most common architectural mistake we see in implementations. A credit card panel from Facteus or YipitData arrives T+2 to T+5 with merchant-level resolution but skews toward debit and prepaid cards. Second Measure (now Bloomberg Second Measure) leans more heavily on credit. Earnest Analytics weights toward higher-income consumers. The same ticker — say Chipotle (CMG) — can show a 200-300 bps difference in same-store sales nowcast across panels in a given quarter, and the spread itself is signal.

Geo-location data from SafeGraph, Advan, Veraset, and Placer.ai uses mobile SDK ping data to estimate foot traffic at brand-store-day resolution. Post-Apple ATT (April 2021) and the deprecation of mobile ad IDs, panel sizes have compressed roughly 30-40%, and panel stability has become a first-order concern. A traffic 'decline' at a retailer can be a panel artifact rather than a fundamental change. Survivor-bias correction and panel re-weighting now consume more pipeline engineering time than the raw feature extraction.
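
A minimal sketch of why panel re-weighting matters, assuming the vendor reports a daily count of active panel devices (the `active_devices` field and all numbers below are hypothetical, not taken from any vendor's schema). Raw visits fall 30% while the panel shrinks 35%, so the normalized rate actually rises:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DailyPanelObs:
    day: str             # ISO date of the observation
    store_visits: int    # raw device visits observed at the store
    active_devices: int  # devices active in the panel that day (hypothetical field)


def visits_per_10k_devices(obs: DailyPanelObs) -> float:
    """Normalize raw visits by panel size so panel shrinkage is not read as lost traffic."""
    if obs.active_devices <= 0:
        raise ValueError("active_devices must be positive")
    return 10_000 * obs.store_visits / obs.active_devices


# Illustrative numbers only: the panel loses 35% of its devices between the two dates.
before = DailyPanelObs("2021-03-15", store_visits=1200, active_devices=1_000_000)
after = DailyPanelObs("2021-06-15", store_visits=840, active_devices=650_000)

raw_change = after.store_visits / before.store_visits - 1
norm_change = visits_per_10k_devices(after) / visits_per_10k_devices(before) - 1

print(f"raw visits change:        {raw_change:+.1%}")
print(f"panel-normalized change:  {norm_change:+.1%}")
```

A production version would likely re-weight by geography and device cohort rather than by a single panel-wide count; this only illustrates the direction of the correction.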

Satellite imagery from Planet Labs (PlanetScope, ~3m resolution, near-daily coverage of the global landmass), Maxar (roughly 30cm resolution for tasked collects), and Orbital Insight or RS Metrics for derived products gives counts of cars in parking lots, oil in floating-roof tanks, container throughput at ports, and crop yield estimates. The economics have shifted: a year of derived parking-lot counts for the top 50 US retailers runs $150K-$400K, versus $1.5M-$3M to build the computer vision pipeline in-house from raw tiles.

Sentiment and text data (RavenPack, Bloomberg, Refinitiv MarketPsych, Accern, AlphaSense) overlap heavily with the techniques covered in NLP on Earnings Calls and 10-Ks. The distinction here is breadth: a sentiment pipeline at the alt data layer pulls news wires, social media, Reddit, regulatory filings, Glassdoor reviews, and patent filings into a unified entity-mapped feed, typically with 50-200 ms latency from publication to scored signal.

Alt Data Family Characteristics

Data Family	Typical Latency	Annual Cost (Mid-Tier Fund)	Primary Decay Risk
Credit/Debit Card Panels	T+2 to T+7	$300K-$1.5M per provider	Panel composition drift, issuer loss
Geo-Location / Foot Traffic	T+1 to T+3	$150K-$600K	SDK deprecation, ATT/GDPR shrinkage
Satellite Imagery (derived)	T+0 to T+2 (weather-dependent)	$200K-$800K	Cloud cover, commoditization
Sentiment / News / Social	50ms - 5 sec	$100K-$500K	Model overfit, narrative crowding

Reference Architecture: From Vendor Drop to Live Signal

A production alt data pipeline at a mid-to-large hedge fund typically organizes into five layers: ingestion, normalization, entity resolution, feature engineering, and signal delivery. The reference stack we deploy uses S3 or GCS as the landing zone, Apache Airflow or Dagster for orchestration, a Delta Lake or Apache Iceberg table format for versioning, Spark or Polars for transformation, and Snowflake or Databricks SQL for analyst-facing access. The bill of materials matters less than the discipline around three things: schema contracts with vendors, point-in-time correctness, and entity mapping.

Ingestion fails more often than people admit. Roughly 15-25% of vendor deliveries in any given month arrive late, malformed, or with silent schema drift — a column renamed, a currency unit changed, a date format flipped from ISO to US. Pipelines that catch this at ingest (via Great Expectations, Soda, or Monte Carlo data observability) avoid the worst failure mode in alt data: a quietly broken feature that contaminates a live signal for three weeks before anyone notices the Sharpe collapse.
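
The kind of ingest-time check described above can be sketched in plain Python. This is not the Great Expectations, Soda, or Monte Carlo API; the `CONTRACT` mapping, column names, and sample rows are hypothetical and stand in for whatever schema contract is agreed with the vendor:

```python
from datetime import date

# Hypothetical schema contract: column name -> parser that raises on bad input.
CONTRACT = {
    "merchant_id": str,
    "txn_date": date.fromisoformat,  # rejects US-style dates such as 03/15/2025
    "spend_usd": float,
}


def validate_delivery(rows: list[dict]) -> list[str]:
    """Return human-readable problems; an empty list means the delivery passed."""
    problems = []
    for i, row in enumerate(rows):
        missing = CONTRACT.keys() - row.keys()
        extra = row.keys() - CONTRACT.keys()
        if missing or extra:
            # A renamed column shows up as one missing plus one unexpected name.
            problems.append(
                f"row {i}: missing={sorted(missing)} unexpected={sorted(extra)}"
            )
            continue
        for col, parse in CONTRACT.items():
            try:
                parse(row[col])
            except (TypeError, ValueError):
                problems.append(f"row {i}: bad {col}={row[col]!r}")
    return problems


good = {"merchant_id": "M1", "txn_date": "2025-03-15", "spend_usd": "12.50"}
us_date = {"merchant_id": "M1", "txn_date": "03/15/2025", "spend_usd": "12.50"}
renamed = {"merchant": "M1", "txn_date": "2025-03-15", "spend_usd": "12.50"}

problems = validate_delivery([good, us_date, renamed])
for p in problems:
    print(p)
```

The design choice worth keeping from the sketch is that a failed delivery is quarantined at the landing zone rather than passed downstream with nulls or coerced values.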

⚠️ Point-in-Time Correctness Is Non-Negotiable

If a credit card panel is restated 14 days after initial delivery (which most are), your backtest must see only what was knowable on the original timestamp. Storing data without an 'as-of' or 'knowledge_date' column is the single most common cause of inflated backtest Sharpe ratios in alt data research. Bitemporal tables — valid_time plus knowledge_time — are mandatory, not optional.
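
A minimal in-memory sketch of a bitemporal lookup, assuming each delivered version of a data point is stored as its own row. The `Observation` type, the `as_of` helper, and the sample values are illustrative; a real implementation would express the same filter as a query against an Iceberg or Delta table:

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class Observation:
    ticker: str
    valid_time: date      # the period the value describes
    knowledge_time: date  # when this version was delivered to the fund
    value: float


def as_of(rows, ticker: str, valid_time: date, cutoff: date) -> Optional[Observation]:
    """Latest version of a data point that was already known on `cutoff`."""
    candidates = [
        r for r in rows
        if r.ticker == ticker
        and r.valid_time == valid_time
        and r.knowledge_time <= cutoff
    ]
    return max(candidates, key=lambda r: r.knowledge_time, default=None)


# Illustrative rows: an initial print and its restatement 14 days later.
rows = [
    Observation("CMG", date(2025, 3, 31), date(2025, 4, 3), 0.042),
    Observation("CMG", date(2025, 3, 31), date(2025, 4, 17), 0.031),
]

backtest_view = as_of(rows, "CMG", date(2025, 3, 31), cutoff=date(2025, 4, 5))
latest_view = as_of(rows, "CMG", date(2025, 3, 31), cutoff=date(2025, 5, 1))
print(backtest_view.value, latest_view.value)
```

A backtest simulating 5 April sees only the original print; using the restated value on that date would be look-ahead bias.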

Entity resolution is where most in-house pipelines stall. A credit card transaction at 'CMG #2847 DENVER' must map to Chipotle Mexican Grill (CMG US) at the security level, to its parent entity, to its sector, and to its geographic exposure bucket. Vendors like FactSet Concordance, OpenFIGI, and PermID handle the security side; the merchant-to-issuer mapping is the hard part and is often vendor-provided but worth auditing. We have seen mapping error rates of 3-8% on long-tail merchants in raw card panels, which translates directly into noise in same-store-sales nowcasts.
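
To make the mapping step concrete, here is a deliberately simple sketch. The `MERCHANT_ALIASES` table is hypothetical and tiny; production systems rely on vendor mappings or a concordance service, plus fuzzy matching and human review of unmapped descriptors:

```python
import re
from typing import Optional

# Hypothetical alias table; real tables run to many thousands of entries.
MERCHANT_ALIASES = {
    "CMG": "CMG US",
    "CHIPOTLE": "CMG US",
}


def normalize_descriptor(raw: str) -> str:
    """Upper-case, drop store numbers such as '#2847', and strip punctuation."""
    text = raw.upper()
    text = re.sub(r"#\s*\d+", " ", text)
    text = re.sub(r"[^A-Z ]", " ", text)
    return " ".join(text.split())


def resolve_security(raw: str) -> Optional[str]:
    """Return the mapped security, or None so the row can go to a review queue."""
    for token in normalize_descriptor(raw).split():
        if token in MERCHANT_ALIASES:
            return MERCHANT_ALIASES[token]
    return None


print(resolve_security("CMG #2847 DENVER"))
print(resolve_security("SQ *JOES TACOS"))
```

Returning `None` instead of a best guess keeps long-tail errors visible, which is what makes the mapping error rate auditable in the first place.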

Feature engineering for alt data is dominated by panel normalization. The raw daily spend at a retailer is nearly useless; what matters is the year-over-year growth rate of same-panel, same-store spend, adjusted for panel composition changes, seasonality, calendar effects (Easter shift, leap years, fiscal calendar misalignment), and known one-offs (hurricanes, store closures, promotional events). A well-engineered nowcast typically explains 60-75% of variance in reported quarterly same-store sales for consumer discretionary names — enough to make the residual the actual tradable signal.
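
The same-store part of that adjustment can be sketched as follows. Store identifiers and spend figures are made up, and the seasonality, calendar, and panel-composition adjustments are intentionally left out:

```python
def same_store_yoy(current: dict[str, float], prior: dict[str, float]) -> float:
    """Year-over-year spend growth over stores present in both periods only."""
    common = current.keys() & prior.keys()
    if not common:
        raise ValueError("no overlapping stores between periods")
    prior_total = sum(prior[s] for s in common)
    if prior_total <= 0:
        raise ValueError("prior-period spend must be positive")
    return sum(current[s] for s in common) / prior_total - 1


# Illustrative panel spend by store: store_c closed, store_d opened.
prior = {"store_a": 100.0, "store_b": 200.0, "store_c": 50.0}
current = {"store_a": 110.0, "store_b": 205.0, "store_d": 80.0}

naive = sum(current.values()) / sum(prior.values()) - 1
same_store = same_store_yoy(current, prior)
print(f"naive total growth: {naive:+.1%}")
print(f"same-store growth:  {same_store:+.1%}")
```

The gap between the two numbers comes entirely from store openings and closures, which is why the naive total is a poor proxy for reported same-store sales.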

[Figure: Typical Alpha Half-Life by Alt Data Family (Months)]

The Vendor Economics and Build-vs-Buy Math

A typical mid-sized systematic fund running $2-5B in equity strategies will spend $3M-$8M annually on alt data subscriptions across 15-30 vendors, plus another $2M-$4M on the engineering team (5-12 FTEs covering data engineering, quant research support, and vendor management). The ratio of data spend to engineering spend is informative: under 1:1 usually means the fund is overspending on data it cannot operationalize; over 3:1 usually means it is leaving signal on the table by under-investing in the pipeline.

On build-vs-buy, the rule of thumb that has held up across implementations: buy the raw or lightly-processed feed, build the proprietary feature transformations. Paying RavenPack for entity-tagged news at $200K-$400K per year is dramatically cheaper than building NLP infrastructure to scrape, dedupe, and tag global wires. But buying a pre-built 'retail same-store-sales signal' for $500K is paying for a feature that 40 other funds also bought — the alpha has been arbitraged before you sign the contract.

Buy the raw feed. Build the feature. Pay for exclusivity only when you can prove the alpha survives the cost.
— Engagement principle, alt data infrastructure builds

Exclusivity deals — where a fund pays $1M-$5M for a 6-12 month exclusive window on a new dataset — have grown roughly 3x in volume between 2022 and 2025 based on BattleFin and Eagle Alpha discovery platform data. The economics work only if the fund can deploy the data in production within 4-8 weeks of signing. We have audited exclusivity contracts where the fund had not finished onboarding the data when the exclusivity window expired. The contract value in that case is zero.

Pre-Contract Vendor Due Diligence

Data lineage and source: is it scraped, panel-derived, or licensed from the originating party with documented consent?

Personally identifiable information (PII) handling: has it been hashed, aggregated, k-anonymized? Request the privacy impact assessment.

Historical point-in-time data availability: minimum 5-7 years for equity strategies; ask for the original 'as-of' delivery timestamps, not just the value dates.

Restatement history: request the full revision log for the trailing 12 months; if the vendor cannot provide it, walk away.

Panel composition disclosure: coverage by geography, demographics, issuer; updated quarterly with documented change notes.

MNPI representation and indemnification: explicit contractual representation that the data does not contain material non-public information.

Termination and data destruction terms: what happens to derivative features built on the data if the contract ends?

MNPI, Web Scraping, and the Regulatory Perimeter

The SEC's 2021 settlement with App Annie (now data.ai) for $10M established that alt data vendors can themselves be the source of securities fraud charges, and that funds consuming the data bear diligence responsibility. The 2024 SEC enforcement priorities, restated in the 2025 examination letter, list alternative data sourcing under 'information barriers and MNPI controls.' For European funds, the MAR (Market Abuse Regulation) Article 7 definition of inside information applies, and GDPR Article 6 requires lawful basis for processing personal data — relevant when geo-location feeds derive from individual device pings.

The hiQ ruling (hiQ Labs v. LinkedIn, ultimately settled in 2022) left web scraping in a gray zone where public data is scrapable but ToS violations can support breach of contract claims. Funds that scrape directly are increasingly rare; the legal risk has shifted to vendors who scrape on behalf of clients. The vendor contract should contain explicit reps that data was collected lawfully, with consent where required, and that the fund will not be a joint tortfeasor in any future claim. Compliance teams should map every dataset to a risk tier and an information barrier policy, the same way they handle expert network calls.

🎯 The Three-Question MNPI Screen

Before onboarding any new dataset, compliance should answer three questions in writing: (1) Could this data have been obtained directly from an issuer or its insiders? (2) Is the data aggregated such that no single corporate counterparty's confidential information is recoverable? (3) Does the vendor have a documented consent or licensing chain from the original data subject? A 'yes' or 'unclear' on the first question, or a 'no' or 'unclear' on the second or third, should trigger legal review before contract signing.
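
One way to make the screen auditable is to record the answers as data and derive the escalation decision from them. The sketch below is an assumption about how that could look, not a compliance standard: the question keys are hypothetical, and it reads the risk-reducing answer as 'no' for the first question and 'yes' for the other two, treating anything else, including a missing answer, as a trigger:

```python
# Risk-reducing answer for each question; any other answer escalates.
SAFE_ANSWERS = {
    "sourced_from_issuer_or_insiders": "no",
    "aggregated_beyond_single_counterparty": "yes",
    "documented_consent_or_license_chain": "yes",
}


def needs_legal_review(answers: dict[str, str]) -> bool:
    """True if any answer is unfavorable, unclear, or missing."""
    return any(
        answers.get(question, "unclear") != safe
        for question, safe in SAFE_ANSWERS.items()
    )


clean = {
    "sourced_from_issuer_or_insiders": "no",
    "aggregated_beyond_single_counterparty": "yes",
    "documented_consent_or_license_chain": "yes",
}
murky = dict(clean, documented_consent_or_license_chain="unclear")

print(needs_legal_review(clean))
print(needs_legal_review(murky))
```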

Beyond MNPI, the operational compliance load includes Form PF Section 2 (large hedge fund advisers must report on data sources used in investment decisions in qualitative terms), AIFMD Annex IV (for EU funds, risk reporting that increasingly references data inputs), and the CFTC's evolving stance on alt data in commodity strategies. These tie into the broader regulatory automation discussed in Regulatory Reporting.

Signal Validation and Alpha Decay

Alt data research is littered with backtests that looked excellent out-of-sample on paper and then earned 30 bps of negative alpha live. Three failure modes dominate. First, look-ahead bias from non-point-in-time data, which bitemporal storage fixes. Second, survivorship bias in the panel itself: the merchants in a card panel today are not the merchants of three years ago, and naive analysis treats the historical panel as if it had the current composition. Third, crowding: when a signal becomes widely known, the alpha decays into transaction costs.

Half-life measurement should be standard practice. A signal with a 24-month rolling Sharpe that drops from 1.8 to 0.6 over 18 months is not 'underperforming' — it is decaying, and the question is whether to retire it or transform it (combine with other signals, change the holding horizon, restrict to a less crowded universe). Funds that monitor alpha decay quantitatively, with automated alerts when rolling Sharpe drops below 50% of the in-sample peak, retire roughly 25-35% of alt data signals each year and replace them with new combinations.
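
The alert rule can be sketched with the standard library alone. The function names are illustrative, the Sharpe calculation assumes monthly excess returns annualized by the square root of 12, and the 24-month window and 50% threshold simply mirror the figures quoted above:

```python
import math
import statistics


def annualized_sharpe(monthly_returns: list[float]) -> float:
    """Annualized Sharpe ratio from monthly excess returns."""
    stdev = statistics.stdev(monthly_returns)
    if stdev == 0:
        raise ValueError("returns have zero variance")
    return math.sqrt(12) * statistics.fmean(monthly_returns) / stdev


def rolling_sharpe(monthly_returns: list[float], window: int = 24) -> list[float]:
    """One Sharpe value per complete trailing window."""
    return [
        annualized_sharpe(monthly_returns[end - window:end])
        for end in range(window, len(monthly_returns) + 1)
    ]


def decay_alert(current: float, in_sample_peak: float, threshold: float = 0.5) -> bool:
    """Fire when the rolling Sharpe falls below a fraction of its in-sample peak."""
    return current < threshold * in_sample_peak


# The drop from 1.8 to 0.6 breaches the 50% threshold (0.6 < 0.9).
print(decay_alert(current=0.6, in_sample_peak=1.8))
```

In practice the alert would feed a review queue rather than retire the signal automatically, since a short drawdown and structural decay can look alike over a single window.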

💡 Did You Know?

Renaissance Technologies' Medallion fund has reportedly cycled through alt data signals at a rate of 30-40% per year for over a decade. Stable alpha at the firm comes from the meta-process of signal discovery and retirement, not from any single dataset surviving long-term.

Combining signals across families typically produces longer-lived alpha than any single source. A composite that blends card spend nowcasts with foot traffic, weighted by panel reliability, and overlaid with sentiment-derived 'narrative risk' for the same name, has empirically shown 1.4-1.8x the half-life of any component signal. The infrastructure to do this — feature stores like Tecton or Feast, with versioned features and lineage — has become standard at funds running more than $1B in systematic strategies.
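
A minimal sketch of such a blend, assuming each component has already been standardized to a z-score and assigned a non-negative reliability weight. The weights, the z-scores, and the linear narrative-risk haircut are all illustrative assumptions rather than a recommended specification:

```python
def composite_signal(components: dict[str, tuple[float, float]]) -> float:
    """Reliability-weighted average; components maps name -> (z_score, weight)."""
    total_weight = sum(weight for _, weight in components.values())
    if total_weight <= 0:
        raise ValueError("total reliability weight must be positive")
    return sum(z * weight for z, weight in components.values()) / total_weight


def apply_narrative_overlay(signal: float, narrative_risk: float) -> float:
    """Shrink the signal linearly as narrative risk rises from 0 to 1."""
    if not 0.0 <= narrative_risk <= 1.0:
        raise ValueError("narrative_risk must lie in [0, 1]")
    return signal * (1.0 - narrative_risk)


blend = composite_signal({
    "card_spend": (1.2, 0.8),    # stable panel, higher weight
    "foot_traffic": (0.4, 0.4),  # post-ATT panel, lower weight
})
final = apply_narrative_overlay(blend, narrative_risk=0.25)
print(round(blend, 3), round(final, 3))
```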

Operating Model and Team Structure

The team structures we see succeed share a common pattern: a dedicated alt data lead reports to the CIO or head of research (not to IT), with three pods underneath. Vendor management and sourcing (1-2 FTEs) handles discovery, contracts, and the BattleFin/Eagle Alpha/Neudata pipeline. Data engineering (3-6 FTEs) owns ingestion, normalization, entity resolution, and the feature store. Quant research support (2-4 FTEs) sits adjacent to research and translates new datasets into research-ready feature sets within a 2-4 week SLA.

“We stopped measuring our alt data team on number of datasets onboarded and started measuring on number of datasets in live production with non-zero capital allocation. The number dropped from 80 to 23. Our P&L went up.”

— Head of Systematic Research, $6B equity hedge fund

The cultural failure mode is the 'data hoarding' anti-pattern, where the alt data team's KPI is volume of data onboarded rather than alpha generated. We have audited shops with 60-100 active subscriptions where fewer than 15 were tied to any live strategy. The corrective is a quarterly review where every dataset is mapped to either (a) a live strategy with attributed P&L, (b) an active research project with a deadline, or (c) a deprecation timeline. Datasets that fall into none of the three are cut at renewal.
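
The quarterly review reduces to a simple classification once each subscription carries the relevant metadata. The `Dataset` fields and bucket labels below are hypothetical and would map to whatever the vendor management system already tracks:

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class Dataset:
    name: str
    live_strategy: Optional[str] = None        # strategy with attributed P&L
    research_deadline: Optional[date] = None   # active research project
    deprecation_date: Optional[date] = None    # agreed wind-down


def review_bucket(ds: Dataset) -> str:
    """Assign a dataset to bucket (a), (b), or (c), or flag it for cutting."""
    if ds.live_strategy:
        return "live"
    if ds.research_deadline:
        return "research"
    if ds.deprecation_date:
        return "deprecating"
    return "cut_at_renewal"


subscriptions = [
    Dataset("card_panel_a", live_strategy="consumer_nowcast"),
    Dataset("port_imagery", research_deadline=date(2026, 9, 30)),
    Dataset("app_reviews"),
]
for ds in subscriptions:
    print(ds.name, review_bucket(ds))
```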

Typical 18-Month Alt Data Capability Build

Months 1-3: Foundation

Cloud landing zone, orchestration (Airflow/Dagster), bitemporal table format (Iceberg/Delta), initial 2-3 vendor onboardings as reference implementations.

Months 4-9: Scale Ingestion

Expand to 10-15 vendors, build entity resolution layer, deploy data observability (Monte Carlo or equivalent), formalize point-in-time discipline.

Months 10-15: Feature Store and Research Workflow

Stand up feature store (Tecton/Feast), integrate with backtesting platform, deploy 3-5 production signals with attributed P&L.

Months 16-18: Optimization and Retirement Discipline

Implement signal decay monitoring, quarterly vendor review process, exclusivity deal evaluation framework, MNPI compliance integration.

Where the Next 24 Months Are Heading

Three shifts will define alt data infrastructure investment through 2027. First, the cost-per-feature is falling as foundation models reduce the engineering burden for unstructured data — earnings call transcripts, satellite tiles, and social text are increasingly processed through fine-tuned LLMs rather than bespoke NLP or CV pipelines. We are seeing 50-70% reductions in feature engineering time for text-based signals at funds that have integrated open-weight models (Llama 3.1/4, Mistral Large) into their pipelines. Second, alt data is moving toward private datasets — point-of-sale feeds direct from retailers, payments data direct from networks under aggregation agreements — which carries higher cost and higher MNPI scrutiny but lower crowding.

Third, the line between alt data and core market data is dissolving. Bloomberg's acquisition of Second Measure (2020), FactSet's continued build-out of Open:FactSet, and S&P's Visible Alpha integration mean that what was 'alternative' in 2018 is becoming 'standard' by 2027. Funds whose competitive moat depended on having access to consumer transaction data will need to move up the value chain: to private data, to proprietary feature engineering, or to faster signal cycling. The infrastructure choices made in 2026 will determine which side of that line a fund ends up on.

The alt data pipeline is no longer a side project for a quant team or a discretionary fund's data science group. It is a P&L-bearing operational system that must meet the same uptime, observability, and compliance standards as the order management system or the risk engine. Funds that treat it that way generate 50-150 bps of attributable alpha from alt data inputs in normal market regimes. Funds that treat it as a research toy generate marketing materials.

Frequently Asked Questions

How much should a $2-5B hedge fund budget for alternative data in 2026?

Total alt data investment typically runs $5M-$12M per year at this AUM, split roughly 60/40 between vendor subscriptions ($3M-$8M across 15-30 providers) and the engineering and research team that operationalizes the data (5-12 FTEs at $400K-$600K fully loaded). Funds spending less than $3M total are usually either smaller or under-invested; over $15M typically requires either much larger AUM or exclusivity deals.

What is the single biggest technical mistake in alt data pipelines?

Failing to store data bitemporally — separating the value date from the knowledge date (when the data became available). Without this, backtests use restated or revised data that was not actually knowable at the time, inflating apparent alpha by 30-100% in our audits. Every alt data table should have a knowledge_date or as_of_date column from the first day of ingestion.

How do we avoid MNPI risk when buying alternative data?

Run a three-question screen before signing: could the data have been sourced from issuer insiders, is it aggregated enough that no single counterparty's confidential information is recoverable, and is there a documented consent or licensing chain from the data subjects. Require contractual MNPI representations and indemnification from the vendor, and treat alt data ingestion as subject to the same information barrier policies as expert network engagements.

Should we build a satellite imagery pipeline in-house or buy derived products?

For all but the largest quant shops, buy the derived products. A year of parking-lot counts, oil tank levels, or port throughput data for the top names in a sector costs $150K-$400K, versus $1.5M-$3M to build the computer vision pipeline plus ongoing model maintenance. Build in-house only when you have identified a specific use case where commercial products are inadequate and the alpha opportunity justifies a multi-year capability investment.

How quickly does alt data alpha decay once a signal goes live?

Single-source signals from widely-distributed datasets typically show alpha half-lives of 6-18 months, with sentiment signals decaying fastest and niche satellite-derived signals decaying slowest. Composite signals combining 3-5 data families can extend half-lives to 20-30 months. Funds that monitor rolling Sharpe ratios and retire signals systematically replace 25-35% of their alt data signal library each year.