Methodology: how we forecast Steam launches
An honest overview of what makes our calibration trustworthy and our counterfactuals causal — without giving away the recipe.
Built by Greg C. — senior software engineer with production ML experience in calibrated prediction (10+ years shipping ML systems where probability claims have to survive contact with real money). Open-source companion tooling: steamforecast-mcp (MCP server for AI assistants) · steam-page-stats (Python scraper). Methodology validation runs daily on the launching-games cohort — see the live Q2 2026 calibration report.
The variance problem with Steam launch estimates
The industry default for estimating Steam launch revenue is the Boxleiter heuristic: roughly 50 sales per Steam review in the original formulation (2010s), updated to approximately 30–63 sales per review in modern variants. For wishlist-to-sales conversion, the parallel rule of thumb is first-week sales ≈ 0.15–0.30 × wishlist count.
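As a rough illustration, the two rules of thumb reduce to simple arithmetic. The function names and the sample inputs (400 reviews, 20,000 wishlists) are hypothetical; the multipliers are the ranges quoted above.

```python
def boxleiter_sales_range(review_count, low_multiplier=30, high_multiplier=63):
    """Estimated unit sales range from a Steam review count (modern multiplier range)."""
    return review_count * low_multiplier, review_count * high_multiplier


def first_week_sales_range(wishlists, low_rate=0.15, high_rate=0.30):
    """Estimated first-week unit sales range from a wishlist count."""
    return round(wishlists * low_rate), round(wishlists * high_rate)


# Hypothetical inputs, for illustration only.
print(boxleiter_sales_range(400))      # (12000, 25200)
print(first_week_sales_range(20_000))  # (3000, 6000)
```

Even before any real-world variance is considered, each heuristic already spans roughly a 2× range, which is why reading a single number off either one understates the uncertainty.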
These are point estimates. And that is where the problem starts.
Per GameDiscoverCo’s 2024–2025 wishlist-conversion data, real launches at comparable wishlist counts show 10–20× variance in week-1 revenue. Documented high-end outliers include Peak (266× conversion), Nubby’s Number Factory (140×), Mage Arena (78×), R.E.P.O. (68×), Webfishing (67×). These aren’t data errors—they are features of the real distribution.
When the underlying distribution spans four or more orders of magnitude, a point estimate is structurally misleading. Publishing “$80K expected” when the honest answer is “$20K–$400K with 80% probability” doesn’t simplify the problem—it hides it. The case for risk-quantification isn’t that better models will narrow the interval to a fine point. The true interval is wide. You need to know that before committing marketing budget.
Calibrated risk ranges with coverage guarantees
We publish a P10–P90 cone that’s empirically calibrated: when we say 80% coverage, we’ve validated on held-out historical launches that the interval really does contain the true outcome 80% of the time. That validation runs every time the model retrains, and it’s the gate that decides whether a new model ships.
The methodology family is distribution-free—no Gaussian assumption, no parametric form for revenue. The only requirement is that the calibration set and the test set come from the same distribution within a stratum.
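To make the distribution-free idea concrete, here is a minimal sketch of split conformal prediction, one well-known member of this methodology family. It is an assumption that this resembles the production method: the log-revenue scale, the symmetric absolute-residual score, the function names, and the synthetic data are all illustrative choices, not the actual recipe.

```python
import math
import numpy as np


def conformal_half_width(cal_true_log, cal_pred_log, coverage=0.80):
    """Interval half-width (log-revenue units) from held-out calibration residuals."""
    scores = np.sort(np.abs(np.asarray(cal_true_log) - np.asarray(cal_pred_log)))
    n = scores.size
    # Finite-sample correction: take the ceil((n + 1) * coverage)-th smallest score.
    rank = math.ceil((n + 1) * coverage)
    if rank > n:
        return math.inf  # too few calibration points for this coverage level
    return float(scores[rank - 1])


def revenue_interval(pred_log, half_width):
    """Turn a log-revenue point prediction into a (low, high) revenue interval."""
    return math.exp(pred_log - half_width), math.exp(pred_log + half_width)


# Synthetic demo with heavy-tailed (Student-t) errors; no Gaussian assumption is used.
rng = np.random.default_rng(0)
cal_pred = rng.normal(11.0, 1.0, size=2000)             # hypothetical model outputs
cal_true = cal_pred + rng.standard_t(df=3, size=2000)   # hypothetical realized log revenue
width = conformal_half_width(cal_true, cal_pred)
low, high = revenue_interval(math.log(80_000), width)
print(f"80% interval around an $80K point forecast: ${low:,.0f} to ${high:,.0f}")
```

The only thing the calibration step relies on is that new launches look like the held-out calibration launches, which matches the same-distribution requirement stated above. A symmetric score is the simplest choice; an asymmetric cone would need a different score function.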
What we report:
- The active model’s empirical coverage on the held-out test set (currently 82% against a target of 80%)
- Coverage by revenue quartile and by macro-genre — we don’t average across strata
- An out-of-distribution flag when the input falls outside the calibration domain
“Calibrated” isn’t a marketing word for us. It’s a property we test for and refuse to ship without.
Per-genre stratified calibration
Aggregate coverage across all genres can hide bucket-level miscalibration. A model might over-cover casual games (intervals too wide) and under-cover roguelites (intervals too narrow), while the global average looks acceptable. For an indie dev making a launch decision, average coverage is the wrong number.
We stratify the calibration set by macro-genre cluster. Each cluster gets independently-validated coverage. Whatever your genre, the published cone has been validated on held-out launches in your stratum—not on a pool that happens to include fifty unrelated games diluting the interval.
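The way an aggregate number can hide bucket-level miscalibration is easy to show with a toy example. The sketch below is illustrative only: the record layout, the function name, and the made-up intervals are assumptions, not the production validation code.

```python
from collections import defaultdict


def coverage_by_stratum(records):
    """records: iterable of (stratum, interval_low, interval_high, actual)."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for stratum, low, high, actual in records:
        totals[stratum] += 1
        hits[stratum] += int(low <= actual <= high)
    per_stratum = {s: hits[s] / totals[s] for s in totals}
    overall = sum(hits.values()) / sum(totals.values())
    return overall, per_stratum


# Made-up data: casual intervals always cover, roguelite intervals cover 6 of 10.
records = (
    [("casual", 0, 10, 5)] * 10
    + [("roguelite", 0, 10, 5)] * 6
    + [("roguelite", 0, 10, 50)] * 4
)
overall, per_stratum = coverage_by_stratum(records)
print(overall)      # 0.8
print(per_stratum)  # {'casual': 1.0, 'roguelite': 0.6}
```

The pooled figure hits the 80% target exactly while the roguelite stratum sits at 60%, which is the failure mode that per-stratum validation is meant to catch.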
We separately validate that revenue-quartile coverage holds. Top-quartile (AAA-tier) launches get supplemental calibration that prevents the genre-cluster averaging from leaving them under-covered — a quirk of heavy-tailed data we discovered the hard way.
Active-model coverage snapshot
| Metric | Value | Target / floor | Status |
|---|---|---|---|
| Overall empirical coverage (P10–P90) | 82% | 80% | PASS |
| Top revenue quartile (Q4) coverage | 71.9% | 70.0% floor | PASS |
| Q1–Q3 coverage | All within tolerance | 80% ± 5% | PASS |
| Calibration corpus size | 1,560 historical Steam launches held out from training, graded against derived revenue ground truth | | |
| Active model version | boxleiter_v1_1_2026_05_05 | | |
These numbers are recomputed and re-gated on every retrain. A model that fails the per-quartile or per-genre coverage gate does not ship; the prior model stays active until the new one passes.
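A ship/no-ship gate of this kind can be sketched as a small check. The thresholds below are taken from the table above (80% target, ±5% tolerance, 70% floor for Q4); the function name, the rule that overall coverage must be at least the target, the reuse of the ±5% tolerance for genres, and the Q1–Q3 and per-genre sample values are assumptions for illustration.

```python
def coverage_gate_failures(overall, by_quartile, by_genre,
                           target=0.80, tolerance=0.05, top_quartile_floor=0.70):
    """Return a list of human-readable gate failures; an empty list means 'ship'."""
    failures = []
    if overall < target:  # assumed rule: overall coverage must reach the target
        failures.append(f"overall {overall:.1%} below target {target:.0%}")
    for quartile, cov in by_quartile.items():
        if quartile == "Q4":
            if cov < top_quartile_floor:
                failures.append(f"Q4 {cov:.1%} below floor {top_quartile_floor:.0%}")
        elif abs(cov - target) > tolerance:
            failures.append(f"{quartile} {cov:.1%} outside {target:.0%} ± {tolerance:.0%}")
    for genre, cov in by_genre.items():  # assumed: genres use the same tolerance
        if abs(cov - target) > tolerance:
            failures.append(f"{genre} {cov:.1%} outside {target:.0%} ± {tolerance:.0%}")
    return failures


# Overall and Q4 come from the snapshot table; the other values are hypothetical.
quartiles = {"Q1": 0.81, "Q2": 0.83, "Q3": 0.79, "Q4": 0.719}
passing = coverage_gate_failures(0.82, quartiles, {"roguelite": 0.78, "casual": 0.84})
failing = coverage_gate_failures(0.82, quartiles, {"roguelite": 0.60, "casual": 0.84})
print("ship" if not passing else passing)
print("ship" if not failing else failing)
```

Returning the list of failures rather than a bare boolean keeps the reason a candidate model was held back visible in the retrain log.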
Honest causal estimates for marketing-lever effects
When we tell you “adding a demo would change your forecast by $X” we’re reporting a causal effect estimate, not a correlational one. Most calculators ignore this distinction; we don’t.
The naive approach—flip a feature, re-run the predictor, report the delta—captures whatever the model learned from observed correlations. But indie launch data is full of confounding: studios that ship demos are systematically different from studios that don’t (more polished, more localized, more followed). A “+$340K from a demo” reading from naive analysis can be entirely selection bias, not the demo’s actual effect.
We use methods from the modern causal-inference literature to residualize against confounders before estimating each lever’s effect. Each estimate ships with a 95% confidence interval. When the CI overlaps zero, we say so explicitly—we tag the row “not significant” rather than claim a lift we can’t prove. We’re the only forecaster on the market that does this.
What this means in practice: if our naive forecast says a demo would lift you by 20% but the causal estimate says “+5% with CI [−15%, +30%]”, we tell you the second number. Knowing that the demo lever is statistically indistinguishable from zero might be worth more than a confident-but-misleading recommendation.
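The gap between a naive delta and a residualized estimate can be shown on synthetic data. The sketch below uses linear partialling-out (the Frisch-Waugh-Lovell construction), which is only one simple way to residualize against confounders; the actual estimator is not disclosed here, and the variable names, the single "polish" confounder, the simulated effect size, and the homoskedastic standard error are all assumptions.

```python
import numpy as np


def residualize(values, confounders):
    """Remove the part of `values` that a linear model of the confounders explains."""
    design = np.column_stack([np.ones(len(values)), confounders])
    coef, *_ = np.linalg.lstsq(design, values, rcond=None)
    return values - design @ coef


def lever_effect(outcome, lever, confounders, z=1.96):
    """Effect of `lever` on `outcome` after partialling out confounders, with a 95% CI."""
    outcome = np.asarray(outcome, dtype=float)
    lever = np.asarray(lever, dtype=float)
    confounders = np.asarray(confounders, dtype=float).reshape(len(outcome), -1)
    y_res = residualize(outcome, confounders)
    t_res = residualize(lever, confounders)
    effect = float(t_res @ y_res / (t_res @ t_res))
    leftover = y_res - effect * t_res
    dof = len(outcome) - confounders.shape[1] - 2  # confounders + intercept + lever
    se = float(np.sqrt((leftover @ leftover) / dof / (t_res @ t_res)))
    low, high = effect - z * se, effect + z * se
    return {"effect": effect, "ci": (low, high), "significant": not (low <= 0.0 <= high)}


# Simulated world: polished studios ship demos more often AND earn more.
rng = np.random.default_rng(0)
n = 20_000
polish = rng.normal(size=n)
has_demo = (polish + rng.normal(size=n) > 0).astype(float)
log_revenue = 0.05 * has_demo + 1.0 * polish + rng.normal(size=n)  # true demo effect: 0.05

naive = log_revenue[has_demo == 1].mean() - log_revenue[has_demo == 0].mean()
result = lever_effect(log_revenue, has_demo, polish)
print(f"naive delta: {naive:.2f}")
print(f"residualized effect: {result['effect']:.2f}, significant: {result['significant']}")
```

In this simulation the naive difference in means is dominated by studio polish, while the residualized estimate lands near the small effect that was actually simulated. The `significant` flag mirrors the "not significant" tag: it is false whenever the interval overlaps zero.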
Total Lift Attribution: fixing the Steam UTM black hole
Steam’s native UTM tracking can under-report paid-campaign wishlists by up to approximately 75%. The reason is mechanical: users browse outside the Steam client, log in later, or follow links that lose UTM tags, and the wishlist event is recorded against an empty referrer. Devs see “$3.30 cost per wishlist” from a campaign that was actually delivering $1.71, decide it’s losing money, and kill it.
Total Lift Attribution fixes this. Upload your Steamworks “Wishlists by day” CSV and log your campaigns. We compute a rolling pre-campaign baseline, subtract it from observed wishlists in the campaign window, and surface your true per-campaign cost per wishlist alongside what Steam reported. That gap can be the difference between killing a profitable campaign and correctly scaling it.
What we don’t expose: the exact baseline window length, the robust-statistic choice, or the de-noising heuristics. The goal is reliable answers, not a recipe.
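The general shape of a baseline-subtraction calculation can still be sketched without those details. Everything specific below is an assumption chosen for illustration: the 14-day window, the median as the baseline statistic, the function names, and the sample numbers. None of it is the production configuration.

```python
from statistics import median


def campaign_lift(daily_wishlists, start, end, baseline_days=14):
    """Wishlists above baseline in the campaign window (start/end are inclusive indices)."""
    baseline_window = daily_wishlists[max(0, start - baseline_days):start]
    if not baseline_window:
        raise ValueError("need at least one pre-campaign day to form a baseline")
    baseline = median(baseline_window)  # assumed statistic; the real choice is undisclosed
    lift = sum(day - baseline for day in daily_wishlists[start:end + 1])
    return max(lift, 0.0)


def cost_per_wishlist(spend, wishlists):
    """Spend divided by attributed wishlists; infinite if nothing was attributed."""
    return spend / wishlists if wishlists > 0 else float("inf")


# Hypothetical export: 14 quiet days at 40/day, then a 7-day campaign at 100/day.
daily = [40] * 14 + [100] * 7
lift = campaign_lift(daily, start=14, end=20)
print(lift)                            # 420.0
print(cost_per_wishlist(840.0, 210))   # 4.0, if UTM tracking credited only 210 wishlists
print(cost_per_wishlist(840.0, lift))  # 2.0, using the baseline-subtracted lift
```

In this made-up case the UTM-reported figure is twice the lift-based one, which is the kind of gap that leads to a working campaign being cut.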
Comp set: named launches, not anonymous similarities
Every forecast surfaces five real historical launches similar to your game — named, linkable, with documented outcomes. Not anonymous “similar titles.” Indie devs can read the comp set and verify whether the model’s “what’s similar” intuition matches their own.
The retrieval is vector-similarity over a curated revenue-graded corpus, post-filtered for tier match so you don’t see a $3 mobile-style puzzle game appear as a comp for a $20 roguelite just because the descriptions share a few keywords. Each comp’s revenue figure carries a quality grade so you can weight better-documented comps more heavily in your own judgment.
When the comp set looks wrong, that itself is a signal: the model may be operating out-of-distribution. We flag this explicitly.
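Similarity retrieval with a tier post-filter and an out-of-distribution flag might look roughly like the sketch below. The embedding vectors, the three price tiers and their cutoffs, the 0.5 similarity threshold, and the game names are all hypothetical; only the overall flow (vector similarity, then tier filter, then top five) follows the description above.

```python
import numpy as np


def price_tier(price_usd):
    """Hypothetical tiering; real tier boundaries are not published."""
    if price_usd < 5:
        return "budget"
    if price_usd < 15:
        return "mid"
    return "premium"


def find_comps(query_vec, query_price, corpus, k=5, min_similarity=0.5):
    """Return (top-k comps as (similarity, name), out_of_distribution flag)."""
    q = np.asarray(query_vec, dtype=float)
    q = q / np.linalg.norm(q)
    tier = price_tier(query_price)
    scored = []
    for game in corpus:
        if price_tier(game["price"]) != tier:
            continue  # post-filter: drop games from a different price tier
        v = np.asarray(game["embedding"], dtype=float)
        scored.append((float(q @ v / np.linalg.norm(v)), game["name"]))
    scored.sort(reverse=True)
    top = scored[:k]
    out_of_distribution = not top or top[0][0] < min_similarity
    return top, out_of_distribution


corpus = [
    {"name": "Hypothetical Roguelite A", "price": 19.99, "embedding": [0.9, 0.1, 0.0]},
    {"name": "Hypothetical Puzzle B", "price": 2.99, "embedding": [1.0, 0.0, 0.0]},
    {"name": "Hypothetical Roguelite C", "price": 24.99, "embedding": [0.6, 0.8, 0.0]},
]
comps, ood = find_comps([1.0, 0.0, 0.0], 19.99, corpus)
print([name for _, name in comps], ood)
```

The $2.99 puzzle game has the closest embedding to the query but is excluded by the tier filter, so it never shows up as a comp for a $20 roguelite. When even the best surviving match is weak, the flag is raised instead of presenting a misleading comp set.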
Data sources and ground truth
Corpus: game-level metadata is sourced from Steam’s public Web and Storefront APIs: name, genres, tags, price, description, release date, review count, and other public-page fields. Ingestion runs at a polite, rate-limited pace from a single IP, with a transparent User-Agent header that identifies us.
Wishlist counts are not exposed by Steam. When you supply your own number, we use it. Otherwise we infer it from public follower-trajectory signals via genre-conditional mappings. Whether wishlist count is measured or inferred is disclosed in every forecast output, and the resulting uncertainty is propagated into the published interval.
Revenue ground truth for our calibration corpus is derived from public review counts combined with launch price and ownership-range data. Each comp game’s revenue figure is graded A/B/C/D depending on whether a developer postmortem, third-party historical data, or estimate-only is available; the grade is shown alongside each comp.
For Total Lift Attribution, you upload your own Steamworks wishlist export. That data stays scoped to your email + appid — we don’t share, sell, or train on it.
What we don’t claim
We are not predicting your launch outcome with certainty. The P50 point inside our cone has the same fundamental uncertainty as any other formula’s point estimate. Our value is that the P10–P90 range makes that uncertainty explicit and empirically validated rather than hidden.
Coverage guarantees are over the calibration distribution. If your game is unlike anything in our calibration corpus — a truly novel mechanic, no identifiable comp class — the model is operating out-of-distribution and our published coverage claim may not hold. We flag this when comp-set similarity falls below threshold, and recommend treating the interval as illustrative.
We do not have proprietary access to live Steamworks data unless you upload it yourself for the Attribution feature. Our edge on the forecasting side is rigorous calibration of the public-data signal plus honest causal estimates — not better data.
Wrong predictions and what they tell us
Every Steam revenue forecaster publishes their best case. We’re the only one that publishes the cases where any forecaster — including ours — would have been catastrophically wrong, and explains why.
Steam launch revenue follows a heavy-tailed distribution with documented breakout outliers an order of magnitude beyond the calibration set. Five named cases from the 2024–2025 cycle:
- Peak — ~266× wishlist-to-sales conversion. Pre-launch comp set predicted a long-tail hit; the actual outcome exceeded the P90 ceiling by approximately 10×.
- Nubby’s Number Factory — ~140× conversion. No comparable pre-launch comp; our model would have flagged this as out-of-distribution before any cone was published.
- Mage Arena — ~78× conversion. Genre cluster (multiplayer arena) had three established comps; outcome cleared P90 by ~3×.
- R.E.P.O. — ~68× conversion. Co-op horror cluster comps capped the P90 ceiling well below the realized outcome, which was driven by a streamer cycle.
- Webfishing — ~67× conversion. Cozy-multiplayer cluster has no pre-launch precedent at this scale; flagged OOD.
Three structural reasons our cone breaks at the top of the tail:
- Pre-launch signal is publicly censored. Wishlist count, follower trajectory, and review velocity capture the median launch trajectory. They cannot capture the streamer-pickup, Twitter-virality, or word-of-mouth shocks that 100×-the-median launches require, because those events are post-launch.
- The conformal interval is anchored on the calibration distribution. When the realized outcome is more than 5× the P90 ceiling, the calibration set never contained an analog. We disclose this rather than widen intervals to absurdity.
- Selection bias inflates "comparable" comp sets. Many breakouts have no genuine comp; the closest matches by description are still 50× below the realized outcome.
What this means for you: our cone is a credible 80% interval — not a 99% interval, not a 100% interval. There is a ~10% probability your launch lands above our P90, and roughly half of that mass is “hit harder than any pre-launch model could have predicted.” If you want to budget for that scenario, treat P90 as the floor of your upside case, not the ceiling.
We do not have a confidential moat that captures these breakouts. Anyone who claims to forecast a launch like Peak or Webfishing in advance is selling certainty that cannot exist on public pre-launch data. That gap is the one cost of being honest.
Who built this?
Built by Greg C. ([email protected]), a senior software engineer with production ML experience in predictive analytics for sports, a domain where calibrated prediction under fat-tailed uncertainty separates a working strategy from a losing one. The same calibration discipline that handles player-performance variance translates directly to indie launch revenue, where wishlist-to-sales conversion has the same heavy-tailed structure.
If something on this page is wrong or unclear, email me directly. I read every message.
Frequently asked questions
How do I forecast Steam launch revenue?
Modern Steam forecasting starts from wishlist count at launch and applies a per-genre conversion multiplier (typically 0.10×–0.30× of wishlists for first-week sales). The challenge: this multiplier varies 10–20× across launches at the same wishlist count. The honest approach is to publish a calibrated P10–P90 range whose coverage has been validated on held-out launches per genre, rather than a misleading point estimate.
Why are point-estimate Steam revenue forecasts misleading?
Steam launch revenue follows a heavy-tailed distribution: the standard wishlist-to-sales conversion ratio varies from 0.10× at the median to 10× at the upper decile. A point estimate of “$50K” on a launch where the true 80% interval is “$10K–$300K” isn’t predictive; it’s a coin flip presented as certainty.
What is calibrated forecasting and why does it matter?
A forecast cone is “calibrated” if its claimed coverage matches its empirical coverage: an 80%-coverage P10–P90 interval should contain the true outcome 80% of the time on held-out test data. Most Steam revenue calculators publish point estimates with no coverage validation. We validate coverage on real historical launches before any forecast is published, and re-validate on every retrain.
How do you compute the “what-if” lever effects?
We use modern causal-inference methods to residualize against confounders before estimating each lever’s effect. Naive perturbation of a single feature in a predictor captures correlation, not causation. Each lever ships with a 95% CI; when the CI overlaps zero we explicitly tag the row “not significant” rather than report an uncertain effect as if it were proven.
How does the Total Lift Attribution work?
You upload your Steamworks “Wishlists by day” CSV. For each campaign you log (Meta Ads, streamer, festival, etc.), we compute a rolling pre-campaign baseline and subtract it from observed wishlists in the campaign window. The difference is your true campaign lift. Combined with spend, that gives you the true cost per wishlist alongside what Steam UTM reported.
Can I rely on this for marketing-budget decisions?
The active model reaches 82% overall empirical coverage on held-out launches and passes the revenue-quartile coverage gate, with per-genre coverage all within tolerance of the 80% target. Out-of-distribution inputs are flagged before any cone is shown. The cone is fit for budget-decision use within its stratum; treat OOD-flagged forecasts as illustrative.
How is Steam Launch Forecaster different from the Boxleiter method?
The Boxleiter method is a single-multiplier heuristic (originally ~50 sales per Steam review, modernized to 30–63×). It produces a point estimate with no uncertainty quantification. Per the Boxleiter formula’s own author in 2023, ~24% of games are off by more than 30%. Steam Launch Forecaster publishes a calibrated P10–P90 range whose 80% empirical coverage is validated on held-out historical launches per genre cluster, plus per-quartile coverage gates that prevent top-revenue under-coverage. The Boxleiter heuristic is a back-of-envelope sanity check; our cone is a budgetable risk range. We also publish a Boxleiter cross-check on every forecast so devs can compare the two side-by-side.
How is Steam Launch Forecaster different from Gamalytic?
Gamalytic ($25–75/mo subscription) is primarily a market-intelligence dashboard for browsing the Steam catalog — revenue estimates, ownership ranges, comp-set lookup, genre and tag analytics. Their pricing is volume-based around how many appids you query per month. Steam Launch Forecaster is built around a single workflow: produce a calibrated risk range plus causal counterfactuals for one specific launch you’re planning, with empirical 80% coverage validation and a public log of where any forecaster (including us) is structurally wrong. We also ship Total Lift Attribution to recover true paid-campaign CPW from your Steamworks data — Gamalytic does not. Pricing is a single $299 launch report, not a recurring subscription.
How is Steam Launch Forecaster different from VG Insights?
VG Insights ($14.50/mo) is the closest methodology cousin — both publish methodology openly, both target the AA-and-indie analyst cohort, and both estimate revenue with stated margins. The VG Insights stack centers on revenue estimates with a stated ~±5% margin on aggregate market figures, plus catalog browsing tools. Steam Launch Forecaster differs in three ways: (1) our outputs are a P10–P90 cone with empirically-validated coverage rather than a single number with a margin; (2) we report causal lever effects with 95% CIs (and tag “not significant” when CIs overlap zero) rather than correlational deltas; (3) we publish a wrong-predictions log naming where any model breaks down on top-decile breakouts. Use VG Insights for ongoing market analytics; use us for a single launch’s budget-decision risk range.
Try it: free single-game forecast or upload your Steamworks data for true campaign attribution.