Methodology: how we forecast Steam launches

An honest technical overview of the calibrated risk-quantification engine shipping June 2026.

The variance problem with Steam launch estimates

The industry default for estimating Steam launch revenue is the Boxleiter heuristic: roughly 50 sales per Steam review in the original 2010s formulation, updated to approximately 30–63 sales per review in modern variants per GameDiscoverCo and Simon Carless’s 2025 data. For wishlist-to-sales conversion, the parallel rule of thumb is first-week sales ≈ 0.15–0.30 × wishlist count at launch.
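As a worked illustration of the two rules of thumb above, here is a minimal sketch. The function names are illustrative, and the default multipliers are simply the ranges quoted in this section, not fitted values.

```python
def sales_from_reviews(review_count, low_mult=30, high_mult=63):
    """Units-sold range implied by a review count (modern 30-63x multipliers)."""
    return review_count * low_mult, review_count * high_mult


def first_week_sales_from_wishlists(wishlists, low_rate=0.15, high_rate=0.30):
    """First-week sales range implied by the wishlist rule of thumb."""
    return round(wishlists * low_rate), round(wishlists * high_rate)


print(sales_from_reviews(400))                  # (12000, 25200)
print(first_week_sales_from_wishlists(20_000))  # (3000, 6000)
```

Even before any modeling, the heuristic itself yields a range about 2× wide, which is easy to lose when only the midpoint is quoted.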

These are point estimates. And that is where the problem starts.

GameDiscoverCo’s published 2024–2025 State of Steam Wishlist Conversions data shows that wishlist-at-launch vs. week-1-sales varies by 10–20× across launches with comparable wishlist counts. Documented high-end outliers include: Peak (266× conversion), Nubby’s Number Factory (140×), Mage Arena (78×), R.E.P.O. (68×), and Webfishing (67×). These are not data errors—they are features of the real distribution.

When the underlying distribution spans four or more orders of magnitude, a point estimate is structurally misleading. Publishing “$80K expected” when the honest answer is “$20K–$400K with 80% probability” does not simplify the problem—it hides it. The empirical case for risk-quantification is not that better models will narrow the interval to a fine point; it’s that the true interval is wide and you need to know that before committing marketing budget.

Sources: How To Market A Game — Boxleiter benchmarks; GameDiscoverCo State of Steam Wishlist Conversions (2024–2025).

Conformalized Quantile Regression: coverage guarantees

Our forecaster is built on Conformalized Quantile Regression (CQR), introduced by Romano, Patterson, & Candès (NeurIPS 2019).

The core idea: quantile regression produces lower- and upper-bound predictions; conformal prediction wraps those raw predictions with a calibration step that empirically corrects them so the resulting interval has a provable coverage probability. For any miscoverage level α (80% coverage corresponds to α = 0.20), the published P10–P90 interval contains the true revenue with probability at least 1−α, provided the new launch is exchangeable with the calibration launches. The guarantee holds in finite samples; it is not an asymptotic approximation.

Crucially, this is a distribution-free guarantee. It does not require assumptions about Gaussian errors, linear relationships, or any parametric form for the revenue distribution. The only assumption is exchangeability of the calibration and test sets within a genre group.
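The calibration step described above can be sketched in a few lines. This is a minimal illustration of the split-conformal correction from Romano et al. (2019), not the production code. It assumes the lower and upper quantile predictions already come from some fitted quantile regressor; the function and argument names are illustrative.

```python
import math


def cqr_correction(lo_preds, hi_preds, y_true, alpha=0.20):
    """Conformal correction Q computed on a calibration set.

    Score per launch: how far the outcome falls outside [lo, hi]
    (negative when it falls inside the raw interval).
    """
    scores = sorted(
        max(lo - y, y - hi) for lo, hi, y in zip(lo_preds, hi_preds, y_true)
    )
    n = len(scores)
    # Finite-sample rank: the ceil((n + 1) * (1 - alpha))-th smallest score.
    # The small epsilon guards against floating-point fuzz in the product.
    k = math.ceil((n + 1) * (1 - alpha) - 1e-9)
    if k > n:
        return math.inf  # too few calibration points for this alpha
    return scores[k - 1]


def conformalize(lo, hi, q):
    """Widen (or shrink, if q < 0) a raw quantile interval by q."""
    return lo - q, hi + q


q = cqr_correction([10, 20, 30, 40], [50, 60, 70, 80], [55, 30, 25, 90])
print(q)                           # 10
print(conformalize(100, 300, q))   # (90, 310)
```

The correction is a single number per calibration set. When the raw quantile model is too confident the correction is positive and the interval widens; when the raw model is too cautious the correction is negative and the interval tightens.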

This is the precise meaning of “calibrated” in our marketing copy. Before shipping each model version, we validate empirical coverage on a held-out test set and will publish those validation statistics at launch.
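The validation itself is a simple count. A minimal sketch, assuming the held-out launches and their published bounds are available as parallel lists; the 10-percentage-point default tolerance mirrors the figure quoted elsewhere on this page, and the names are illustrative.

```python
def empirical_coverage(lo_bounds, hi_bounds, y_true):
    """Fraction of held-out outcomes that fall inside their interval."""
    hits = sum(lo <= y <= hi for lo, hi, y in zip(lo_bounds, hi_bounds, y_true))
    return hits / len(y_true)


def within_tolerance(coverage, nominal=0.80, tolerance=0.10):
    """True when empirical coverage is close enough to the nominal level."""
    # Tiny epsilon so a gap of exactly `tolerance` is not rejected
    # because of floating-point rounding.
    return abs(coverage - nominal) <= tolerance + 1e-12


cov = empirical_coverage([10] * 5, [20] * 5, [15, 12, 25, 18, 11])
print(cov)                     # 0.8
print(within_tolerance(cov))   # True
```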

Foundational references: Romano, Y., Patterson, E., & Candès, E. (2019). Conformalized Quantile Regression. NeurIPS 2019. arXiv:1905.03222. — Vovk, V., Gammerman, A., & Shafer, G. (2005). Algorithmic Learning in a Random World. Springer.

Per-genre Mondrian calibration

Aggregate coverage across all genres can hide bucket-level miscalibration. A model might over-cover casual games (intervals too wide) and under-cover roguelites (intervals too narrow), while the global average looks acceptable. For indie dev decision-making, average coverage is the wrong statistic.

Our calibration is partitioned by macro-genre, applying the Mondrian Conformal Prediction approach (Vovk et al., 2005, §4.5). Each genre group gets its own calibration set and its own coverage guarantee, independently of the other groups.

The target: whatever your game’s macro-genre, the empirical coverage of the published P10–P90 cone stays within 10 percentage points of the nominal 80% on held-out launches in the same group. No “average looks fine” handwaving. If your game is a roguelite, the coverage claim is validated on held-out roguelite launches, not on a pool where fifty casual puzzle games mask a roguelite miss.
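The per-genre partitioning can be sketched as running the same conformal correction once per group. This is a minimal illustration under the assumption that calibration rows arrive as (genre, lower prediction, upper prediction, actual) tuples; the row layout and the genre labels are hypothetical.

```python
import math
from collections import defaultdict


def mondrian_corrections(calibration_rows, alpha=0.20):
    """One conformal correction per genre.

    calibration_rows: iterable of (genre, lo_pred, hi_pred, actual).
    """
    scores_by_genre = defaultdict(list)
    for genre, lo, hi, actual in calibration_rows:
        scores_by_genre[genre].append(max(lo - actual, actual - hi))

    corrections = {}
    for genre, scores in scores_by_genre.items():
        scores.sort()
        n = len(scores)
        k = math.ceil((n + 1) * (1 - alpha) - 1e-9)
        # A genre with too few calibration launches gets an infinite interval.
        corrections[genre] = scores[k - 1] if k <= n else math.inf
    return corrections


rows = [
    ("roguelite", 10, 50, 55), ("roguelite", 20, 60, 30),
    ("roguelite", 30, 70, 25), ("roguelite", 40, 80, 90),
    ("puzzle", 10, 20, 15), ("puzzle", 10, 20, 12),
    ("puzzle", 10, 20, 18), ("puzzle", 10, 20, 19),
]
print(mondrian_corrections(rows))  # {'roguelite': 10, 'puzzle': -1}
```

In this toy data the under-covered group (roguelite) gets a positive correction that widens its intervals, while the over-covered group (puzzle) gets a negative one that tightens them. A single pooled correction would apply one compromise value to both.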

Nearest-neighbor comp set: explainability

Every forecast surfaces five historical comp launches similar to the user’s game. Named, linkable, documented games—not anonymous “similar titles.”

Similarity is computed using modern semantic embeddings of game descriptions and tag sets, with structural filters on macro-genre, price tier, and release window applied before the nearest-neighbor ranking. The point of those filters: prevent a $3 mobile-style puzzle game from appearing as a comp for a $20 roguelite just because the descriptions happen to share a few keywords.
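The filter-then-rank order can be sketched as below. This is an illustration only: the dictionary fields, the use of cosine similarity, and the release window expressed as a maximum gap in years are all assumptions, and the embeddings are presumed to be computed elsewhere.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def top_comps(query, catalog, k=5, max_year_gap=3):
    """Filter on structure first, then rank survivors by embedding similarity."""
    candidates = [
        g for g in catalog
        if g["genre"] == query["genre"]
        and g["price_tier"] == query["price_tier"]
        and abs(g["year"] - query["year"]) <= max_year_gap
    ]
    ranked = sorted(
        candidates,
        key=lambda g: cosine(query["embedding"], g["embedding"]),
        reverse=True,
    )
    return ranked[:k]
```

Because the structural filters run first, a game with a near-identical description but the wrong genre or price tier never reaches the ranking step, however high its embedding similarity.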

Why this matters for users: indie developers can read the five comp games and verify whether the model’s “what’s similar” intuition matches their own. Anonymous comps destroy trust; named real launches with documented outcomes build it. When the comp set looks wrong, that is itself a signal that the model may be operating out-of-distribution—a flag we surface explicitly.

Data sources and ground truth

Corpus: Game-level metadata is sourced from Steam’s public Web and Storefront APIs: name, genres, Steam tags, price, description, release date, review count, and other public-page fields. Ingestion runs sequentially at a polite rate from a single IP, with a transparent User-Agent header that identifies us, following standard HTTP etiquette.
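A minimal sketch of what sequential, rate-limited ingestion with an identifying User-Agent can look like. The interval value and the function names are assumptions, and the actual network call is injected as `fetch` so that no specific Steam endpoint is implied.

```python
import time
import urllib.request


def build_request(url, user_agent):
    """Attach an identifying User-Agent so the operator can be contacted."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch_sequentially(urls, fetch, min_interval_s=1.5,
                       sleep=time.sleep, clock=time.monotonic):
    """Call fetch(url) one URL at a time, at least min_interval_s apart."""
    results, last_start = [], None
    for url in urls:
        if last_start is not None:
            wait = min_interval_s - (clock() - last_start)
            if wait > 0:
                sleep(wait)
        last_start = clock()
        results.append(fetch(url))
    return results
```

Passing `sleep` and `clock` as parameters keeps the pacing logic testable without real delays.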

Wishlist counts are not exposed by Steam. Where a developer doesn’t supply their own number, we infer it from public follower data via genre-conditional multipliers based on the published methodology from GameDiscoverCo and WN Hub. The fact that wishlist count is inferred (rather than measured) is disclosed in every forecast output, and the resulting uncertainty is propagated into the published interval.
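The inference step can be pictured as follows. This is a sketch only: the multiplier values below are placeholders invented for illustration, not the published GameDiscoverCo or WN Hub figures, and the type and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class WishlistEstimate:
    low: int
    mid: int
    high: int
    inferred: bool  # True when derived from followers rather than supplied


# Placeholder (low, mid, high) wishlists-per-follower multipliers.
# Illustrative values only; real ones would come from published benchmarks.
FOLLOWER_MULTIPLIERS = {
    "roguelite": (6.0, 9.0, 14.0),
    "puzzle": (5.0, 8.0, 12.0),
}


def estimate_wishlists(genre, followers=None, supplied_wishlists=None):
    """Prefer the developer's own number; otherwise infer a range."""
    if supplied_wishlists is not None:
        w = supplied_wishlists
        return WishlistEstimate(w, w, w, inferred=False)
    lo, mid, hi = FOLLOWER_MULTIPLIERS[genre]
    return WishlistEstimate(
        round(followers * lo), round(followers * mid), round(followers * hi),
        inferred=True,
    )


print(estimate_wishlists("roguelite", followers=1000))
```

Carrying a low/mid/high range and an `inferred` flag, rather than a single number, is one way to let the inference uncertainty flow into the downstream interval and into the disclosure shown to the user.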

Revenue ground truth is derived from public review counts using the standard industry multipliers (Carless’s New Boxleiter family) combined with launch price. Each comp game’s revenue figure carries an evidence grade from A to D, reflecting whether it is backed by a developer postmortem, by third-party historical data, or by an estimate only, and the grade is shown alongside each comp.
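The arithmetic behind a review-derived revenue figure can be sketched as below. The 30–63 multiplier range is the one quoted earlier on this page; treating revenue as gross (units × launch price, ignoring discounts, refunds, regional pricing, and platform fees) is a simplifying assumption, and the names are illustrative.

```python
from dataclasses import dataclass


@dataclass
class RevenueEstimate:
    low_usd: float
    high_usd: float
    grade: str  # evidence grade, "A" through "D"


def revenue_from_reviews(review_count, launch_price_usd, grade,
                         low_mult=30, high_mult=63):
    """Gross revenue range: reviews x sales-per-review x launch price."""
    if grade not in {"A", "B", "C", "D"}:
        raise ValueError(f"unknown evidence grade: {grade!r}")
    return RevenueEstimate(
        low_usd=review_count * low_mult * launch_price_usd,
        high_usd=review_count * high_mult * launch_price_usd,
        grade=grade,
    )


print(revenue_from_reviews(200, 10.0, "C"))
```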

Bootstrap corpus: Early calibration work used a sample from the Steam Dataset 2025 (vintagedon/steam-dataset-2025), which is licensed under CC BY 4.0 and attributed here.

What we don’t claim

We are not predicting your launch outcome with certainty. The P50 point inside our range has the same fundamental uncertainty as any other formula’s point estimate; our value is that the P10–P90 range makes the uncertainty explicit and empirically validated rather than hidden.

Coverage guarantees are over the calibration distribution. If your game is unlike anything in our calibration corpus—a truly novel mechanic, no identifiable comp class, a genre we don’t have sufficient calibration data for—the model is operating out-of-distribution and our published coverage claim may not hold. We flag this explicitly when comp-set similarity falls below a threshold, and recommend treating the interval as illustrative rather than statistically grounded.
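One simple way such a flag could work is sketched below. The threshold value, the minimum comp count, and the use of the mean similarity are assumptions made for illustration; the actual rule may differ.

```python
def out_of_distribution(comp_similarities, threshold=0.60, min_comps=5):
    """Flag a forecast when the comp set is too sparse or too dissimilar."""
    if len(comp_similarities) < min_comps:
        return True
    mean_similarity = sum(comp_similarities) / len(comp_similarities)
    return mean_similarity < threshold


print(out_of_distribution([0.9, 0.8, 0.85, 0.7, 0.75]))  # False
print(out_of_distribution([0.5, 0.4, 0.6, 0.3, 0.2]))    # True
```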

We do not have proprietary access to developer wishlist counts, Steamworks backend data, or pre-launch metrics that incumbents may claim. Our edge is rigorous calibration of the public-data signal, not better data. The wishlist-count inference is acknowledged above, and its uncertainty is propagated into the forecast interval.

Who built this?

Built by Greg C. ([email protected]), a senior software engineer with production ML experience in predictive analytics for sports—a domain where calibrated revenue prediction under fat-tailed uncertainty separates a working strategy from a losing one. The same calibration methodology that handles player-performance variance translates directly to indie launch revenue, where wishlist-to-sales conversion has the same heavy-tailed structure. The forecaster shipping June 2026 is that approach, applied to Steam.

If something on this page is wrong or unclear, email me directly. I read every message.

Frequently asked questions

How do I forecast Steam launch revenue?

Modern Steam forecasting starts from wishlist count at launch and applies a per-genre conversion multiplier (typically 0.10×–0.30× of wishlists for first-week sales). The challenge is that this multiplier varies by 10–20× across launches at the same wishlist count. The honest approach: use a conformalized quantile regression model that publishes calibrated P10–P90 ranges rather than a misleading point estimate, validated per genre on held-out historical launches.

Why are point-estimate Steam revenue forecasts misleading?

Steam launch revenue follows a heavy-tailed distribution: the standard wishlist-to-sales conversion ratio varies from 0.10× at the median to 10× at the upper decile (per GameDiscoverCo 2024–2025 data). A point estimate of “$50K” on a launch where the true 80% interval is “$10K–$300K” isn’t predictive; it is a single draw from a wide distribution presented as certainty. Calibrated conformal-prediction ranges quantify this uncertainty empirically rather than hiding it.

What is calibrated forecasting and why does it matter for Steam launches?

A forecast cone is “calibrated” if its claimed coverage matches its empirical coverage: an 80%-coverage P10–P90 interval should contain the true outcome about 80% of the time on held-out test data. Most Steam revenue calculators publish point estimates with no coverage validation. Calibrated forecasting uses statistical procedures from the conformal-prediction literature to construct the interval, then verifies the coverage claim on real historical launches before any forecast is published.

How does Steam Launch Forecaster differ from Boxleiter?

The Boxleiter formula is a point estimator: units sold ≈ review count × ~50 (or first-week sales ≈ wishlists × 0.20), with revenue following from units × price. It works for industry-aggregate analysis but is notoriously inaccurate per-game (24% of games are >30% off per the formula’s own author in 2023). Steam Launch Forecaster publishes calibrated risk ranges (P10–P90) using conformalized quantile regression with per-genre calibration. Different problem: we quantify uncertainty rather than pretend it doesn’t exist.

Can I rely on Steam Launch Forecaster for marketing-budget decisions?

The product ships June 2026. We will publish empirical coverage validation on held-out launches per macro-genre at ship time. If our P10–P90 cone holds within 10 percentage points of 80% nominal coverage on held-out launches in your genre, the cone is fit for budget-decision use. We will surface a confidence flag when the input game is out-of-distribution relative to our calibration corpus.

References

— Romano, Y., Patterson, E., & Candès, E. (2019). Conformalized Quantile Regression. NeurIPS 2019. arXiv:1905.03222
— Vovk, V., Gammerman, A., & Shafer, G. (2005). Algorithmic Learning in a Random World. Springer. (Foundational text on conformal prediction; Mondrian CP introduced in §4.5.)
— Carless, S. / GameDiscoverCo. State of Steam Wishlist Conversions (2024–2025). gamediscoverco.substack.com
— How To Market A Game. Boxleiter Number benchmarks (2023). howtomarketagame.com
— Steam Dataset 2025. vintagedon/steam-dataset-2025 (CC BY 4.0).

Pre-sale at $50 locks in beta access to the full calibrated engine when it ships June 2026, plus 3 months at public launch. → See homepage