Prediction Market Forecasting

#1

ranked forecaster on ProphetArena, beating GPT-5, Gemini & Claude

↗ Blog post

69%

reduction in calibration error vs. base model on live Polymarket questions

↗ Blog post

10–100×

smaller than the frontier models it beats

↗ TMLR paper

Example prediction questions

The kinds of questions a model trained on your data can answer.

DATASET: Market outcomes

Question: Will the US Fed cut rates by more than 25bps before end of Q3?
Question source: Wall Street Journal, Jun 12, 2025. "Fed officials signal openness to larger cuts if labor market softens"
Label: No
Type: binary
Confidence: 0.88
Label source: Federal Reserve, Sep 30, 2025. "FOMC statement: 25 basis point reduction in target range"

DATASET: Market outcomes

Question: Will OpenAI ship a new flagship reasoning model to the ChatGPT API before Anthropic ships a Claude 4.5 successor, as of Polymarket close on Mar 31, 2026?
Question source: The Information, Jan 9, 2026. "Labs race to refresh flagship APIs after holiday traffic spike"
Label: Yes
Type: binary
Confidence: 0.76
Label source: OpenAI, Mar 18, 2026. "API changelog: new flagship reasoning tier in gradual rollout"

DATASET: Market outcomes

Question: Will OPEC+ announce a combined voluntary cut of at least 1 million barrels per day at the Jan 5, 2026 ministerial videoconference?
Question source: Reuters, Dec 12, 2025. "Delegates expect voluntary cuts extension; deeper reduction on agenda"
Label: No
Type: binary
Confidence: 0.83
Label source: OPEC, Jan 5, 2026. "Joint Ministerial Monitoring Committee rolls over existing quotas"
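The example cards above share one shape: a question, the article that prompted it, a resolved label with a confidence, and the source that settled it. A minimal sketch of that shape as a Python record follows. The class and field names (`Source`, `MarketOutcomeRecord`, `question_type`, `outcome`) are illustrative assumptions, not the schema of the actual dataset.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Source:
    """A cited article or statement (hypothetical structure)."""
    publisher: str
    published: date
    headline: str


@dataclass(frozen=True)
class MarketOutcomeRecord:
    """One resolved question, mirroring the fields shown on the cards."""
    question: str
    question_source: Source
    label: str            # "Yes" or "No" for binary questions
    question_type: str    # the "Type" field on the card, e.g. "binary"
    confidence: float     # the "Confidence" field, expected in [0, 1]
    label_source: Source

    def __post_init__(self):
        # Basic sanity checks; the real dataset may enforce different rules.
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError("confidence must be between 0 and 1")
        if self.question_type == "binary" and self.label not in ("Yes", "No"):
            raise ValueError("binary questions need a Yes/No label")

    def outcome(self) -> int:
        """Numeric outcome for scoring: 1 for Yes, 0 for No."""
        return 1 if self.label == "Yes" else 0


# The first example card, expressed as a record.
fed_record = MarketOutcomeRecord(
    question="Will the US Fed cut rates by more than 25bps before end of Q3?",
    question_source=Source(
        "Wall Street Journal",
        date(2025, 6, 12),
        "Fed officials signal openness to larger cuts if labor market softens",
    ),
    label="No",
    question_type="binary",
    confidence=0.88,
    label_source=Source(
        "Federal Reserve",
        date(2025, 9, 30),
        "FOMC statement: 25 basis point reduction in target range",
    ),
)
```

Keeping the label as text and deriving the 0/1 outcome on demand keeps the record close to what the cards display while still being usable for scoring.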

Key results

Benchmark comparisons against frontier models

ProphetArena Overall Leaderboard

Foresight V3 holds the #1 spot on ProphetArena's live benchmark, ahead of Gemini 3 Pro and GPT-5.2 — while being 10–100× smaller than the frontier models it beats.

Figure: ProphetArena leaderboard (Foresight V3 #1, Gemini 3 Pro #2, GPT-5.2 #3)

↗ Blog post

Live Polymarket Benchmark

On 251 live Polymarket questions, Foresight-32B achieved a Brier score of 0.199 vs. GPT-5's 0.207 (lower is better), with 69% lower calibration error (ECE 6.0% vs. 16.1%) and a positive simulated trading profit, while frontier models lost money.
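For readers unfamiliar with the two metrics, the sketch below shows the standard definitions of the Brier score and a binned expected calibration error (ECE) for binary questions. It assumes ten equal-width bins and runs on made-up toy data. The benchmark's actual binning and evaluation code are not described on this page, so this is an illustration only.

```python
def brier_score(probs, outcomes):
    """Mean squared difference between forecast probability and 0/1 outcome."""
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)


def expected_calibration_error(probs, outcomes, n_bins=10):
    """Weighted average gap between mean forecast and observed frequency per bin.

    Uses equal-width probability bins; n_bins=10 is an assumption.
    """
    bins = [[] for _ in range(n_bins)]
    for p, o in zip(probs, outcomes):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[idx].append((p, o))

    total = len(probs)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_prob = sum(p for p, _ in members) / len(members)
        hit_rate = sum(o for _, o in members) / len(members)
        ece += (len(members) / total) * abs(mean_prob - hit_rate)
    return ece


def relative_reduction(new_value, baseline_value):
    """Fractional reduction of new_value relative to baseline_value."""
    return 1.0 - new_value / baseline_value


# Toy forecasts and outcomes (invented for illustration, not benchmark data).
toy_probs = [0.9, 0.8, 0.3, 0.1]
toy_outcomes = [1, 1, 0, 0]

toy_brier = brier_score(toy_probs, toy_outcomes)
toy_ece = expected_calibration_error(toy_probs, toy_outcomes)

# Relative reduction computed from the two ECE values quoted in the text.
quoted_reduction = relative_reduction(6.0, 16.1)
```

Applied to the two ECE values quoted above (6.0% and 16.1%), `relative_reduction` gives about 63%. The 69% headline figure is described elsewhere on this page as a reduction relative to the base model, whose ECE is not stated here.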

Figure: three bar charts comparing Brier score, calibration error, and simulated trading profit. Foresight-32B leads on all three vs. o3, Gemini 2.5 Pro, Grok-4, and Claude Opus.

↗ Blog post
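The page does not say how the simulated trading profit was computed. One common way to set up such a simulation is sketched below, under stated assumptions: the model buys a single YES contract when its probability is above the market price, buys a single NO contract when it is below, and each contract pays 1 if it resolves in the buyer's favor. The function name, the one-contract stake, and the absence of fees are all hypothetical choices, not the benchmark's method.

```python
def simulated_profit(model_probs, market_prices, outcomes):
    """Total profit from a simple one-contract-per-question strategy.

    model_probs:   model's probability that each question resolves Yes
    market_prices: market price of a YES contract, between 0 and 1
    outcomes:      1 if the question resolved Yes, else 0
    """
    profit = 0.0
    for prob, price, outcome in zip(model_probs, market_prices, outcomes):
        if prob > price:
            # Buy YES at `price`; it pays 1 if the outcome is Yes.
            profit += outcome - price
        elif prob < price:
            # Buy NO at `1 - price`; it pays 1 if the outcome is No.
            profit += (1 - outcome) - (1 - price)
        # If the model agrees with the market exactly, no trade is placed.
    return profit


# Toy inputs (invented for illustration, not benchmark data).
toy_profit = simulated_profit(
    model_probs=[0.7, 0.2],
    market_prices=[0.5, 0.4],
    outcomes=[1, 0],
)
```

Under this setup a forecaster only profits when its disagreements with the market tend to be right, which is why trading profit is often reported alongside accuracy and calibration metrics.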

Explore

Primary write-ups and artifacts for this solution.

Case study: How we built the top-ranked AI forecaster →

Paper: Supervision at scale from real-world outcomes →

Dataset: Training data: market-style questions with resolved outcomes →

Model: Open forecasting model (research release) →

Notebook: Polymarket Backtesting Notebook →

Ready to build your own expert?