Prediction Market Betting

A 32B model that beats frontier LLMs up to 100× its size on live markets

#1 ranked forecaster on ProphetArena, beating GPT-5, Gemini & Claude
↗ Blog post

69% reduction in calibration error vs. base model on live Polymarket questions
↗ Blog post

10–100× smaller than the frontier models it beats
↗ TMLR paper

What we did


Example datapoint

A sample training example — question, source, and outcome-derived label.


Results

Benchmark comparisons against frontier models.

ProphetArena Overall Leaderboard

Foresight V3 holds the #1 spot on ProphetArena's live benchmark, ahead of Gemini 3 Pro and GPT-5.2 — while being 10–100× smaller than the frontier models it beats.

ProphetArena leaderboard: Foresight V3 #1, Gemini 3 Pro #2, GPT-5.2 #3

Live Polymarket Benchmark

On 251 live Polymarket questions, Foresight-32B achieved a Brier score of 0.199 vs. GPT-5's 0.207 (lower is better), with 69% lower calibration error (ECE 6.0% vs. 16.1%) and positive simulated trading profit, while frontier models lost money.

Three bar charts comparing Brier score, calibration error, and simulated trading profit: Foresight-32B leads on all three vs. o3, Gemini 2.5 Pro, Grok-4, and Claude Opus
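
For readers unfamiliar with the two accuracy metrics above, here is a minimal sketch of how a Brier score and an expected calibration error (ECE) are commonly computed for binary questions. It is an illustration under stated assumptions, not the evaluation code behind these results: the function names, the 10 equal-width bins, and the toy forecasts are all hypothetical, and the benchmark may bin or weight differently. Simulated trading profit is not covered because the page does not describe the trading rule.

```python
from typing import List, Sequence, Tuple


def brier_score(probs: Sequence[float], outcomes: Sequence[int]) -> float:
    """Mean squared error between forecast probabilities and 0/1 outcomes."""
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)


def expected_calibration_error(
    probs: Sequence[float], outcomes: Sequence[int], n_bins: int = 10
) -> float:
    """ECE with equal-width bins (an assumed binning scheme).

    Each bin contributes |mean forecast - observed frequency|,
    weighted by the share of questions that fall into it.
    """
    bins: List[List[Tuple[float, int]]] = [[] for _ in range(n_bins)]
    for p, y in zip(probs, outcomes):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[idx].append((p, y))

    total = len(probs)
    ece = 0.0
    for members in bins:
        if not members:
            continue  # empty bins carry no weight
        mean_p = sum(p for p, _ in members) / len(members)
        hit_rate = sum(y for _, y in members) / len(members)
        ece += (len(members) / total) * abs(mean_p - hit_rate)
    return ece


if __name__ == "__main__":
    # Toy forecasts and resolutions, made up for illustration only.
    toy_probs = [0.9, 0.8, 0.3, 0.1]
    toy_outcomes = [1, 1, 0, 0]
    print("Brier:", brier_score(toy_probs, toy_outcomes))
    print("ECE:  ", expected_calibration_error(toy_probs, toy_outcomes))
```

Both functions return fractions between 0 and 1, so an ECE reported as 6.0% corresponds to a return value of 0.06. Lower is better for both metrics.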

Read more

Papers, models, datasets, notebooks, and write-ups for this case study.

Ready to build your own expert?

Leverage your own raw data or use public sources. No labeling required.