A 32B model that beats frontier LLMs 100× its size on live markets
A sample training example — question, source, and outcome-derived label.
Benchmark comparisons against frontier models.
Foresight V3 holds the #1 spot on ProphetArena's live benchmark, ahead of Gemini 3 Pro and GPT-5.2 — while being 10–100× smaller than the frontier models it beats.
On 251 live Polymarket questions, Foresight-32B achieved Brier score 0.199 vs. GPT-5's 0.207 — with 69% lower calibration error (ECE 6.0% vs. 16.1%) and positive simulated trading profit while frontier models lost money.
Papers, models, datasets, notebooks, and write-ups for this case study.
Leverage your own raw data or use public sources. No labeling required.