Prediction Markets

#1

overall on Prophet Arena, ahead of GPT-5.2, Gemini 3 Pro, Claude, Grok, and Kimi

69%

lower calibration error vs. the base model on live Polymarket questions

~65%

of the Brier gap closed between the base model and Polymarket prices
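For readers who want the arithmetic behind the last two figures, here is a minimal sketch of the usual definitions. The case study does not state the exact formulas it used, so the mean-squared Brier score, the ten-bin calibration error, and the `gap_closed` helper below are assumptions, and the numbers in the example call are invented for illustration.

```python
from typing import Sequence


def brier_score(probs: Sequence[float], outcomes: Sequence[int]) -> float:
    """Mean squared error between forecast probabilities and 0/1 outcomes."""
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)


def expected_calibration_error(
    probs: Sequence[float], outcomes: Sequence[int], n_bins: int = 10
) -> float:
    """Binned calibration error: weighted mean of |avg forecast - observed rate|."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, outcomes):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls in the last bin
        bins[idx].append((p, y))
    total = len(probs)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        avg_p = sum(p for p, _ in bucket) / len(bucket)
        freq = sum(y for _, y in bucket) / len(bucket)
        ece += len(bucket) / total * abs(avg_p - freq)
    return ece


def gap_closed(base: float, tuned: float, reference: float) -> float:
    """Fraction of the (base - reference) score gap removed by the tuned model."""
    return (base - tuned) / (base - reference)


# Illustrative Brier scores only, not results from the case study.
print(round(gap_closed(base=0.25, tuned=0.22, reference=0.20), 2))
```

Under these definitions, a lower Brier score is better, so "gap closed" measures how far the tuned model moved from the base model's score toward the market's score.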

Foresight V3 was trained on historical open-web news. The SDK generated prediction questions using only information available at each article's timestamp and resolved them from later sources; Foresight Learning then reinforced the reasoning paths that produced better probabilities. The resulting model reached #1 on Prophet Arena, where every model receives identical context.
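The timestamp discipline described above can be sketched roughly as follows. This is not the SDK's API: `Article`, `Question`, `build_question`, `resolve`, and `keyword_judge` are hypothetical names, and the question text and the judging logic are toy placeholders.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Callable, Optional, Sequence, Tuple


@dataclass(frozen=True)
class Article:
    published: datetime
    text: str


@dataclass(frozen=True)
class Question:
    text: str
    asked_at: datetime            # timestamp of the source article
    context: Tuple[Article, ...]  # only articles published at or before asked_at


def build_question(source: Article, corpus: Sequence[Article], text: str) -> Question:
    """Attach only context that was available at the source timestamp."""
    visible = tuple(a for a in corpus if a.published <= source.published)
    return Question(text=text, asked_at=source.published, context=visible)


def resolve(
    question: Question,
    corpus: Sequence[Article],
    judge: Callable[[Question, Article], Optional[int]],
) -> Optional[int]:
    """Resolve from strictly later articles; None if none of them decides it."""
    later = sorted(
        (a for a in corpus if a.published > question.asked_at),
        key=lambda a: a.published,
    )
    for article in later:
        verdict = judge(question, article)
        if verdict is not None:
            return verdict
    return None


corpus = [
    Article(datetime(2024, 1, 1), "Board schedules a vote on the merger."),
    Article(datetime(2024, 1, 5), "Analysts expect the merger vote to be close."),
    Article(datetime(2024, 2, 1), "Merger approved by the board."),
]
question = build_question(corpus[1], corpus, "Will the board approve the merger?")


def keyword_judge(q: Question, article: Article) -> Optional[int]:
    # Toy stand-in for a real resolver: 1 if the article reports approval.
    return 1 if "approved" in article.text else None


outcome = resolve(question, corpus, keyword_judge)
print(len(question.context), outcome)  # prints "2 1"
```

The point of the split is that the forecaster never sees anything published after `asked_at`, while the label comes only from material published afterwards.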

What we did

Generated forecasting questions from historical open-web news using only information available at the source timestamp.
Resolved each question from later evidence, so outcomes came from what actually happened instead of human labels.
Fine-tuned with Foresight Learning, where multiple probability rollouts are scored against the resolved outcome.
Evaluated on Prophet Arena and live Polymarket questions against frontier models.
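As a rough illustration of the third step, the sketch below scores several probability rollouts for one question against its resolved outcome. The case study does not say which scoring rule Foresight Learning uses or how scores feed into the update, so the negative Brier reward and the group-mean baseline here are assumptions, and `brier_reward` and `rollout_advantages` are hypothetical names.

```python
from statistics import mean
from typing import List


def brier_reward(prob: float, outcome: int) -> float:
    """Negative squared error, so a better forecast earns a higher reward."""
    return -((prob - outcome) ** 2)


def rollout_advantages(probs: List[float], outcome: int) -> List[float]:
    """Reward of each rollout minus the mean reward of its group."""
    rewards = [brier_reward(p, outcome) for p in probs]
    baseline = mean(rewards)
    return [r - baseline for r in rewards]


# Four hypothetical rollouts for a question that resolved "yes" (outcome = 1).
advantages = rollout_advantages([0.9, 0.7, 0.4, 0.2], outcome=1)
print([round(a, 3) for a in advantages])
```

With this baseline, rollouts that landed closer to the resolved outcome than the group average get a positive value and the rest get a negative one, which is one plausible way to decide which reasoning paths to reinforce.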

Primary artifacts for this case study.