Economic Forecasting

A 32B model that beats GPT-5 on Fed Beige Book macro questions

22%
lower Brier score than GPT-5 on Fed Beige Book forecasting questions
better calibration than GPT-5 (ECE 0.029 vs. 0.188)
fewer output tokens than GPT-5, making inference dramatically cheaper at higher accuracy

What we did


Example datapoint

A sample training example — question, source, and outcome-derived label.


Results

Benchmark comparisons against frontier models.

Better Accuracy, Skill, and Calibration vs. GPT-5

Trained on Fed Beige Book narratives, Foresight posts a Brier score of 0.155 vs. 0.199 for GPT-5 and 0.211 for the base model — a 22% reduction in error. It is the only model to beat the base rate (Brier Skill Score +6.2% vs. −20.7% for GPT-5 and −27.7% for the base model), and cuts calibration error (ECE) by ~6× vs. GPT-5.

Three bar charts comparing Foresight (step150), GPT-5, and the base model on Binary Brier (0.155 / 0.199 / 0.211), Brier Skill Score vs. base rate (+6.2% / −20.7% / −27.7%), and Binary ECE (0.029 / 0.188 / 0.189) for Fed Beige Book forecasting
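The three headline metrics are standard forecasting scores: the Brier score (mean squared error of predicted probabilities), the Brier Skill Score (improvement over always predicting the historical base rate), and Expected Calibration Error. A minimal pure-Python sketch of how they are computed; the function names are illustrative, not the evaluation code used in this study:

```python
def brier_score(probs, outcomes):
    """Mean squared error between predicted probabilities and 0/1 outcomes."""
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)

def brier_skill_score(probs, outcomes):
    """1 - Brier/Brier_baseline, where the baseline always predicts the base rate.
    Positive means the model beats the base rate; negative means it loses to it."""
    base_rate = sum(outcomes) / len(outcomes)
    baseline = brier_score([base_rate] * len(outcomes), outcomes)
    return 1.0 - brier_score(probs, outcomes) / baseline

def expected_calibration_error(probs, outcomes, n_bins=10):
    """Bin forecasts by predicted probability; ECE is the size-weighted average
    gap between each bin's mean prediction and its empirical hit rate."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, outcomes):
        bins[min(int(p * n_bins), n_bins - 1)].append((p, y))
    n, ece = len(probs), 0.0
    for bucket in bins:
        if bucket:
            mean_p = sum(p for p, _ in bucket) / len(bucket)
            hit_rate = sum(y for _, y in bucket) / len(bucket)
            ece += (len(bucket) / n) * abs(mean_p - hit_rate)
    return ece
```

Lower is better for Brier and ECE; higher is better for the skill score, which is why a positive BSS (+6.2%) means Foresight is the only model that out-forecasts the base rate itself.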

Probabilities That Match Reality — Unlike GPT-5

Foresight (yellow) hugs the perfect-calibration diagonal — when it says 30%, roughly 30% of events materialize. GPT-5 and the base model are systematically overconfident, drifting well below the line at higher probabilities.

Reliability diagram across deciles showing Foresight (step150) tracking the perfect-calibration diagonal while GPT-5 and the base model deviate significantly below it
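A reliability diagram of this kind can be built by grouping forecasts into probability deciles and plotting each bin's mean prediction against its observed frequency. A brief sketch under the same assumptions as above (illustrative names, not the study's plotting code):

```python
def reliability_curve(probs, outcomes, n_bins=10):
    """Group forecasts into probability deciles and return, per non-empty bin,
    (mean predicted probability, observed frequency, count). A well-calibrated
    model's points lie on the y = x diagonal."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, outcomes):
        bins[min(int(p * n_bins), n_bins - 1)].append((p, y))
    curve = []
    for bucket in bins:
        if bucket:
            mean_p = sum(p for p, _ in bucket) / len(bucket)
            freq = sum(y for _, y in bucket) / len(bucket)
            curve.append((mean_p, freq, len(bucket)))
    return curve
```

Points below the diagonal at high probabilities, as with GPT-5 and the base model here, indicate overconfidence: the model says 80% but the event happens less often than that.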

Read more

Papers, models, datasets, notebooks, and write-ups for this case study.

Ready to build your own expert?

Leverage your own raw data or use public sources. No labeling required.