Economic Forecasting

A 32B model that beats GPT-5 on Fed Beige Book macro questions

22%
lower Brier score than GPT-5 on Fed Beige Book forecasting questions
better calibration than GPT-5 (ECE 0.029 vs. 0.188)
fewer output tokens than GPT-5, making inference dramatically cheaper at higher accuracy

What we did


Example datapoint

A sample training example — question, source, and outcome-derived label.


Results

Benchmark comparisons against frontier models.

Better Accuracy, Skill, and Calibration vs. GPT-5

Trained on Fed Beige Book narratives, Foresight posts a Brier score of 0.155 vs. 0.199 for GPT-5 and 0.211 for the base model — a 22% reduction in error. It is the only model to beat the base rate (Brier Skill Score +6.2% vs. −20.7% for GPT-5 and −27.7% for the base model), and cuts calibration error (ECE) by ~6× vs. GPT-5.

Three bar charts comparing Foresight (step150), GPT-5, and the base model on Binary Brier (0.155 / 0.199 / 0.211), Brier Skill Score vs. base rate (+6.2% / −20.7% / −27.7%), and Binary ECE (0.029 / 0.188 / 0.189) for Fed Beige Book forecasting
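The three headline metrics are standard forecasting scores: the Brier score (mean squared error of predicted probabilities), the Brier Skill Score (improvement over always predicting the historical base rate), and Expected Calibration Error. A minimal pure-Python sketch of how they are computed; the function names are illustrative, not the evaluation code used in this study:

```python
def brier_score(probs, outcomes):
    """Mean squared error between predicted probabilities and 0/1 outcomes."""
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)

def brier_skill_score(probs, outcomes):
    """1 - Brier/Brier_baseline, where the baseline always predicts the base rate.
    Positive means the model beats the base rate; negative means it loses to it."""
    base_rate = sum(outcomes) / len(outcomes)
    baseline = brier_score([base_rate] * len(outcomes), outcomes)
    return 1.0 - brier_score(probs, outcomes) / baseline

def expected_calibration_error(probs, outcomes, n_bins=10):
    """Bin forecasts by predicted probability; ECE is the size-weighted average
    gap between each bin's mean prediction and its empirical hit rate."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, outcomes):
        bins[min(int(p * n_bins), n_bins - 1)].append((p, y))
    n, ece = len(probs), 0.0
    for bucket in bins:
        if bucket:
            mean_p = sum(p for p, _ in bucket) / len(bucket)
            hit_rate = sum(y for _, y in bucket) / len(bucket)
            ece += (len(bucket) / n) * abs(mean_p - hit_rate)
    return ece
```

Lower is better for Brier and ECE; higher is better for the skill score, which is why a positive BSS (+6.2%) means Foresight is the only model that out-forecasts the base rate itself.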

Probabilities That Match Reality — Unlike GPT-5

Foresight (yellow) hugs the perfect-calibration diagonal — when it says 30%, roughly 30% of events materialize. GPT-5 and the base model are systematically overconfident, drifting well below the line at higher probabilities.

Reliability diagram across deciles showing Foresight (step150) tracking the perfect-calibration diagonal while GPT-5 and the base model deviate significantly below it
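A reliability diagram of this kind can be built by grouping forecasts into probability deciles and plotting each bin's mean prediction against its observed frequency. A brief sketch under the same assumptions as above (illustrative names, not the study's plotting code):

```python
def reliability_curve(probs, outcomes, n_bins=10):
    """Group forecasts into probability deciles and return, per non-empty bin,
    (mean predicted probability, observed frequency, count). A well-calibrated
    model's points lie on the y = x diagonal."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, outcomes):
        bins[min(int(p * n_bins), n_bins - 1)].append((p, y))
    curve = []
    for bucket in bins:
        if bucket:
            mean_p = sum(p for p, _ in bucket) / len(bucket)
            freq = sum(y for _, y in bucket) / len(bucket)
            curve.append((mean_p, freq, len(bucket)))
    return curve
```

Points below the diagonal at high probabilities, as with GPT-5 and the base model here, indicate overconfidence: the model says 80% but the event happens less often than that.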

Read more

Papers, models, datasets, notebooks, and write-ups for this case study.

Ready to build your own expert?

Leverage your own raw data or use public sources. No labeling required.