Economic Forecasting

Learn which narratives and indicators precede macro data surprises

Build forecasters from timestamped economic text and indicators—surveys, central bank communications, market commentary, and hard data as it was known at the time. Future prints and outcomes become labels, so models learn what actually predicted inflation, labor, growth, and policy shifts rather than fitting post-hoc stories.

- 22% lower Brier score than GPT-5 on Fed Beige Book forecasting questions
- ~6× better calibration than GPT-5 (ECE 0.029 vs. 0.188)
- Fewer output tokens than GPT-5, for dramatically cheaper inference at higher accuracy

Example prediction questions

The kinds of questions a model trained on your data can answer.


Key results

Benchmark comparisons against frontier models

Better Accuracy, Skill, and Calibration vs. GPT-5

Trained on Fed Beige Book narratives, Foresight posts a Brier score of 0.155 vs. 0.199 for GPT-5 and 0.211 for the base model — a 22% reduction in error. It is the only model to beat the base rate (Brier Skill Score +6.2% vs. −20.7% for GPT-5 and −27.7% for the base model), and cuts calibration error (ECE) by ~6× vs. GPT-5.
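For readers unfamiliar with the metrics above, a minimal sketch of how Brier score and Brier Skill Score are computed follows. The toy probabilities and outcomes are illustrative only, not the evaluation data:

```python
import numpy as np

def brier_score(probs, outcomes):
    # Mean squared error between predicted probabilities and binary outcomes;
    # lower is better, 0 is perfect.
    probs = np.asarray(probs, dtype=float)
    outcomes = np.asarray(outcomes, dtype=float)
    return float(np.mean((probs - outcomes) ** 2))

def brier_skill_score(probs, outcomes):
    # Improvement over a naive forecaster that always predicts the base rate;
    # positive values beat the base rate, negative values lose to it.
    outcomes = np.asarray(outcomes, dtype=float)
    base_rate = outcomes.mean()
    reference = brier_score(np.full(len(outcomes), base_rate), outcomes)
    return 1.0 - brier_score(probs, outcomes) / reference

# Toy forecasts (hypothetical, for illustration only):
probs = [0.8, 0.2, 0.7, 0.9, 0.3]
outcomes = [1, 0, 1, 1, 0]
print(brier_score(probs, outcomes))
print(brier_skill_score(probs, outcomes))
```

A negative skill score, as reported for GPT-5 and the base model above, means the model's probabilities are less accurate than simply predicting the historical base rate for every question.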

[Figure: three bar charts comparing Foresight (step150), GPT-5, and the base model on Binary Brier (0.155 / 0.199 / 0.211), Brier Skill Score vs. base rate (+6.2% / −20.7% / −27.7%), and Binary ECE (0.029 / 0.188 / 0.189) for Fed Beige Book forecasting.]

Calibration Reliability Diagram

On the reliability diagram, Foresight (yellow) hugs the perfect-calibration diagonal across deciles — when it says 30%, roughly 30% of events materialize. GPT-5 and the base model are systematically overconfident, drifting well below the line at higher predicted probabilities. For macro decisions where the magnitude of a probability drives sizing and hedging, calibrated outputs are the difference between actionable signal and noise.
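The binning behind a reliability diagram and Expected Calibration Error (ECE) can be sketched as follows. This is a generic decile-binning implementation, not the exact evaluation code:

```python
import numpy as np

def reliability_bins(probs, outcomes, n_bins=10):
    # Bucket predictions into equal-width probability bins (deciles by default)
    # and compare mean confidence to observed event frequency in each bin.
    probs = np.asarray(probs, dtype=float)
    outcomes = np.asarray(outcomes, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Interior edges map probabilities in [0, 1] to bin indices 0..n_bins-1.
    idx = np.clip(np.digitize(probs, edges[1:-1]), 0, n_bins - 1)
    rows = []
    for b in range(n_bins):
        mask = idx == b
        if mask.any():
            rows.append((probs[mask].mean(), outcomes[mask].mean(), int(mask.sum())))
    return rows  # (mean confidence, observed frequency, count) per non-empty bin

def ece(probs, outcomes, n_bins=10):
    # Expected Calibration Error: count-weighted mean gap between
    # confidence and frequency across bins; 0 is perfectly calibrated.
    n = len(probs)
    return sum(w / n * abs(conf - freq)
               for conf, freq, w in reliability_bins(probs, outcomes, n_bins))
```

Plotting mean confidence against observed frequency for each bin gives the reliability diagram: a perfectly calibrated model lies on the diagonal, while an overconfident model falls below it at high predicted probabilities.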

[Figure: reliability diagram across deciles showing Foresight (step150) tracking the perfect-calibration diagonal while GPT-5 and the base model deviate significantly below it.]

Explore

Primary write-ups and artifacts for this solution.

Ready to build your own expert?

Leverage your own raw data or use public sources. No labeling required.