Economic Forecasting

Learn which narratives and indicators precede macro data surprises

Build forecasters from timestamped economic text and indicators—surveys, central bank communications, market commentary, and hard data as it was known at the time. Future prints and outcomes become labels, so models learn what actually predicted inflation, labor, growth, and policy shifts rather than fitting post-hoc stories.

- 22% lower Brier score than GPT-5 on Fed Beige Book forecasting questions
- ~6× better calibration than GPT-5 (ECE 0.029 vs. 0.188)
- Fewer output tokens than GPT-5, for dramatically cheaper inference at higher accuracy

Example prediction questions

The kinds of questions a model trained on your data can answer.


Key results

Benchmark comparisons against frontier models

Better Accuracy, Skill, and Calibration vs. GPT-5

Trained on Fed Beige Book narratives, Foresight posts a Brier score of 0.155 vs. 0.199 for GPT-5 and 0.211 for the base model — a 22% reduction in error. It is the only model to beat the base rate (Brier Skill Score +6.2% vs. −20.7% for GPT-5 and −27.7% for the base model), and cuts calibration error (ECE) by ~6× vs. GPT-5.
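For readers unfamiliar with the metrics above, a minimal sketch of how Brier score and Brier Skill Score are computed follows. The toy probabilities and outcomes are illustrative only, not the evaluation data:

```python
import numpy as np

def brier_score(probs, outcomes):
    # Mean squared error between predicted probabilities and binary outcomes;
    # lower is better, 0 is perfect.
    probs = np.asarray(probs, dtype=float)
    outcomes = np.asarray(outcomes, dtype=float)
    return float(np.mean((probs - outcomes) ** 2))

def brier_skill_score(probs, outcomes):
    # Improvement over a naive forecaster that always predicts the base rate;
    # positive values beat the base rate, negative values lose to it.
    outcomes = np.asarray(outcomes, dtype=float)
    base_rate = outcomes.mean()
    reference = brier_score(np.full(len(outcomes), base_rate), outcomes)
    return 1.0 - brier_score(probs, outcomes) / reference

# Toy forecasts (hypothetical, for illustration only):
probs = [0.8, 0.2, 0.7, 0.9, 0.3]
outcomes = [1, 0, 1, 1, 0]
print(brier_score(probs, outcomes))
print(brier_skill_score(probs, outcomes))
```

A negative skill score, as reported for GPT-5 and the base model above, means the model's probabilities are less accurate than simply predicting the historical base rate for every question.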

[Figure: three bar charts comparing Foresight (step150), GPT-5, and the base model on Binary Brier (0.155 / 0.199 / 0.211), Brier Skill Score vs. base rate (+6.2% / −20.7% / −27.7%), and Binary ECE (0.029 / 0.188 / 0.189) for Fed Beige Book forecasting.]

Calibration Reliability Diagram

On the reliability diagram, Foresight (yellow) hugs the perfect-calibration diagonal across deciles — when it says 30%, roughly 30% of events materialize. GPT-5 and the base model are systematically overconfident, drifting well below the line at higher predicted probabilities. For macro decisions where the magnitude of a probability drives sizing and hedging, calibrated outputs are the difference between actionable signal and noise.
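The binning behind a reliability diagram and Expected Calibration Error (ECE) can be sketched as follows. This is a generic decile-binning implementation, not the exact evaluation code:

```python
import numpy as np

def reliability_bins(probs, outcomes, n_bins=10):
    # Bucket predictions into equal-width probability bins (deciles by default)
    # and compare mean confidence to observed event frequency in each bin.
    probs = np.asarray(probs, dtype=float)
    outcomes = np.asarray(outcomes, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Interior edges map probabilities in [0, 1] to bin indices 0..n_bins-1.
    idx = np.clip(np.digitize(probs, edges[1:-1]), 0, n_bins - 1)
    rows = []
    for b in range(n_bins):
        mask = idx == b
        if mask.any():
            rows.append((probs[mask].mean(), outcomes[mask].mean(), int(mask.sum())))
    return rows  # (mean confidence, observed frequency, count) per non-empty bin

def ece(probs, outcomes, n_bins=10):
    # Expected Calibration Error: count-weighted mean gap between
    # confidence and frequency across bins; 0 is perfectly calibrated.
    n = len(probs)
    return sum(w / n * abs(conf - freq)
               for conf, freq, w in reliability_bins(probs, outcomes, n_bins))
```

Plotting mean confidence against observed frequency for each bin gives the reliability diagram: a perfectly calibrated model lies on the diagonal, while an overconfident model falls below it at high predicted probabilities.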

[Figure: reliability diagram across deciles showing Foresight (step150) tracking the perfect-calibration diagonal while GPT-5 and the base model deviate significantly below it.]

Explore

Primary write-ups and artifacts for this solution.

Ready to build your own expert?

Leverage your own raw data or use public sources. No labeling required.