Medical Outcomes

A 32B clinical forecaster that matches GPT-5 on ICU outcomes at a fraction of the size

2.7×
better calibrated than GPT-5 (ECE 0.031 vs. 0.084)
17%
lower Brier score than the clinical base rate

What we did


Example datapoint

A sample training example — question, source, and outcome-derived label.


Results

Benchmark comparisons against frontier models.

ICU Outcome Prediction: Brier and Calibration Error

Our trained 32B model matches GPT-5 on Brier score (0.149 vs. 0.148) while achieving 2.7× better calibration (ECE 0.031 vs. 0.084), and it beats the clinical base rate (0.149 vs. 0.180). gpt-oss-120B trails both models on every metric. These are initial results; more to come as we publish the full case study.

[Figure: two bar charts comparing Foresight (32B), GPT-5, and gpt-oss-120B on Brier score and Expected Calibration Error for MIMIC-III ICU outcome prediction.]
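For readers unfamiliar with the two metrics above, here is a minimal sketch of how Brier score and Expected Calibration Error are computed for binary outcome predictions. This is illustrative only, not our evaluation code; the bin count and the toy inputs are assumptions.

```python
def brier_score(probs, outcomes):
    """Mean squared error between predicted probability and 0/1 outcome."""
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)

def expected_calibration_error(probs, outcomes, n_bins=10):
    """Weighted gap between mean confidence and observed event rate per bin."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, outcomes):
        idx = min(int(p * n_bins), n_bins - 1)  # assign prediction to a bin
        bins[idx].append((p, y))
    ece = 0.0
    for b in bins:
        if not b:
            continue
        mean_p = sum(p for p, _ in b) / len(b)   # average confidence in bin
        mean_y = sum(y for _, y in b) / len(b)   # observed frequency in bin
        ece += (len(b) / len(probs)) * abs(mean_p - mean_y)
    return ece

# Toy example (not MIMIC data): four predictions against binary outcomes
probs = [0.1, 0.1, 0.9, 0.9]
outcomes = [0, 0, 1, 1]
print(round(brier_score(probs, outcomes), 3))                  # 0.01
print(round(expected_calibration_error(probs, outcomes), 3))   # 0.1
```

Lower is better for both: Brier rewards accurate probabilities overall, while ECE isolates whether stated confidences match observed frequencies, which is why a model can match another on Brier yet be far better calibrated.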

Read more

Papers, models, datasets, notebooks, and write-ups for this case study.

Ready to build your own expert?

Leverage your own raw data or use public sources. No labeling required.