Medical Outcomes

A 32B clinical forecaster that matches GPT-5 on ICU outcomes at a fraction of the size

2.7×
better calibrated than GPT-5 (ECE 0.031 vs. 0.084)
17%
lower Brier score than the clinical base rate

What we did


Example datapoint

A sample training example — question, source, and outcome-derived label.


Results

Benchmark comparisons against frontier models.

ICU Outcome Prediction: Brier and Calibration Error

Our trained 32B model matches GPT-5 on Brier score (0.149 vs. 0.148) while achieving 2.7× better calibration (ECE 0.031 vs. 0.084), and it beats the clinical base rate (0.149 vs. 0.180). gpt-oss-120B trails both models on every metric. These are initial results; more to come as we publish the full case study.

[Figure: two bar charts comparing Foresight (32B), GPT-5, and gpt-oss-120B on Brier score and Expected Calibration Error for MIMIC-III ICU outcome prediction.]
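For readers unfamiliar with the two metrics above, here is a minimal sketch of how Brier score and Expected Calibration Error are computed for binary outcome predictions. This is illustrative only, not our evaluation code; the bin count and the toy inputs are assumptions.

```python
def brier_score(probs, outcomes):
    """Mean squared error between predicted probability and 0/1 outcome."""
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)

def expected_calibration_error(probs, outcomes, n_bins=10):
    """Weighted gap between mean confidence and observed event rate per bin."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, outcomes):
        idx = min(int(p * n_bins), n_bins - 1)  # assign prediction to a bin
        bins[idx].append((p, y))
    ece = 0.0
    for b in bins:
        if not b:
            continue
        mean_p = sum(p for p, _ in b) / len(b)   # average confidence in bin
        mean_y = sum(y for _, y in b) / len(b)   # observed frequency in bin
        ece += (len(b) / len(probs)) * abs(mean_p - mean_y)
    return ece

# Toy example (not MIMIC data): four predictions against binary outcomes
probs = [0.1, 0.1, 0.9, 0.9]
outcomes = [0, 0, 1, 1]
print(round(brier_score(probs, outcomes), 3))                  # 0.01
print(round(expected_calibration_error(probs, outcomes), 3))   # 0.1
```

Lower is better for both: Brier rewards accurate probabilities overall, while ECE isolates whether stated confidences match observed frequencies, which is why a model can match another on Brier yet be far better calibrated.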

Read more

Papers, models, datasets, notebooks, and write-ups for this case study.

Ready to build your own expert?

Leverage your own raw data or use public sources. No labeling required.