A 32B clinical forecaster that matches GPT-5 on ICU outcomes at a fraction of the size
A sample training example: question, source document, and outcome-derived label.
Benchmark comparisons against frontier models.
Our trained 32B model matches GPT-5 on Brier score (0.149 vs. 0.148) while achieving 2.7× better calibration (ECE 0.031 vs. 0.084), and it beats the clinical base rate (Brier 0.149 vs. 0.180). gpt-oss-120B trails both models on every metric. These are initial results; more to follow as we publish the full case study.
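For readers unfamiliar with the two metrics quoted above, here is a minimal sketch of how they are typically computed for binary outcomes. This is an illustration, not the case study's evaluation code; the bin count (10 equal-width confidence bins for ECE) is an assumption, as the exact binning used is not specified here.

```python
import numpy as np

def brier_score(probs, outcomes):
    """Mean squared error between predicted probabilities and 0/1 outcomes.
    Lower is better; a perfectly calibrated, perfectly sharp forecaster scores 0."""
    probs = np.asarray(probs, dtype=float)
    outcomes = np.asarray(outcomes, dtype=float)
    return float(np.mean((probs - outcomes) ** 2))

def expected_calibration_error(probs, outcomes, n_bins=10):
    """ECE: bin predictions by confidence, then average the gap between
    observed outcome frequency and mean predicted probability per bin,
    weighted by the fraction of predictions in that bin.
    n_bins=10 is an assumed default, not the case study's setting."""
    probs = np.asarray(probs, dtype=float)
    outcomes = np.asarray(outcomes, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        # Last bin is closed on the right so p = 1.0 is included.
        mask = (probs >= lo) & ((probs < hi) if hi < 1.0 else (probs <= hi))
        if mask.any():
            gap = abs(outcomes[mask].mean() - probs[mask].mean())
            ece += mask.mean() * gap
    return float(ece)
```

Brier score rewards both calibration and sharpness in one number, while ECE isolates calibration alone, which is why a model can roughly tie on Brier yet differ substantially on ECE.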
Papers, models, datasets, notebooks, and write-ups for this case study.
Bring your own raw data or use public sources. No manual labeling required.