Predict which 10-K risk factors actually precede enforcement, restatements, and drawdowns
A sample training example — question, source, and outcome-derived label.
Benchmark comparisons against frontier models.
Fine-tuned Qwen3-32B achieves a Brier Skill Score of +11.6% with an ECE of 0.029 across 6,109 SEC risk queries, a 64.7% lower calibration error than GPT-5 (ECE 0.081). The model learns to distinguish boilerplate legal language from the meaningful signals that precede adverse outcomes.
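The two headline metrics can be sketched as follows. This is a minimal illustration, not the evaluation code used in the case study: the binning scheme (10 equal-width bins) and the reference forecast for the skill score are assumptions, since the write-up does not specify them.

```python
import numpy as np

def brier_skill_score(p, y, p_ref):
    """Brier Skill Score: relative improvement of forecast p over a
    reference forecast p_ref (e.g. the base rate). Positive = better."""
    bs = np.mean((p - y) ** 2)        # Brier score of the model
    bs_ref = np.mean((p_ref - y) ** 2)  # Brier score of the reference
    return 1.0 - bs / bs_ref

def expected_calibration_error(p, y, n_bins=10):
    """ECE: gap between predicted probability and observed frequency,
    averaged over equal-width probability bins, weighted by bin size."""
    bins = np.clip((p * n_bins).astype(int), 0, n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = bins == b
        if mask.any():
            # |mean confidence - empirical accuracy| in this bin
            ece += mask.mean() * abs(p[mask].mean() - y[mask].mean())
    return ece

# Toy check: a perfectly calibrated forecaster has ECE 0 and a
# positive skill score against a constant 0.5 reference.
p = np.array([0.1] * 10 + [0.9] * 10)
y = np.array([0] * 9 + [1] + [1] * 9 + [0])
print(expected_calibration_error(p, y))                  # 0.0
print(brier_skill_score(p, y, np.full_like(p, 0.5)))     # 0.64
```

An ECE of 0.029 means that, averaged over bins, the model's stated probabilities deviate from observed outcome frequencies by under 3 percentage points.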
Papers, models, datasets, notebooks, and write-ups for this case study.
Leverage your own raw data or use public sources. No manual labeling is required; labels are derived from observed outcomes.