Continual Calibration: Coverage Can Collapse Before Accuracy in Lifelong LLM Fine-Tuning
Ibne Farabi Shihab, Sanjeda Akter, Anuj Sharma

TL;DR
In lifelong fine-tuning of large language models, conformal coverage can degrade earlier and more sharply than accuracy. The paper introduces a calibration replay method to maintain coverage.
Contribution
It demonstrates that coverage can collapse before accuracy drops, and it proposes a lightweight calibration replay technique to preserve coverage during continual learning.
Findings
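In the classification-style settings studied, coverage loss exceeds accuracy loss by a factor of roughly 3.4 (± 0.5) on average across seeds.
In the most pronounced case, coverage drops from 0.92 to 0.61 while accuracy remains within three points of baseline.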
Calibration replay restores coverage within two points of the nominal level.
Abstract
Continual learning for large language models is typically evaluated through accuracy retention under sequential fine-tuning. We argue that this perspective is incomplete, because uncertainty reliability can degrade earlier and more sharply than top-1 performance. We study this empirically by measuring conformal coverage and calibration error on sequentially fine-tuned models across three model families and eight task sequences drawn primarily from classification and multiple-choice benchmarks. Across the classification-style settings we study, coverage loss exceeds accuracy loss by a factor of roughly \(3.4\times \pm 0.5\times\) on average across seeds; in the most pronounced case, coverage drops from \(0.92\) to \(0.61\), while accuracy remains within three points of baseline. Standard continual-learning methods that preserve accuracy do not automatically preserve coverage, and naive…
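The abstract's core measurement, conformal coverage alongside calibration error, can be illustrated with a small toy simulation. The sketch below is not the paper's experimental setup: it assumes split conformal prediction with the nonconformity score 1 - p(true label), a nominal level of 0.9, and a hypothetical "drift" in which the model's probabilities become sharper (logits scaled by 3) while the true label distribution stays fixed. All function names and the synthetic data are assumptions made for illustration.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)  # subtract row max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def sample_labels(probs, rng):
    # Draw one label per row from the categorical distribution in that row.
    u = rng.random((probs.shape[0], 1))
    return (probs.cumsum(axis=1) > u).argmax(axis=1)

def conformal_threshold(probs, labels, alpha=0.1):
    # Split-conformal threshold on scores 1 - p(true label),
    # using the finite-sample rank ceil((n + 1) * (1 - alpha)).
    n = len(labels)
    scores = np.sort(1.0 - probs[np.arange(n), labels])
    k = int(np.ceil((n + 1) * (1 - alpha)))
    return scores[k - 1] if k <= n else np.inf

def empirical_coverage(probs, labels, qhat):
    # Fraction of points whose true label falls inside the prediction set.
    true_scores = 1.0 - probs[np.arange(len(labels)), labels]
    return float(np.mean(true_scores <= qhat))

def accuracy(probs, labels):
    return float(np.mean(probs.argmax(axis=1) == labels))

def expected_calibration_error(probs, labels, n_bins=10):
    # Standard binned ECE on top-1 confidence.
    conf = probs.max(axis=1)
    correct = (probs.argmax(axis=1) == labels).astype(float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (conf > lo) & (conf <= hi)
        if mask.any():
            ece += mask.mean() * abs(correct[mask].mean() - conf[mask].mean())
    return float(ece)

rng = np.random.default_rng(0)
n, n_classes = 2000, 5

# Calibration split: labels are drawn from the model's own probabilities,
# so the "before" model is calibrated by construction.
cal_probs = softmax(rng.normal(size=(n, n_classes)))
cal_labels = sample_labels(cal_probs, rng)
qhat = conformal_threshold(cal_probs, cal_labels, alpha=0.1)

# Test split: same label distribution, evaluated under two models.
test_logits = rng.normal(size=(n, n_classes))
probs_before = softmax(test_logits)
test_labels = sample_labels(probs_before, rng)
probs_after = softmax(3.0 * test_logits)  # hypothetical drift: sharper outputs

cov_before = empirical_coverage(probs_before, test_labels, qhat)
cov_after = empirical_coverage(probs_after, test_labels, qhat)
acc_before = accuracy(probs_before, test_labels)
acc_after = accuracy(probs_after, test_labels)
ece_before = expected_calibration_error(probs_before, test_labels)
ece_after = expected_calibration_error(probs_after, test_labels)

print(f"coverage: {cov_before:.3f} -> {cov_after:.3f}")
print(f"accuracy: {acc_before:.3f} -> {acc_after:.3f}")
print(f"ECE:      {ece_before:.3f} -> {ece_after:.3f}")
```

In this toy setting the argmax prediction is unchanged by the sharpening, so accuracy is identical before and after, while a threshold calibrated on the earlier model no longer delivers its nominal coverage. That is the qualitative pattern the abstract describes, reproduced here only under the stated assumptions.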
