CapTrack: Multifaceted Evaluation of Forgetting in LLM Post-Training

Lukas Thede; Stefan Winzeck; Zeynep Akata; Jonathan Richard Schwarz

arXiv:2603.06610 · cs.LG · March 10, 2026

TL;DR

CapTrack introduces a capability-centric framework to analyze and understand the diverse aspects of forgetting in large language models after post-training, revealing that forgetting affects robustness and behavior beyond just factual knowledge.

Contribution

The paper presents CapTrack, a framework that combines a behavioral taxonomy with an evaluation suite to analyze forgetting in LLMs across models and post-training methods.

Findings

01

Forgetting impacts robustness and default behaviors beyond factual knowledge.

02

Instruction fine-tuning causes significant drift, while preference optimization is more conservative.

03

No universal method effectively mitigates forgetting across all models.

Abstract

Large language model (LLM) post-training enhances latent skills, unlocks value alignment, improves performance, and enables domain adaptation. Unfortunately, post-training is known to induce forgetting, especially in the ubiquitous use-case of leveraging third-party pre-trained models, which is typically understood as a loss of parametric or factual knowledge. We argue that this accuracy-centric view is insufficient for modern foundation models and instead define forgetting as systematic model drift that degrades behavior and user experience. In this context, we introduce CapTrack, a capability-centric framework for analyzing forgetting in LLMs that combines a behavioral taxonomy with an evaluation suite built on established benchmarks and targeted adaptations. Using CapTrack, we conduct a large-scale empirical study across post-training algorithms, domains, and model families,…

Code & Models

Datasets

tri-fair-lab/captrack
dataset · 86 downloads

Taxonomy

Topics: Domain Adaptation and Few-Shot Learning · Topic Modeling · Artificial Intelligence in Healthcare and Education