Reasoning Stabilization Point: A Training-Time Signal for Stable Evidence and Shortcut Reliance
Sahil Rajesh Dhayalkar

TL;DR
This paper introduces the Reasoning Stabilization Point (RSP), a training-time signal based on explanation drift that identifies when a model's reliance on evidence stabilizes during fine-tuning, aiding in understanding and monitoring model behavior.
Contribution
The paper proposes RSP, a metric derived from explanation drift, to detect when evidence reliance stabilizes during fine-tuning without requiring out-of-distribution data, and evaluates it across multiple lightweight transformer classifiers and benchmark classification tasks.
Findings
Explanation drift typically collapses early in training to a low, stable regime.
RSP can be computed from within-run dynamics without OOD tuning.
Attribution dynamics reveal shortcut reliance even when accuracy is high.
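One way to make the last finding concrete is to track how much attribution mass lands on the trigger tokens at each epoch. The sketch below is an illustration only: the function name `trigger_attribution_share`, the boolean `trigger_mask` input, and the choice of absolute-value mass are all assumptions, not the paper's stated measure.

```python
import numpy as np

def trigger_attribution_share(attr, trigger_mask):
    """Average fraction of absolute attribution mass on trigger tokens.

    attr:         array of shape (examples, tokens), raw token attributions.
    trigger_mask: same shape, True where a token is a (hypothetical) trigger.
    """
    mag = np.abs(np.asarray(attr, dtype=float))
    mask = np.asarray(trigger_mask, dtype=bool)
    total = mag.sum(axis=-1)
    on_trigger = (mag * mask).sum(axis=-1)
    # Examples with zero total attribution contribute a share of 0.
    share = np.divide(on_trigger, total, out=np.zeros_like(total), where=total > 0)
    return float(share.mean())
```

Computed once per epoch on a fixed probe set, a share that keeps rising while validation accuracy stays high would be consistent with growing shortcut reliance.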
Abstract
Fine-tuning pretrained language models can improve task performance while subtly altering the evidence a model relies on. We propose a training-time interpretability view that tracks token-level attributions across fine-tuning epochs. We define explanation drift as the epoch-to-epoch change in normalized token attributions on a fixed probe set, and introduce the Reasoning Stabilization Point (RSP), the earliest epoch after which drift remains consistently low. RSP is computed from within-run drift dynamics and requires no tuning on out-of-distribution data. Across multiple lightweight transformer classifiers and benchmark classification tasks, drift typically collapses into a low, stable regime early in training, while validation accuracy continues to change only marginally. In a controlled shortcut setting with label-correlated trigger tokens, attribution dynamics expose increasing…
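The definitions of explanation drift and RSP above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the paper's implementation: drift is taken as the mean L1 distance between normalized absolute attributions at consecutive epochs, and "consistently low" is taken to mean staying at or below a fraction (`rel_threshold`, a hypothetical parameter) of the run's own peak drift. All function names are illustrative.

```python
import numpy as np

def normalize(attr):
    """Scale absolute token attributions so each example sums to 1."""
    mag = np.abs(np.asarray(attr, dtype=float))
    total = mag.sum(axis=-1, keepdims=True)
    return mag / np.where(total == 0, 1.0, total)

def explanation_drift(attr_by_epoch):
    """Mean L1 change in normalized attributions between consecutive epochs.

    attr_by_epoch: shape (epochs, examples, tokens) on a fixed probe set.
    Returns shape (epochs - 1,); entry t compares epoch t with epoch t + 1.
    """
    norm = normalize(attr_by_epoch)
    return np.abs(np.diff(norm, axis=0)).sum(axis=-1).mean(axis=-1)

def reasoning_stabilization_point(drift, rel_threshold=0.1):
    """Earliest index t such that every drift[t:] is at or below the threshold.

    The threshold is rel_threshold * max(drift), an assumed within-run rule.
    Returns None if drift never settles.
    """
    drift = np.asarray(drift, dtype=float)
    if drift.size == 0:
        return None
    below = drift <= rel_threshold * drift.max()
    # stable[t] is True only if below[t], below[t + 1], ... are all True.
    stable = np.logical_and.accumulate(below[::-1])[::-1]
    idx = np.flatnonzero(stable)
    return int(idx[0]) if idx.size else None
```

Because the threshold is derived from the same run's drift curve, this variant needs no out-of-distribution data, in line with the abstract; the attribution method that produces `attr_by_epoch` is left open here.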
Taxonomy
Topics: Explainable Artificial Intelligence (XAI) · Adversarial Robustness in Machine Learning · Domain Adaptation and Few-Shot Learning
