Reasoning Stabilization Point: A Training-Time Signal for Stable Evidence and Shortcut Reliance

Sahil Rajesh Dhayalkar

arXiv:2601.11625·cs.AI·January 21, 2026

Open Access

TL;DR

This paper introduces the Reasoning Stabilization Point (RSP), a training-time signal based on explanation drift that identifies when a model's reliance on evidence stabilizes during fine-tuning, giving a way to monitor what the model relies on as training proceeds.

Contribution

The paper proposes RSP, a metric derived from explanation drift that detects stable evidence reliance during fine-tuning without requiring out-of-distribution data, and evaluates it across multiple lightweight transformer classifiers and benchmark classification tasks.

Findings

01

Explanation drift collapses early in training to a low, stable regime.

02

RSP can be computed from within-run dynamics without OOD tuning.

03

Attribution dynamics reveal shortcut reliance even when accuracy is high.

Abstract

Fine-tuning pretrained language models can improve task performance while subtly altering the evidence a model relies on. We propose a training-time interpretability view that tracks token-level attributions across fine-tuning epochs. We define explanation drift as the epoch-to-epoch change in normalized token attributions on a fixed probe set, and introduce the Reasoning Stabilization Point (RSP), the earliest epoch after which drift remains consistently low. RSP is computed from within-run drift dynamics and requires no tuning on out-of-distribution data. Across multiple lightweight transformer classifiers and benchmark classification tasks, drift typically collapses into a low, stable regime early in training, while validation accuracy continues to change only marginally. In a controlled shortcut setting with label-correlated trigger tokens, attribution dynamics expose increasing…
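The two definitions in the abstract can be sketched in a few lines of Python. This is an illustrative sketch only, not the paper's implementation: the abstract does not state the normalization, the distance used for drift, or the rule for "consistently low", so the absolute-value sum-to-one normalization, the mean L1 distance, and the fixed `threshold` are all assumptions, and every function name is hypothetical.

```python
from typing import List, Optional, Sequence

# One list of raw token-attribution scores per probe example.
Attributions = List[List[float]]


def normalize(scores: Sequence[float]) -> List[float]:
    """Scale absolute token scores to sum to 1 (assumed normalization)."""
    mags = [abs(s) for s in scores]
    total = sum(mags)
    if total == 0.0:
        return [0.0] * len(mags)
    return [m / total for m in mags]


def explanation_drift(prev: Attributions, curr: Attributions) -> float:
    """Mean L1 distance between normalized attributions of two consecutive
    epochs on the same fixed probe set (the distance is an assumption)."""
    dists = []
    for a, b in zip(prev, curr):
        na, nb = normalize(a), normalize(b)
        dists.append(sum(abs(x - y) for x, y in zip(na, nb)))
    return sum(dists) / len(dists)


def drift_curve(per_epoch: Sequence[Attributions]) -> List[float]:
    """drifts[k] is the drift between epoch k and epoch k + 1 (0-indexed)."""
    return [
        explanation_drift(per_epoch[k], per_epoch[k + 1])
        for k in range(len(per_epoch) - 1)
    ]


def reasoning_stabilization_point(
    drifts: Sequence[float], threshold: float
) -> Optional[int]:
    """Earliest epoch after which every observed drift stays <= threshold.

    Returns the 0-indexed epoch k + 1 for the smallest k such that all of
    drifts[k:] are at or below the threshold, or None if the run never
    settles. The fixed threshold is an assumed stand-in for whatever
    "consistently low" criterion the paper uses.
    """
    rsp = None
    # Walk backwards so a late spike invalidates earlier low-drift epochs.
    for k in range(len(drifts) - 1, -1, -1):
        if drifts[k] <= threshold:
            rsp = k + 1
        else:
            break
    return rsp
```

Scanning from the last epoch backwards means a late spike in drift resets the candidate, which matches the "remains consistently low" wording: an early dip followed by renewed drift does not count as stabilization. For example, `reasoning_stabilization_point([0.8, 0.5, 0.05, 0.04, 0.03], threshold=0.1)` selects epoch 3 under these assumptions.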

Taxonomy

Topics: Explainable Artificial Intelligence (XAI) · Adversarial Robustness in Machine Learning · Domain Adaptation and Few-Shot Learning