Monitoring the Internal Monologue: Probe Trajectories Reveal Reasoning Dynamics
Maciej Chrabąszcz, Aleksander Szymczyk, Marcin Sendera, Tomasz Trzciński, Sebastian Cygert

TL;DR
This paper introduces probe trajectories as a new method to monitor reasoning dynamics in large reasoning models, improving the prediction of future behavior through temporal analysis of hidden representations.
Contribution
It presents a novel approach using probe trajectories and signal-processing features to better understand and predict model behavior during reasoning tasks.
Findings
Future model behavior is more distinguishable from the full probe trajectory than from a single static prediction.
Max-pooling yields up to 95% AUROC in outcome prediction.
Template-based data achieves results similar to those obtained with dynamic model responses.
Abstract
Large Reasoning Models (LRMs) introduce new opportunities for safety monitoring through their Chain of Thought (CoT) reasoning. However, CoT is not always faithful to the model's final output, undermining its reliability as a monitoring tool. To address this, we investigate the hidden representations of LRMs to determine whether future behavior can be predicted from prompt and CoT representations. By evaluating a probe at each generated token, we construct a probe trajectory, the continuous evolution of a concept's probability across the reasoning process. We find that future model behavior is more distinguishable when examined over the full trajectory than from a single static prediction. To characterize these temporal dynamics, we extract signal-processing features that capture volatility, trend, and steady-state behavior, significantly improving the separation of future model states.…
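To make the idea of a probe trajectory concrete, the following is a minimal sketch, not the authors' implementation. It assumes a linear logistic probe (hypothetical weights `w` and bias `b`) applied to per-token hidden states, and it uses simple stand-ins for the feature families the abstract names: the standard deviation of first differences for volatility, a least-squares slope for trend, and the mean over the final 20% of tokens for steady-state behavior. The function names, the `tail_frac` parameter, and the synthetic data are all illustrative assumptions.

```python
import numpy as np

def probe_trajectory(hidden_states, w, b):
    """Evaluate a linear probe at every token position.

    hidden_states: array of shape (T, d), one row per generated token.
    w, b: probe weights (d,) and bias (assumed already trained).
    Returns an array of shape (T,) with the probe probability per token.
    """
    logits = hidden_states @ w + b
    return 1.0 / (1.0 + np.exp(-logits))

def trajectory_features(p, tail_frac=0.2):
    """Summarize a trajectory with simple illustrative features."""
    p = np.asarray(p, dtype=float)
    steps = np.arange(len(p))
    # Trend: slope of a least-squares line through the trajectory.
    trend = np.polyfit(steps, p, 1)[0] if len(p) > 1 else 0.0
    # Volatility: spread of token-to-token changes.
    volatility = np.std(np.diff(p)) if len(p) > 1 else 0.0
    # Steady state: mean over the last tail_frac of the tokens.
    tail_len = max(1, int(round(tail_frac * len(p))))
    return {
        "max": float(p.max()),        # max-pooling over the trajectory
        "last": float(p[-1]),         # single static prediction at the end
        "trend": float(trend),
        "volatility": float(volatility),
        "steady_state": float(p[-tail_len:].mean()),
    }

# Synthetic stand-in for real hidden states (50 tokens, 16 dimensions).
rng = np.random.default_rng(0)
hidden = rng.normal(size=(50, 16))
w = rng.normal(size=16)
b = 0.0

traj = probe_trajectory(hidden, w, b)
feats = trajectory_features(traj)
```

In this sketch, `feats["last"]` plays the role of a single static prediction, while the remaining entries summarize the whole trajectory and could be fed to a downstream classifier.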
