Interpreting and Controlling LLM Reasoning through Integrated Policy Gradient
Changming Li, Kaixing Zhang, Haoyun Xu, Yingdong Shi, Zheng Zhang, Kaitao Song, Kan Ren

TL;DR
This paper introduces Integrated Policy Gradient (IPG), a novel method for interpreting and controlling LLM reasoning by attributing reasoning behaviors to internal components through outcome-based signals, improving localization and modulation of reasoning processes.
Contribution
The paper presents IPG, a new framework that enhances interpretability and control of LLM reasoning by propagating outcome signals backward through model trajectories.
Findings
IPG achieves more precise localization of reasoning components.
IPG enables reliable modulation of reasoning capabilities.
Empirical results show improved interpretability and control across models.
Abstract
Large language models (LLMs) demonstrate strong reasoning abilities in solving complex real-world problems. Yet, the internal mechanisms driving these complex reasoning behaviors remain opaque. Existing interpretability approaches targeting reasoning either identify components (e.g., neurons) correlated with special textual patterns, or rely on human-annotated contrastive pairs to derive control vectors. Consequently, current methods struggle to precisely localize complex reasoning mechanisms or capture sequential influence from model internal workings to the reasoning outputs. In this paper, built on outcome-oriented and sequential-influence-aware principles, we focus on identifying components that have sequential contribution to reasoning behavior where outcomes are cumulated by long-range effects. We propose Integrated Policy Gradient (IPG), a novel framework that attributes…
Peer Reviews
Decision · Submitted to ICLR 2026
The use of the policy gradient / score function trick to estimate gradient-based attribution is very clever. It prompted me to think about how PG could be applied more generally to interpretability. I think this is both original and potentially significant. The writing is generally easy to follow. The motivation to move beyond correlation-based methods is justified, and the main claims seem supported.
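The score function trick the reviewer refers to can be illustrated on a toy problem. The sketch below is not the paper's IPG procedure; it is a generic check of the identity grad E[R] = E[R * grad log p] for a softmax policy over three actions. The logits, rewards, and function names are all hypothetical values chosen for illustration.

```python
import math
import random

def softmax(logits):
    # Numerically stable softmax over a list of floats.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def exact_gradient(logits, rewards):
    # Closed form for a softmax policy: dE[R]/dlogit_j = p_j * (R_j - E[R]).
    p = softmax(logits)
    expected = sum(pi * ri for pi, ri in zip(p, rewards))
    return [pi * (ri - expected) for pi, ri in zip(p, rewards)]

def score_function_gradient(logits, rewards, num_samples, seed=0):
    # Monte Carlo estimate: average of R(a) * d log p(a) / d logit_j,
    # where d log p(a) / d logit_j = 1[j == a] - p_j for a softmax policy.
    rng = random.Random(seed)
    p = softmax(logits)
    grad = [0.0] * len(logits)
    for _ in range(num_samples):
        a = rng.choices(range(len(logits)), weights=p)[0]
        for j in range(len(logits)):
            indicator = 1.0 if j == a else 0.0
            grad[j] += rewards[a] * (indicator - p[j])
    return [g / num_samples for g in grad]

# Hypothetical toy setup: only action 0 yields a reward (an "outcome" signal).
logits = [0.5, -0.2, 0.1]
rewards = [1.0, 0.0, 0.0]
exact = exact_gradient(logits, rewards)
estimate = score_function_gradient(logits, rewards, num_samples=20000)
```

With enough samples the estimate approaches the closed-form gradient, even though the reward is only observed after sampling and is never differentiated directly. That property is what makes outcome-based signals usable for gradient-style attribution.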
It may be due to my lack of expertise in interpretability, but there are some problem setup and design choices I do not fully understand; I discuss them in the questions below. It would also be nice to see results beyond math and reasoning performance, for example steering for harmfulness. Formatting issue: at L246-247, a figure blocks the main text (likely due to negative vspace).
See summary
1. Addresses an important goal: locating reasoning features inside LLMs and offering intervention methods rather than purely black-box fine-tuning.
2. Builds on long-horizon rewards rather than purely prediction loss, which is underexplored in hidden-state attribution.
3. The empirical gains shown (some improvement when scaling selected neurons) at least suggest there is signal in the method.
1. IPG’s experiments manipulate hidden states (the scaling in Eq. (4)), which is more akin to an optimization/steering method than to a genuine causal attribution. Results resemble those from EM-PG (arXiv:2504.18587), which focuses on optimizing reasoning behaviours rather than tracing hidden “circuits”.
2. The selection statistic \(S_i\) in Eq. (3) aggregates across samples without variance control or normalization. Considering the off-policy nature of PG in RPG, how is sample bias handled? The more ri
Taxonomy
Topics: Multimodal Machine Learning Applications · Topic Modeling · Explainable Artificial Intelligence (XAI)
