AtManRL: Towards Faithful Reasoning via Differentiable Attention Saliency
Max Henning Höth, Kristian Kersting, Björn Deiseroth, Letitia Parcalabescu

TL;DR
AtManRL introduces a reinforcement learning method that uses differentiable attention saliency to improve the faithfulness and interpretability of reasoning in large language models.
Contribution
It presents a novel attention manipulation technique combined with a saliency reward to enhance reasoning transparency in LLMs.
Findings
The method identifies influential reasoning tokens in LLM outputs.
AtManRL improves the faithfulness of reasoning traces on GSM8K and MMLU.
The approach enables training more transparent and interpretable reasoning models.
Abstract
Large language models (LLMs) increasingly rely on chain-of-thought (CoT) reasoning to solve complex tasks. Yet ensuring that the reasoning trace both contributes to and faithfully reflects the processes underlying the model's final answer, rather than merely accompanying it, remains challenging. We introduce AtManRL, a method that leverages differentiable attention manipulation to learn more faithful reasoning through reinforcement learning. By training an additive attention mask that identifies tokens in the CoT crucial for producing correct answers, we derive a saliency reward signal that encourages the model to generate reasoning traces that genuinely influence its final predictions. We integrate this saliency reward with outcome-based rewards within the GRPO framework to jointly optimize for correctness and interpretability. Experiments on GSM8K and MMLU with Llama-3.2-3B-Instruct…
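The abstract names two ingredients: an additive attention mask over CoT tokens, and a saliency reward combined with an outcome reward under GRPO. The sketch below is a minimal plain-Python illustration of how these pieces could fit together, not the authors' implementation. It assumes the mask is added to pre-softmax attention scores, that the two rewards are mixed with a hypothetical `weight` coefficient, and that advantages use the standard GRPO group normalization. All function names and example numbers are invented for illustration.

```python
import math
from statistics import mean, pstdev


def masked_attention(scores, mask):
    """Softmax over attention scores after adding a per-token mask.

    Assumption: the mask lives in logit space, so a large negative entry
    suppresses a token and 0.0 leaves it unchanged.
    """
    shifted = [s + m for s, m in zip(scores, mask)]
    top = max(shifted)  # subtract the max for numerical stability
    exps = [math.exp(x - top) for x in shifted]
    total = sum(exps)
    return [e / total for e in exps]


def combined_rewards(outcome, saliency, weight=0.5):
    """Mix outcome and saliency rewards per rollout.

    `weight` is a hypothetical mixing coefficient, not a value from the paper.
    """
    return [o + weight * s for o, s in zip(outcome, saliency)]


def group_advantages(rewards, eps=1e-6):
    """GRPO-style advantages: normalize rewards within one sampled group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Three CoT tokens with equal scores; the mask suppresses the middle one.
weights = masked_attention([1.0, 1.0, 1.0], [0.0, -1e9, 0.0])

# Four rollouts for one prompt: correctness (0/1) and an invented saliency score.
outcome = [1.0, 1.0, 0.0, 0.0]
saliency = [0.8, 0.2, 0.6, 0.1]
advantages = group_advantages(combined_rewards(outcome, saliency))
```

In this toy setup, two rollouts that are both correct receive different advantages when their saliency scores differ, which is the sense in which the combined reward optimizes jointly for correctness and for reasoning that influences the answer.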
