AtManRL: Towards Faithful Reasoning via Differentiable Attention Saliency

Max Henning Höth; Kristian Kersting; Björn Deiseroth; Letitia Parcalabescu

arXiv:2604.16158·cs.CL·April 20, 2026

TL;DR

AtManRL is a reinforcement learning method that uses differentiable attention saliency to make the reasoning traces of large language models more faithful and interpretable.

Contribution

The paper combines a differentiable attention manipulation technique with a saliency reward to make LLM reasoning more transparent.

Findings

1. The method identifies influential reasoning tokens in LLM outputs.

2. AtManRL improves the faithfulness of reasoning traces on GSM8K and MMLU.

3. The approach enables training more transparent and interpretable reasoning models.

Abstract

Large language models (LLMs) increasingly rely on chain-of-thought (CoT) reasoning to solve complex tasks. Yet ensuring that the reasoning trace both contributes to and faithfully reflects the processes underlying the model's final answer, rather than merely accompanying it, remains challenging. We introduce AtManRL, a method that leverages differentiable attention manipulation to learn more faithful reasoning through reinforcement learning. By training an additive attention mask that identifies tokens in the CoT crucial for producing correct answers, we derive a saliency reward signal that encourages the model to generate reasoning traces that genuinely influence its final predictions. We integrate this saliency reward with outcome-based rewards within the GRPO framework to jointly optimize for correctness and interpretability. Experiments on GSM8K and MMLU with Llama-3.2-3B-Instruct…
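
The two ingredients named in the abstract, an additive attention mask and a saliency reward combined with an outcome reward under GRPO, can be illustrated with a minimal sketch. This is not the authors' implementation. The function names, the `weight` parameter, and the linear mixing of the two rewards are assumptions made for illustration, and the saliency score is taken as a given input because the abstract does not specify how it is computed. The group-relative advantage follows the standard GRPO normalization (reward minus group mean, divided by group standard deviation).

```python
import math
from statistics import mean, pstdev


def masked_attention_weights(logits, mask):
    """Softmax over attention logits after adding a per-token additive mask.

    A strongly negative mask entry suppresses attention to that token;
    a zero entry leaves the token untouched.
    """
    shifted = [l + m for l, m in zip(logits, mask)]
    peak = max(shifted)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in shifted]
    total = sum(exps)
    return [e / total for e in exps]


def combined_reward(outcome, saliency, weight=0.5):
    """Hypothetical mixing rule: outcome reward plus a weighted saliency reward.

    The abstract only says the two rewards are integrated; the linear form
    and the default weight are assumptions.
    """
    return outcome + weight * saliency


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: standardize rewards within one sampled group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Example: suppress the second token, then score a group of four completions.
weights = masked_attention_weights([1.0, 2.0, 0.5], [0.0, -1e9, 0.0])
rewards = [
    combined_reward(outcome, saliency)
    for outcome, saliency in [(1.0, 0.8), (1.0, 0.1), (0.0, 0.6), (0.0, 0.0)]
]
advantages = group_relative_advantages(rewards)
```

In this toy group, two completions reach the correct answer, but the one whose reasoning carries a higher saliency score receives the larger advantage, which is the kind of pressure toward influential reasoning traces that the abstract describes.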
