Toward Honest Language Models for Deductive Reasoning
Jiarui Liu, Kaustubh Dhole, Yingheng Wang, Haoyang Wen, Sarah Zhang, Haitao Mao, Gaotang Li, Neeraj Varshney, Jingguo Liu, Xiaoman Pan

TL;DR
This paper investigates how to make language models reason honestly by abstaining when conclusions are not entailed, proposing a reinforcement learning method that improves their ability to do so on graph-based deductive tasks.
Contribution
It introduces ACNCHOR, a reinforcement learning approach that stabilizes training and enhances honest deductive reasoning in language models, addressing limitations of existing methods.
Findings
Prompting and existing training methods, including GRPO with or without supervised fine-tuning initialization, struggle with honest deductive reasoning.
ACNCHOR stabilizes training and improves reasoning performance.
Ground truth trajectories prevent early training collapse.
Abstract
Deductive reasoning is the process of deriving conclusions strictly from the given premises, without relying on external knowledge. We define honesty in this setting as a model's ability to respond only when the conclusion is logically entailed by the premises, and to abstain otherwise. However, current language models often fail to reason honestly, producing unwarranted answers when the input is insufficient. To study this challenge, we formulate honest deductive reasoning as multi-step tasks where models must either derive the correct conclusion or abstain. We curate two datasets from graph structures, one for linear algebra and one for logical inference, and introduce unanswerable cases by randomly perturbing an edge in half of the instances. We find that prompting and existing training methods, including GRPO with or without supervised fine-tuning initialization, struggle on these…
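The dataset construction described in the abstract can be illustrated with a minimal sketch. It is not the authors' pipeline. It assumes a toy implication chain as the graph, treats "perturbing an edge" as deleting one edge, and uses reachability as a stand-in for entailment. All function names (`make_chain_instance`, `perturb_one_edge`, `is_entailed`, `build_dataset`, `is_honest`) are hypothetical.

```python
import random
from collections import deque


def make_chain_instance(num_nodes):
    """Hypothetical premise graph: a chain 0 -> 1 -> ... -> num_nodes - 1."""
    edges = [(i, i + 1) for i in range(num_nodes - 1)]
    return {"edges": edges, "source": 0, "target": num_nodes - 1}


def is_entailed(edges, source, target):
    """Treat the conclusion as entailed iff target is reachable from source (BFS)."""
    adjacency = {}
    for u, v in edges:
        adjacency.setdefault(u, []).append(v)
    seen = {source}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            return True
        for nxt in adjacency.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False


def perturb_one_edge(instance, rng):
    """Assumption: 'perturbing' means deleting one randomly chosen edge."""
    edges = list(instance["edges"])
    edges.pop(rng.randrange(len(edges)))
    return {**instance, "edges": edges}


def build_dataset(num_instances, num_nodes, seed=0):
    """Perturb every second instance, so half of the dataset is perturbed."""
    rng = random.Random(seed)
    data = []
    for i in range(num_instances):
        inst = make_chain_instance(num_nodes)
        if i % 2 == 1:
            inst = perturb_one_edge(inst, rng)
        entailed = is_entailed(inst["edges"], inst["source"], inst["target"])
        inst["label"] = "answer" if entailed else "abstain"
        data.append(inst)
    return data


def is_honest(model_abstained, instance):
    """Honest behavior: abstain exactly when the conclusion is not entailed."""
    return model_abstained == (instance["label"] == "abstain")
```

In this toy chain every deleted edge breaks the only path, so each perturbed instance becomes unanswerable. On richer graphs a perturbation may leave the conclusion derivable, which is why the sketch recomputes the label from the graph instead of assuming it.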
Taxonomy
Topics: Explainable Artificial Intelligence (XAI) · Topic Modeling · Ethics and Social Impacts of AI
