Harnessing Reasoning Trajectories for Hallucination Detection via Answer-agreement Representation Shaping

Jianxiong Zhang; Bing Guo; Yuming Jiang; Haobo Wang; Bo An; Sean Du

arXiv:2601.17467·cs.LG·May 6, 2026

TL;DR

This paper introduces ARS, a method that improves hallucination detection in large reasoning models by shaping trace representations through answer agreement and counterfactual perturbations.

Contribution

ARS is a novel approach that learns detection-friendly representations by explicitly encoding answer stability without requiring human annotations.

Findings

01 ARS improves hallucination detection accuracy.

02 ARS outperforms strong baseline detectors.

03 ARS requires no human annotations during training.

Abstract

Large reasoning models (LRMs) often generate long, seemingly coherent reasoning traces yet still produce incorrect answers, making hallucination detection challenging. Although trajectories contain useful signals, directly using trace text or vanilla hidden states for detection is brittle: traces vary in form and detectors can overfit to superficial patterns rather than answer validity. We introduce Answer-agreement Representation Shaping (ARS), which learns detection-friendly trace-conditioned representations by explicitly encoding answer stability. ARS generates counterfactual answers through small latent interventions, specifically, perturbing the trace-boundary embedding, and labels each perturbation by whether the resulting answer agrees with the original. It then learns representations that bring answer-agreeing states together and separate answer-disagreeing ones, exposing latent…
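The pipeline described in the abstract (perturb the trace-boundary embedding, label each perturbation by answer agreement, then pull agreeing states together and push disagreeing ones apart) can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the Gaussian noise model, the InfoNCE-style loss, and every function name here (`perturb_boundary`, `agreement_labels`, `agreement_contrastive_loss`, `toy_decode`) are hypothetical. In particular, `toy_decode` stands in for the reasoning model decoding an answer from a perturbed embedding, and the loss is computed on raw embeddings rather than on a learned projection.

```python
import numpy as np


def perturb_boundary(h, num_samples, sigma, rng):
    """Draw noisy copies of a trace-boundary embedding h of shape (d,).

    Isotropic Gaussian noise is an assumption; the abstract only
    mentions "small latent interventions".
    """
    return h + sigma * rng.standard_normal((num_samples, h.shape[0]))


def agreement_labels(original_answer, perturbed_answers):
    """True where a perturbed answer matches the original answer."""
    return np.array([a == original_answer for a in perturbed_answers], dtype=bool)


def agreement_contrastive_loss(anchor, states, agree, temperature=0.1):
    """InfoNCE-style loss (an assumed choice of objective).

    Answer-agreeing states act as positives for the anchor and
    answer-disagreeing states act as negatives.
    """
    def unit(x):
        return x / np.linalg.norm(x, axis=-1, keepdims=True)

    # Cosine similarity of every perturbed state to the anchor, shape (n,).
    sims = unit(states) @ unit(anchor) / temperature
    if agree.all() or not agree.any():
        return 0.0  # no contrast available without both label classes
    peak = sims.max()  # subtract the max for a stable log-sum-exp
    log_norm = peak + np.log(np.exp(sims - peak).sum())
    return float(-(sims[agree] - log_norm).mean())


def toy_decode(h):
    """Hypothetical stand-in for decoding an answer from an embedding."""
    return "A" if h[0] >= 0 else "B"


rng = np.random.default_rng(0)
h0 = np.array([0.2, 1.0, -0.5])  # made-up boundary embedding
perturbed = perturb_boundary(h0, num_samples=8, sigma=0.5, rng=rng)
labels = agreement_labels(toy_decode(h0), [toy_decode(h) for h in perturbed])
loss = agreement_contrastive_loss(h0, perturbed, labels)
print(labels, loss)
```

In a real setup the loss would be minimized over the parameters of a representation head applied to the hidden states; the sketch only shows how agreement labels could drive a contrastive objective without human annotations.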

Code & Models

Repositories

radiolab-ntu/ars_icml2026 (GitHub)
