Where Does Reasoning Break? Step-Level Hallucination Detection via Hidden-State Transport Geometry
Tyler Alvarez, Ali Baheri

TL;DR
This paper detects hallucinations in language models by analyzing the hidden-state trajectory produced during reasoning, using geometric transition features and contrastive PCA to localize the first error at the step level.
Contribution
It proposes a transport-geometry method for step-level hallucination detection: a label-conditioned teacher that scores each step through a trace-specific contrastive PCA projection, and a BiLSTM student distilled from the teacher that operates on raw hidden states without inference-time labels.
Findings
Contrastive PCA is proved to be the optimal projection for separating first errors from correct states.
The teacher model outperforms baselines across multiple datasets and models.
The student model transfers well within the same distribution but struggles under shift.
Abstract
Large language models hallucinate during multi-step reasoning, but most existing detectors operate at the trace level: they assign one confidence score to a full output, fail to localize the first error, and often require multiple sampled completions. We frame hallucination instead as a property of the hidden-state trajectory produced during a single forward pass. Correct reasoning moves through a stable manifold of locally coherent transitions; a first error appears as a localized excursion in transport cost away from this manifold. We operationalize this view with a label-conditioned teacher that builds a trace-specific contrastive PCA lens and scores each step with seven geometric transition features, and a deployable BiLSTM student distilled from the teacher that operates on raw hidden states without inference-time labels. We prove that contrastive PCA is the optimal projection for…
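The projection-and-scoring idea in the abstract can be illustrated with a minimal sketch. It assumes the standard contrastive PCA formulation (top eigenvectors of the foreground covariance minus `alpha` times the background covariance) and uses a single illustrative step cost, the Euclidean length of each move in the projected space. The function names `contrastive_pca` and `step_costs`, the `alpha` and `k` parameters, and the synthetic data are all hypothetical; the paper's trace-specific lens and its seven geometric features are not specified in this excerpt and may differ.

```python
import numpy as np

def contrastive_pca(foreground, background, alpha=1.0, k=2):
    """Top-k directions of C_fg - alpha * C_bg (standard cPCA, assumed here)."""
    fg = foreground - foreground.mean(axis=0)
    bg = background - background.mean(axis=0)
    c_fg = fg.T @ fg / max(len(fg) - 1, 1)
    c_bg = bg.T @ bg / max(len(bg) - 1, 1)
    # The contrast matrix is symmetric, so eigh applies; eigenvalues are ascending.
    eigvals, eigvecs = np.linalg.eigh(c_fg - alpha * c_bg)
    order = np.argsort(eigvals)[::-1][:k]
    return eigvecs[:, order]

def step_costs(hidden_states, basis):
    """Length of each step-to-step move after projection (illustrative feature)."""
    z = hidden_states @ basis
    return np.linalg.norm(np.diff(z, axis=0), axis=1)

# Synthetic stand-ins for hidden states; no real model is involved.
rng = np.random.default_rng(0)
d = 16
background = rng.normal(size=(200, d))   # states from correct steps
foreground = rng.normal(size=(200, d))   # states from erroneous steps
foreground[:, 3] *= 5.0                  # extra variance along coordinate 3

basis = contrastive_pca(foreground, background, alpha=1.0, k=2)

trace = rng.normal(size=(8, d))          # one 8-step reasoning trace
trace[5:, 3] += 20.0                     # synthetic excursion starting at step 5
costs = step_costs(trace, basis)
first_error_step = int(np.argmax(costs)) + 1  # transition i maps to step i + 1
```

In this toy setup the largest cost falls on the transition into step 5, where the excursion was injected. Taking the arg-max of a single cost is only a stand-in for the teacher's scoring, which the abstract describes as combining seven transition features.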
