HALT: Hallucination Assessment via Log-probs as Time series

Ahmad Shapiro; Karan Taneja; Ashok Goel

arXiv:2602.02888·cs.CL·February 4, 2026

Open Access · 3 Reviews

TL;DR

HALT is a lightweight hallucination detector for large language models that treats the top-20 token log-probabilities of a generation as a time series. It is reported to be both faster and more accurate than larger encoder-based detectors across diverse tasks.

Contribution

The paper introduces HALT, a novel hallucination detector based on log-probabilities, and HUB, a comprehensive benchmark for evaluating hallucination detection methods.

Findings

01 · HALT outperforms fine-tuned BERT-based detectors in accuracy.

02 · HALT achieves 60x faster inference.

03 · HALT generalizes well across multiple LLM capabilities.

Abstract

Hallucinations remain a major obstacle for large language models (LLMs), especially in safety-critical domains. We present HALT (Hallucination Assessment via Log-probs as Time series), a lightweight hallucination detector that leverages only the top-20 token log-probabilities from LLM generations as a time series. HALT uses a gated recurrent unit model combined with entropy-based features to learn model calibration bias, providing an extremely efficient alternative to large encoders. Unlike white-box approaches, HALT does not require access to hidden states or attention maps, relying only on output log-probabilities. Unlike black-box approaches, it operates on log-probs rather than surface-form text, which enables stronger domain generalization and compatibility with proprietary LLMs without requiring access to internal weights. To benchmark performance, we introduce HUB (Hallucination…
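The abstract's core idea, treating per-token top-k log-probabilities as a time series and feeding them with entropy-based features to a gated recurrent unit, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names (`entropy_features`, `init_params`, `gru_score`), the hidden size, the single entropy feature, and the random untrained weights are all assumptions made for illustration, and the paper's actual feature set and architecture may differ.

```python
import numpy as np

def entropy_features(top_logprobs):
    """Build per-step features from top-K log-probs.

    top_logprobs: array of shape (T, K), one row per generated token.
    Returns shape (T, K + 1): the raw log-probs plus the entropy of the
    distribution renormalized over the top-K entries (an assumption, since
    the probability mass outside the top-K is unknown).
    Assumes all log-probs are finite.
    """
    lp = np.asarray(top_logprobs, dtype=np.float64)
    # Renormalize so each row sums to 1 in probability space.
    lp_norm = lp - np.logaddexp.reduce(lp, axis=1, keepdims=True)
    p = np.exp(lp_norm)
    entropy = -(p * lp_norm).sum(axis=1, keepdims=True)
    return np.concatenate([lp, entropy], axis=1)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def init_params(input_dim, hidden_dim, seed=0):
    """Random, untrained GRU weights (hypothetical; for illustration only)."""
    rng = np.random.default_rng(seed)
    params = {}
    for gate in "zrh":  # update gate, reset gate, candidate state
        params["W" + gate] = rng.normal(scale=0.1, size=(hidden_dim, input_dim))
        params["U" + gate] = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))
        params["b" + gate] = np.zeros(hidden_dim)
    params["w_out"] = rng.normal(scale=0.1, size=hidden_dim)
    params["b_out"] = 0.0
    return params

def gru_score(features, params):
    """Run a single-layer GRU over the sequence and return a score in (0, 1)."""
    h = np.zeros(params["Uz"].shape[0])
    for x in features:
        z = sigmoid(params["Wz"] @ x + params["Uz"] @ h + params["bz"])
        r = sigmoid(params["Wr"] @ x + params["Ur"] @ h + params["br"])
        h_tilde = np.tanh(params["Wh"] @ x + params["Uh"] @ (r * h) + params["bh"])
        h = (1.0 - z) * h + z * h_tilde
    # Logistic read-out of the final hidden state.
    return sigmoid(params["w_out"] @ h + params["b_out"])

if __name__ == "__main__":
    # Synthetic stand-in for the top-20 log-probs of a 6-token generation.
    rng = np.random.default_rng(1)
    probs = np.sort(rng.dirichlet(np.ones(20), size=6), axis=1)[:, ::-1]
    feats = entropy_features(np.log(probs))
    score = gru_score(feats, init_params(feats.shape[1], hidden_dim=8))
    print(feats.shape, float(score))
```

With trained weights, the output would be read as the estimated probability that the generation is hallucinated; with the random weights used here it only demonstrates the data flow from log-probs to a sequence-level score.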

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 6 · Confidence 3

Strengths

1. Recasting log-prob sequences as a time-series classification problem is novel.
2. It requires only token-level log-probs, making it lightweight, and the GRU has only 5M parameters.
3. HUB unifies diverse datasets and tasks, integrating logical hallucinations alongside factual ones.

Weaknesses

1. While HALT performs well, it’s unclear which temporal features (e.g., entropy spikes, rank shifts) most influence decisions. The method could benefit from feature attribution or from visualizing log-probability trajectories to explain predictions.
2. HALT does not transfer well across models (in both Hypothesis 3 and results). This is a practical limitation if detectors must be retrained per LLM.
3. The paper claims the method is attractive for API-based deployments. However, most APIs do not

Reviewer 02 · Rating 2 · Confidence 4

Strengths

- The proposed method targets log-probabilities to detect hallucinations, an interesting middle ground between the generated text and model internals.

Weaknesses

## Major
- The experimental design is flawed
  - The definition of hallucinations that the authors attempt to address is too broad. For instance, the authors define incorrect reasoning traces as hallucinations. In other words, the authors define any incorrect model output as a hallucination. This definition makes it difficult to isolate and understand why these hallucinations happen and why the proposed method is the appropriate solution to them.
  - In turn, the

Reviewer 03 · Rating 6 · Confidence 3

Strengths

1. The approach is practical and lightweight, relying only on token log-probabilities that are available in most APIs.
2. The time-series perspective is conceptually new compared with previous aggregation-based or text-only black-box detectors.
3. Experiments are comprehensive, covering ten capability clusters and multiple public benchmarks.
4. HALT models are compact (around 5M parameters) but outperform larger span-based or encoder-based detectors in most clusters.

Weaknesses

1. Notation inconsistency: in lines 285–289, the notation for $\ell_t$ is inconsistent. Each term should represent a single log-probability value instead of a vector, and the paper should use one clear format throughout and include a short notation reference to help readers follow the equations.
2. Insufficient details about HALT-L and HALT-Q training and evaluation: It would help to have one concise subsection that summarizes what data each model uses, how teacher-forcing is implemented, and

Taxonomy

Topics: Topic Modeling · Adversarial Robustness in Machine Learning · Advanced Graph Neural Networks