Attention Head Entropy of LLMs Predicts Answer Correctness
Sophie Ostmeier, Brian Axelrod, Maya Varma, Asad Aali, Yabin Zhang, Magdalini Paschali, Sanmi Koyejo, Curtis Langlotz, Akshay Chaudhari

TL;DR
This paper introduces Head Entropy, a method that predicts answer correctness in LLMs by analyzing attention entropy patterns, demonstrating improved accuracy and generalization across domains and models.
Contribution
The paper presents Head Entropy, a novel white-box approach that uses attention entropy to predict answer correctness and generalizes well out-of-domain.
Findings
Head Entropy matches or exceeds baselines in-distribution.
It outperforms the closest baseline by +8.5% AUROC on average out-of-domain.
Attention patterns over questions alone predict correctness with +17.7% AUROC.
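For readers unfamiliar with the metric behind these numbers, the following is a minimal sketch of how AUROC can be computed for a correctness predictor. It uses the standard pairwise definition (the probability that a randomly chosen correct answer receives a higher score than a randomly chosen incorrect one). The function name `auroc` and the toy scores are illustrative assumptions and do not come from the paper.

```python
def auroc(scores, labels):
    """Pairwise AUROC: fraction of (correct, incorrect) pairs ranked correctly.

    scores: predicted correctness scores (higher = more likely correct).
    labels: 1 for a correct answer, 0 for an incorrect one.
    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy example (hypothetical scores, not results from the paper).
toy_scores = [0.9, 0.8, 0.3, 0.1]
toy_labels = [1, 0, 1, 0]
print(auroc(toy_scores, toy_labels))
```

An AUROC of 0.5 corresponds to chance-level ranking and 1.0 to perfect separation of correct from incorrect answers.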
Abstract
Large language models (LLMs) often generate plausible yet incorrect answers, posing risks in safety-critical settings such as medicine. Human evaluation is expensive, and LLM-as-judge approaches risk introducing hidden errors. Recent white-box methods detect contextual hallucinations using model internals, focusing on the localization of the attention mass, but two questions remain open: do these approaches extend to predicting answer correctness, and do they generalize out-of-domain? We introduce Head Entropy, a method that predicts answer correctness from attention entropy patterns, specifically measuring the spread of the attention mass. Using sparse logistic regression on per-head 2-Rényi entropies, Head Entropy matches or exceeds baselines in-distribution and generalizes substantially better out-of-domain, where it outperforms the closest baseline by +8.5% AUROC on average. We…
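As a rough illustration of the feature the abstract describes, the sketch below computes a 2-Rényi (collision) entropy, H_2(p) = -log(sum_i p_i^2), for each attention row and averages it per head. The nested-list layout `attn[layer][head][query]`, the averaging over query positions, and the function names are assumptions made for this example; the abstract does not specify how the paper aggregates entropies. The resulting per-head features would then be fed to a sparse (for example L1-regularized) logistic regression, which is omitted here.

```python
import math


def renyi2_entropy(p):
    """2-Renyi (collision) entropy H_2(p) = -log(sum_i p_i^2).

    p is one attention row: non-negative weights that sum to 1.
    H_2 is 0 for a one-hot row and log(n) for a uniform row over n keys,
    so larger values mean the attention mass is more spread out.
    """
    return -math.log(sum(x * x for x in p))


def head_entropy_features(attn):
    """Return one feature per (layer, head), in layer-major order.

    attn[layer][head][query] is an attention row over keys (assumed layout).
    Each feature is the mean H_2 over that head's query rows; this averaging
    is an assumption of the sketch, not a detail confirmed by the source.
    """
    feats = []
    for layer in attn:
        for head in layer:
            rows = [renyi2_entropy(row) for row in head]
            feats.append(sum(rows) / len(rows))
    return feats


# Toy input: 1 layer, 2 heads, 2 query positions, 2 keys (hypothetical values).
toy_attn = [[
    [[1.0, 0.0], [0.5, 0.5]],  # head 0: one peaked row, one uniform row
    [[0.5, 0.5], [0.5, 0.5]],  # head 1: uniform rows only
]]
features = head_entropy_features(toy_attn)
print(features)
```

In this toy case the second head spreads its attention more evenly and therefore gets the larger entropy feature.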
Peer Reviews
Decision: ICLR 2026 Conference Desk Rejected Submission
- The idea is quite simple, applicable to all Transformer-based LLMs.
- The writing is good with clear definitions and terminology.
- About entropy calculation
  - In Sec. 4.1, I cannot see how to process the attention entropy of different layers and different sections, and thus, I'm curious about the specific shape of $H$ in Eq. 8.
- About generalizability and applicability:
  - This paper focuses only on closed-ended factual questions with ground truth answers, and thus, we can train a binary logistic regressor. It suggests that this method cannot generalize to open-ended QA like chatting, and thus, is not as general as
1. Strong Empirical Results: The method consistently outperforms baselines by meaningful margins across diverse QA tasks (TriviaQA, HotpotQA, MedMCQA) and multiple model families (Qwen, Llama). The 0.07-0.15 AUROC improvements are substantial.
2. Practical Efficiency: The approach adds minimal computational overhead (negligible compared to LLM inference) while requiring only a single forward pass. This makes it genuinely deployable in real systems.
3. Mechanistic Interpretability: Using Shaple
The evaluation scope is a significant limitation. The paper only evaluates on three QA datasets focusing on factual retrieval, which limits the generalizability of findings. More concerning, the experiments are restricted to instruction-tuned models, leaving unclear whether the approach works equally well for base models or other architectures like mixture-of-experts or retrieval-augmented models. The medical domain (MedMCQA) performance is notably weaker, showing only 0.05 AUROC improvement com
1. The paper is clearly written and easy to follow. The proposed method is conceptually straightforward and well-presented.
2. The authors evaluate their approach on five instruction-tuned LLMs and three QA datasets. Results show that head entropy consistently outperforms baseline uncertainty metrics and generalizes well across different model families and sizes, demonstrating both robustness and applicability.
1. The proposed entropy measure appears highly dependent on the specific query content. Even though entropies are averaged over all tokens within an answer, the resulting entropy-based correctness estimates should still be regarded as query-conditional rather than global indicators of confidence. The model's attention behavior, and therefore its entropy, varies strongly with input semantics, which could limit generalization across queries or domains. I would appreciate it if the authors provide any evidenc
- The paper is well written.
- The idea is simple and effective.
- There are good ablation studies and, more importantly, OOD experiments.
- The experiments are comprehensive, and the results are promising.
- The most important weakness of the paper is its novelty. There is already a popular work which uses attention maps to extract features and train a classifier model: https://arxiv.org/pdf/2407.07071. The only difference between this work and the other work is how to extract the feature.
- The data scaling/low data experiments are missing. I would like to see how the performance changes with less or more data.
- More insights about why this idea works could be helpful. For instance, why is entr
Taxonomy
Topics: Topic Modeling · Artificial Intelligence in Healthcare and Education · Multimodal Machine Learning Applications
