Embedding Trust: Semantic Isotropy Predicts Nonfactuality in Long-Form Text Generation
Dhrupad Bhardwaj, Julia Kempe, Tim G. J. Rudner

TL;DR
This paper introduces semantic isotropy as a simple, efficient metric to evaluate the trustworthiness of long-form responses from large language models, effectively predicting nonfactual content without labeled data.
Contribution
The authors propose a novel, label-free method based on embedding isotropy to assess factual accuracy in LLM outputs, outperforming existing approaches.
Findings
Higher semantic isotropy correlates with lower factual consistency.
The method requires no fine-tuning or hyperparameter tuning.
It outperforms existing approaches in predicting nonfactual responses.
Abstract
To deploy large language models (LLMs) in high-stakes application domains that require substantively accurate responses to open-ended prompts, we need reliable, computationally inexpensive methods that assess the trustworthiness of long-form responses generated by LLMs. However, existing approaches often rely on claim-by-claim fact-checking, which is computationally expensive and brittle in long-form responses to open-ended prompts. In this work, we introduce semantic isotropy -- the degree of uniformity across normalized text embeddings on the unit sphere -- and use it to assess the trustworthiness of long-form responses generated by LLMs. To do so, we generate several long-form responses, embed them, and estimate the level of semantic isotropy of these responses as the angular dispersion of the embeddings on the unit sphere. We find that higher semantic isotropy -- that is, greater…
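The pipeline described in the abstract (sample several responses, embed them, measure how spread out the embeddings are on the unit sphere) can be illustrated with a minimal sketch. The dispersion measure used here, the mean pairwise angle between unit-normalized embeddings, is an assumption chosen for simplicity and is not necessarily the estimator the authors use. The function names and the toy 3-dimensional vectors are hypothetical; in practice the vectors would come from a text embedding model.

```python
import math
from itertools import combinations


def normalize(vector):
    """Project a vector onto the unit sphere."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vector]


def angular_dispersion(embeddings):
    """Mean pairwise angle (in radians) between unit-normalized embeddings.

    Assumption: this is one simple stand-in for "angular dispersion";
    the paper's own isotropy estimator may differ.
    """
    units = [normalize(v) for v in embeddings]
    if len(units) < 2:
        raise ValueError("need at least two embeddings")
    angles = []
    for a, b in combinations(units, 2):
        cosine = sum(x * y for x, y in zip(a, b))
        # Clamp to [-1, 1] to guard against floating-point drift before acos.
        cosine = max(-1.0, min(1.0, cosine))
        angles.append(math.acos(cosine))
    return sum(angles) / len(angles)


# Hypothetical toy embeddings of sampled responses to the same prompt.
consistent = [[1.0, 0.1, 0.0], [1.0, 0.0, 0.1], [0.9, 0.1, 0.1]]
scattered = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

low = angular_dispersion(consistent)
high = angular_dispersion(scattered)
```

Under this stand-in measure, responses that agree with one another yield a small score and mutually orthogonal responses yield a score of π/2. Following the reported finding, a larger score would be read as a signal of lower factual consistency.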
Peer Reviews
Decision·Submitted to ICLR 2026
1. The proposed approach achieves SOTA performance on factuality checking. 2. The proposed approach is lightweight and computationally efficient. 3. The approach is robust to the choice of embedding model.
1. This work is developed for long-form text generation. However, the experiments focus on response lengths of up to 1000 words, which does not seem long enough given that many LLMs can generate up to 100K tokens. 2. The benchmarking datasets are limited to TriviaQA and FS-BIO. More datasets should be included in the experiments to establish the generalizability of the proposed approach.
1. I think that the problem is well-motivated and important to solve. I agree that existing approaches are not sufficient. 2. The experiments are fairly comprehensive, with multiple models and baselines. 3. The work contributes a dataset for long-form answer factuality checking.
1. My main critique of the work is about the hypotheses. Specifically, 1. The work assumes that "certainty" and "factuality" are the same and uses them interchangeably. However, even if an LLM is certain of a fact and regenerates similar text when resampled, it may not be factually correct. Clarifying the definition of factuality assumed by the work would be useful here. 2. The semantic isotropy score depends on the quality of the embeddings. While the authors have an extensive ablation
- The authors identify an important problem: We need fact-checking systems more than ever, and existing claim-by-claim verifiers are extremely computationally costly, given that the models that do well across varied domains tend to be exorbitantly expensive when run over entire documents. - The approach does not require fine-tuning or any training data, and can be used with closed-weight models. - The approach is evaluated across multiple domains. - The evaluation results of this system acros
- How does the approach handle explicit disinformation, where the incorrect claim under consideration is intentionally hidden? Claim-by-claim verifiers are generally able to detect these, but this seems to not be the use case of this method. More detail regarding when this approach should be used over claim-by-claim verification and potential pitfalls would be helpful. - The authors claim that "[existing claim verification systems] struggle with open-ended, multi-sentence answers where relevant
