Embedding Trust: Semantic Isotropy Predicts Nonfactuality in Long-Form Text Generation

Dhrupad Bhardwaj; Julia Kempe; Tim G. J. Rudner

arXiv:2510.21891·cs.CL·October 28, 2025

TL;DR

This paper introduces semantic isotropy as a simple, efficient metric to evaluate the trustworthiness of long-form responses from large language models, effectively predicting nonfactual content without labeled data.

Contribution

The authors propose a novel, label-free method based on embedding isotropy to assess factual accuracy in LLM outputs, outperforming existing approaches.

Findings

01

Higher semantic isotropy correlates with lower factual consistency.

02

The method requires no fine-tuning or hyperparameter tuning.

03

It outperforms existing approaches in predicting nonfactual responses.

Abstract

To deploy large language models (LLMs) in high-stakes application domains that require substantively accurate responses to open-ended prompts, we need reliable, computationally inexpensive methods that assess the trustworthiness of long-form responses generated by LLMs. However, existing approaches often rely on claim-by-claim fact-checking, which is computationally expensive and brittle in long-form responses to open-ended prompts. In this work, we introduce semantic isotropy -- the degree of uniformity across normalized text embeddings on the unit sphere -- and use it to assess the trustworthiness of long-form responses generated by LLMs. To do so, we generate several long-form responses, embed them, and estimate the level of semantic isotropy of these responses as the angular dispersion of the embeddings on the unit sphere. We find that higher semantic isotropy -- that is, greater…
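The pipeline described in the abstract (sample several responses, embed them, measure how spread out the normalized embeddings are on the unit sphere) can be illustrated with a rough sketch. The abstract does not give the paper's exact estimator, so the score below is an assumed stand-in: the mean pairwise angle between unit-normalized embeddings. The function name `angular_dispersion` and the toy vectors are hypothetical, and in practice the rows would come from a text embedding model applied to the sampled responses.

```python
import numpy as np


def angular_dispersion(embeddings):
    """Mean pairwise angle (in radians) between unit-normalized embeddings.

    Illustrative proxy only: this is NOT claimed to be the paper's
    semantic isotropy estimator, just one simple dispersion measure.
    """
    X = np.asarray(embeddings, dtype=float)
    if X.ndim != 2 or X.shape[0] < 2:
        raise ValueError("need a 2-D array with at least two embeddings")

    norms = np.linalg.norm(X, axis=1, keepdims=True)
    if np.any(norms == 0):
        raise ValueError("zero-length embedding cannot be normalized")

    # Project every embedding onto the unit sphere.
    U = X / norms

    # Cosine similarities; clip to guard arccos against rounding error.
    cos = np.clip(U @ U.T, -1.0, 1.0)

    # Average the angle over distinct pairs (upper triangle, no diagonal).
    rows, cols = np.triu_indices(U.shape[0], k=1)
    return float(np.mean(np.arccos(cos[rows, cols])))


# Toy stand-ins for embeddings of sampled responses to one prompt.
consistent = [[1.0, 0.0, 0.0], [1.0, 0.0, 0.0], [1.0, 0.0, 0.0]]
scattered = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

print(angular_dispersion(consistent))
print(angular_dispersion(scattered))
```

Under this proxy, responses that all point in the same direction score near zero, while mutually orthogonal responses score π/2. Per the findings above, the larger value would be read as a signal of lower factual consistency.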

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 6 · Confidence 4

Strengths

1. The proposed approach achieves SOTA performance on factuality checking. 2. The proposed approach is lightweight and computationally efficient. 3. The approach is robust to the choice of embedding model.

Weaknesses

1. This work is developed for long-form text generation. However, the experiments focus on response lengths of up to 1000 words, which does not seem long enough considering that many LLMs can generate up to 100K tokens. 2. The benchmarking datasets are limited to TriviaQA and FS-BIO. More datasets should be included in the experiments to establish the generalizability of the proposed approach.

Reviewer 02 · Rating 4 · Confidence 3

Strengths

1. I think that the problem is well-motivated and important to solve. I agree that existing approaches are not sufficient. 2. The experiments are pretty comprehensive, with multiple models and baselines. 3. The work contributes a dataset for long-form answer factuality check.

Weaknesses

1. My main critique of the work is about the hypotheses. Specifically, 1. The work assumes that "certainty" and "factuality" are the same and uses them interchangeably. However, even if an LLM is certain of a fact and regenerates similar text when resampled, it may not be factually correct. Clarifying the definition of factuality assumed by the work would be useful here. 2. The semantic isotropy score depends on the quality of the embeddings. While the authors have an extensive ablation

Reviewer 03 · Rating 6 · Confidence 3

Strengths

- The authors identify an important problem: We need fact-checking systems more than ever, and existing claim-by-claim verifiers are extremely computationally costly, given that the models that do well across varied domains tend to be exorbitantly expensive when run over entire documents. - The approach does not require fine-tuning or any training data, and can be used with closed-weight models. - The approach is evaluated across multiple domains. - The evaluation results of this system acros

Weaknesses

- How does the approach handle explicit disinformation, where the incorrect claim under consideration is intentionally hidden? Claim-by-claim verifiers are generally able to detect these, but this seems to not be the use case of this method. More detail regarding when this approach should be used over claim-by-claim verification and potential pitfalls would be helpful. - The authors claim that "[existing claim verification systems] struggle with open-ended, multi-sentence answers where relevant

Code & Models

Datasets

dhrupadb/SegmentScore
dataset· 7 dl
