Improving Uncertainty Estimation through Semantically Diverse Language Generation
Lukas Aichberger, Kajetan Schweighofer, Mykyta Ielanskyi, Sepp Hochreiter

TL;DR
This paper introduces SDLG, a method that guides large language models to generate semantically diverse alternatives, improving the detection of hallucinations and uncertainty estimation in text generation tasks.
Contribution
The paper presents a novel semantically diverse language generation technique that enhances uncertainty quantification and hallucination detection in large language models.
Findings
SDLG outperforms existing uncertainty estimation methods.
SDLG is computationally efficient.
SDLG improves hallucination detection in question-answering tasks.
Abstract
Large language models (LLMs) can suffer from hallucinations when generating text. These hallucinations impede various applications in society and industry by making LLMs untrustworthy. Current LLMs generate text in an autoregressive fashion by predicting and appending text tokens. When an LLM is uncertain about the semantic meaning of the next tokens to generate, it is likely to start hallucinating. Thus, it has been suggested that predictive uncertainty is one of the main causes of hallucinations. We introduce Semantically Diverse Language Generation (SDLG) to quantify predictive uncertainty in LLMs. SDLG steers the LLM to generate semantically diverse yet likely alternatives for an initially generated text. This approach provides a precise measure of aleatoric semantic uncertainty, detecting whether the initial text is likely to be hallucinated. Experiments on question-answering tasks…
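As a rough illustration of what "semantic uncertainty" means here, the sketch below groups a handful of generated answers by meaning and computes the entropy over the resulting groups. It is not the SDLG estimator: the abstract does not give the paper's formula, so this uses a plain normalized-likelihood entropy over semantic clusters in the spirit of the clustering-based approaches the reviews mention. The names `cluster_by_meaning`, `semantic_entropy`, and `toy_equivalent` are hypothetical, and `toy_equivalent` is only a string-matching stand-in for an NLI model.

```python
import math
from typing import Callable, List, Sequence


def cluster_by_meaning(
    texts: Sequence[str], equivalent: Callable[[str, str], bool]
) -> List[List[int]]:
    """Greedily group text indices whose meaning matches a cluster's first member."""
    clusters: List[List[int]] = []
    for idx, text in enumerate(texts):
        for cluster in clusters:
            if equivalent(texts[cluster[0]], text):
                cluster.append(idx)
                break
        else:
            # No existing cluster matched, so this text starts a new one.
            clusters.append([idx])
    return clusters


def semantic_entropy(
    log_likelihoods: Sequence[float], clusters: Sequence[Sequence[int]]
) -> float:
    """Entropy over clusters, with likelihoods normalized across the given texts."""
    max_ll = max(log_likelihoods)
    # Subtract the maximum before exponentiating for numerical stability.
    weights = [math.exp(ll - max_ll) for ll in log_likelihoods]
    total = sum(weights)
    probs = [sum(weights[i] for i in cluster) / total for cluster in clusters]
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def toy_equivalent(a: str, b: str) -> bool:
    """Assumption: stand-in for an NLI equivalence check (case and final period ignored)."""
    return a.strip().lower().rstrip(".") == b.strip().lower().rstrip(".")


if __name__ == "__main__":
    answers = ["Paris", "paris.", "Lyon"]
    log_liks = [math.log(0.5), math.log(0.25), math.log(0.25)]
    groups = cluster_by_meaning(answers, toy_equivalent)
    print(groups, semantic_entropy(log_liks, groups))
```

An entropy near zero means the sampled answers agree in meaning; a larger value means the model spreads probability over semantically different answers, which is the signal used to flag a possible hallucination.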
Peer Reviews
Decision: ICLR 2025 Poster
The paper proposes an innovative and effective method to quantify uncertainty in language models, addressing a significant challenge in NLG. The approach of considering semantic diversity rather than relying on traditional sampling methods for uncertainty estimation is both straightforward and sound. By integrating a smaller NLI model, the paper also makes strides in computational efficiency, making it practical for real-world applications. The empirical results demonstrated across multiple data
- Since the method relies heavily on the NLI model to assess semantic equivalence, any biases inherent in the NLI model could skew the uncertainty estimations. The paper lacks a discussion on how to handle or mitigate potential biases within the NLI models, which could affect the reliability of the uncertainty measurements. I recommend including a more comprehensive evaluation of how variations in NLI models might affect the uncertainty measurement. - Also the dependency on an external NLI mode
- The paper provides a clear explanation of uncertainty estimation for LLMs, introducing a method to calculate semantic similarity in a simpler manner than existing clustering-based approaches. - The use of Importance Sampling for generating semantically diverse outputs is a strong methodological choice and, in my view, the paper’s most significant contribution.
- **Experimental Clarity**: The experimental section lacks clarity, particularly in explaining how ROUGE and BLEURT metrics were applied. The reference in L.409 (“in general...”) requires citation if it’s a general principle, and L.411-414 discussing AUROC are somewhat confusing, particularly the sentence “AUROC is used as a metric for classifying.” - **Baselines and Related Work**: It is unclear why some relevant works, such as [1], are not included as baselines. Additionally, a recent paper [2
Positive points include the following: - The paper is clearly written and easy to follow. - The simplicity of the method is appealing. - The evaluation setup closely follows prior work by [Kuhn et al. (2023)](https://arxiv.org/pdf/2302.09664), measuring AUROC on the same three datasets and with the same OPT models. - The paper uses the same open-access LLMs (OPT family) as Kuhn et al. (2023) and replicates their results, which makes them comparable. - Compared to Kuhn et al. (2023), the paper includes a
Weaknesses include the following: - The new method relies on three scores, but their importance remains untested. - The three scores are integral to the newly proposed method, but the paper lacks an ablation of the three scores (Ai, Sij, Iij). The current evaluation uses a simple mean (lines 426-428): "we derive the final token score ranking by straightforwardly averaging the three individual token score". I wonder whether all three scores (one for the token and two for alternative tokens)
Taxonomy
Topics: Topic Modeling
