CheckEmbed: Effective Verification of LLM Solutions to Open-Ended Tasks

Maciej Besta; Lorenzo Paleari; Marcin Copik; Robert Gerstenberger; Ales Kubicek; Piotr Nyczyk; Patrick Iff; Eric Schreiber; Tanja Srindran; Tomasz Lehmann; Hubert Niewiadomski; Torsten Hoefler

arXiv:2406.02524·cs.CL·July 11, 2025·1 cites

TL;DR

CheckEmbed (CE) is a scalable, embedding-based verification method for LLM outputs that effectively detects hallucinations and generalizes across modalities, outperforming prior token-level approaches.

Contribution

Introducing CheckEmbed, a novel embedding-based verification technique that improves accuracy and scalability for evaluating LLM solutions on open-ended tasks.

Findings

01

CE reliably detects hallucinations in LLM outputs.

02

CE outperforms token-based methods like BERTScore.

03

CE generalizes to non-text modalities such as vision.

Abstract

Large Language Models (LLMs) are transforming a wide range of domains, yet verifying their outputs remains a significant challenge, especially for complex open-ended tasks such as consolidation, summarization, and knowledge extraction. To address this, we introduce CheckEmbed (CE): a simple, scalable, and accurate verification method. CE reduces each LLM answer to a single embedding vector using powerful modern embedding LLM models like SFR-Embedding-Mistral. Prior methods such as BERTScore and SelfCheckGPT relied on weaker encoders like BERT, forcing them to operate at token or sentence granularity. In contrast, CE performs fast, semantically rich comparisons directly at the whole-answer level, overcoming key limitations in both accuracy and scalability. We conduct a comprehensive design and time complexity analysis across 13 verification baselines, including classical text scorers…
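The whole-answer comparison described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `toy_embed` is a hypothetical stand-in for a real embedding model such as SFR-Embedding-Mistral, and averaging the pairwise cosine similarities is an assumed way of summarizing agreement across several sampled answers.

```python
import math
import zlib
from itertools import combinations


def toy_embed(text, dim=64):
    """Hypothetical stand-in for a real embedding model.

    Maps a whole answer to ONE vector using hashed bag-of-words counts.
    """
    vec = [0.0] * dim
    for token in text.lower().split():
        # crc32 is deterministic across runs, unlike Python's built-in hash()
        vec[zlib.crc32(token.encode("utf-8")) % dim] += 1.0
    return vec


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def consistency_score(answers, embed_fn=toy_embed):
    """Mean pairwise cosine similarity over whole-answer embeddings.

    The mean is an assumed aggregate chosen for illustration only.
    """
    if len(answers) < 2:
        raise ValueError("need at least two sampled answers")
    vectors = [embed_fn(a) for a in answers]  # one embedding per answer
    sims = [cosine(u, v) for u, v in combinations(vectors, 2)]
    return sum(sims) / len(sims)


if __name__ == "__main__":
    samples = [
        "Paris is the capital of France",
        "The capital of France is Paris",
        "Lyon is the capital of France",
    ]
    print(consistency_score(samples))
```

In practice, `embed_fn` would wrap a call to an actual embedding model. A low score across samples would then suggest that the answers disagree, which is the signal a method of this kind would use to flag possible hallucination.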

Peer Reviews

Decision·Submitted to ICLR 2025

Reviewer 01 · Rating 3 · Confidence 4

Strengths

The paper addresses a crucial challenge in the verification of open-ended LLM tasks. The proposed method is straightforward and practical to implement.

Weaknesses

1. Despite claiming to verify open-ended tasks, the method effectively only works for hallucination detection tasks with definitive answers. Multiple divergent responses could all be valid for truly open-ended queries (e.g., "Tell me a joke"), which the method fails to accommodate.
2. Regarding Contributions 1 and 3, computing cosine similarity between text embeddings is a well-established approach. BERTScore and SelfCheckGPT intentionally utilize token-level and sentence-level information to o

Reviewer 02 · Rating 3 · Confidence 4

Strengths

The paper was generally well-written and easy to follow. The figures are generally very clear (in particular, Figure 2 was very helpful to quickly visualize the differences between approaches). The motivation seemed fairly clear: targeting a more efficient and accurate method for comparing generated texts. The technique largely leverages a simple embedding method; however, the authors explore ways of aggregating it through repeated LLM sampling.

Weaknesses

- Narrow evaluation scope: Much of the evaluation is conducted on synthetic or in-house data, raising questions about the generalizability of the results. The strongest improvements are seen on these datasets, while for the WikiBio task, other metrics, such as the NLI variant of SelfCheckGPT, demonstrate comparable performance. A more extensive evaluation on public datasets would provide stronger evidence for the method’s general applicability.
- Lack of novelty: Generally, this seems like a stra

Reviewer 03 · Rating 3 · Confidence 4

Strengths

I think comparing embeddings of replies (reply -> embed(reply)) should be much more effective than using GPT to compare two replies directly. If we can maintain performance while increasing preprocessing time, that would be ideal.

Weaknesses

- The paper's main claim about effectiveness is questionable. If the performance improvements primarily stem from NumPy's implementation of faster algorithms for computing cosine or correlation scores, then the scientific contribution is overclaimed.
- The claim about paragraph-level comparison versus token-level comparison improving performance is suspicious. Lines 149-150 indicate that the main methods are cosine similarity and Pearson correlation between paragraphs. However, computing cosine

Code & Models

Repositories

spcl/checkembed · Official

Taxonomy

Topics: Software Reliability and Analysis Research · Business Process Modeling and Analysis · Software Engineering Research

Methods: Cosine Annealing · Discriminative Fine-Tuning · Softmax · Layer Normalization · Weight Decay · Attention Dropout · Linear Layer · Linear Warmup With Cosine Annealing · Byte Pair Encoding