CheckEmbed: Effective Verification of LLM Solutions to Open-Ended Tasks
Maciej Besta, Lorenzo Paleari, Marcin Copik, Robert Gerstenberger, Ales Kubicek, Piotr Nyczyk, Patrick Iff, Eric Schreiber, Tanja Srindran, Tomasz Lehmann, Hubert Niewiadomski, Torsten Hoefler

TL;DR
CheckEmbed (CE) is a scalable, embedding-based verification method for LLM outputs that effectively detects hallucinations and generalizes across modalities, outperforming prior token-level approaches.
Contribution
Introducing CheckEmbed, a novel embedding-based verification technique that improves accuracy and scalability for evaluating LLM solutions on open-ended tasks.
Findings
CE reliably detects hallucinations in LLM outputs.
CE outperforms token-based methods like BERTScore.
CE generalizes to non-text modalities such as vision.
Abstract
Large Language Models (LLMs) are transforming a wide range of domains, yet verifying their outputs remains a significant challenge, especially for complex open-ended tasks such as consolidation, summarization, and knowledge extraction. To address this, we introduce CheckEmbed (CE): a simple, scalable, and accurate verification method. CE reduces each LLM answer to a single embedding vector using powerful modern embedding LLM models like SFR-Embedding-Mistral. Prior methods such as BERTScore and SelfCheckGPT relied on weaker encoders like BERT, forcing them to operate at token or sentence granularity. In contrast, CE performs fast, semantically rich comparisons directly at the whole-answer level, overcoming key limitations in both accuracy and scalability. We conduct a comprehensive design and time complexity analysis across 13 verification baselines, including classical text scorers…
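The whole-answer comparison described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: `toy_embed` is a hypothetical hashed bag-of-words stand-in for a real embedding model such as SFR-Embedding-Mistral, and averaging the off-diagonal cosine similarities across sampled answers is an assumed aggregation chosen only for illustration.

```python
import zlib
import numpy as np


def toy_embed(text, dim=64):
    """Hypothetical stand-in for an embedding model: hashed bag of words.

    A real pipeline would call an embedding model here instead.
    """
    vec = np.zeros(dim)
    for word in text.lower().split():
        # crc32 gives a deterministic bucket for each word.
        vec[zlib.crc32(word.encode("utf-8")) % dim] += 1.0
    return vec


def cosine_matrix(embeddings):
    """Pairwise cosine similarity between the rows of a 2-D array."""
    embeddings = np.asarray(embeddings, dtype=float)
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / np.clip(norms, 1e-12, None)  # avoid division by zero
    return unit @ unit.T


def agreement_score(answers, embed=toy_embed):
    """Mean off-diagonal cosine similarity over sampled answers.

    Assumed aggregation: one embedding per whole answer, then the average
    similarity over all pairs of distinct answers.
    """
    if len(answers) < 2:
        raise ValueError("need at least two answers to compare")
    sims = cosine_matrix(np.stack([embed(a) for a in answers]))
    off_diagonal = sims[~np.eye(len(answers), dtype=bool)]
    return float(off_diagonal.mean())


if __name__ == "__main__":
    samples = [
        "Paris is the capital of France",
        "The capital of France is Paris",
        "France has its capital in Lyon",
    ]
    print(agreement_score(samples))
```

Under these assumptions, a low score would flag a set of sampled answers that disagree with each other. How the score is thresholded or combined with a reference answer is left open here.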
Peer Reviews
Decision · Submitted to ICLR 2025
The paper addresses a crucial challenge in the verification of open-ended LLM tasks. The proposed method is straightforward and practical to implement.
1. Despite claiming to verify open-ended tasks, the method effectively only works for hallucination detection tasks with definitive answers. Multiple divergent responses could all be valid for truly open-ended queries (e.g., "Tell me a joke"), which the method fails to accommodate.
2. Regarding Contributions 1 and 3, computing cosine similarity between text embeddings is a well-established approach. BERTScore and SelfCheckGPT intentionally utilize token-level and sentence-level information to o
The paper was generally well-written and easy to follow. The figures are generally very clear (in particular, Figure 2 was very helpful to quickly visualize the differences between approaches). The motivation seemed fairly clear -- namely, targeting a more efficient and accurate method for comparing generated texts. The technique largely leverages a simple embedding method, however, they explore methods for aggregating this technique through repeated LLM sampling.
- Narrow evaluation scope: Much of the evaluation is conducted on synthetic or in-house data, raising questions about the generalizability of the results. The strongest improvements are seen on these datasets, while for the WikiBio task, other metrics, such as the NLI variant of SelfCheckGPT, demonstrate comparable performance. A more extensive evaluation on public datasets would provide stronger evidence for the method’s general applicability.
- Lack of novelty: Generally, this seems like a stra
I think comparing embeddings of replies (reply -> embed(reply)) should be much more effective than using GPT to compare two replies directly. If we can maintain performance while increasing preprocessing time, that would be ideal.
- The paper's main claim about effectiveness is questionable. If the performance improvements primarily stem from NumPy's implementation of faster algorithms for computing cosine or correlation scores, then the scientific contribution is overclaimed.
- The claim about paragraph-level comparison versus token-level comparison improving performance is suspicious. Lines 149-150 indicate that the main methods are cosine similarity and Pearson correlation between paragraphs. However, computing cosine
Taxonomy
Topics: Software Reliability and Analysis Research · Business Process Modeling and Analysis · Software Engineering Research
Methods: Cosine Annealing · Discriminative Fine-Tuning · Softmax · Layer Normalization · Weight Decay · Attention Dropout · Linear Layer · Linear Warmup With Cosine Annealing · Byte Pair Encoding
