Latent Space Chain-of-Embedding Enables Output-free LLM Self-Evaluation

Yiming Wang; Pei Zhang; Baosong Yang; Derek F. Wong; Rui Wang

arXiv:2410.13640·cs.CL·March 14, 2025

TL;DR

This paper introduces Chain-of-Embedding (CoE), a latent-space method that lets large language models perform output-free self-evaluation of response correctness in real time, without additional training.

Contribution

The proposed Chain-of-Embedding leverages internal hidden states for self-assessment, providing a training-free, efficient, and effective approach for LLM response correctness estimation.

Findings

01 Effective across four domains and seven LLMs

02 Label-free with millisecond-level computational cost

03 Provides insights into internal hidden state dynamics

Abstract

LLM self-evaluation relies on the LLM's own ability to estimate response correctness, which can greatly improve its deployment reliability. In this research track, we propose the Chain-of-Embedding (CoE) in the latent space to enable LLMs to perform output-free self-evaluation. CoE consists of all progressive hidden states produced during inference, which can be treated as the latent thinking path of LLMs. We find that when LLMs respond correctly and incorrectly, their CoE features differ; these discrepancies help us estimate LLM response correctness. Experiments in four diverse domains and seven LLMs demonstrate the effectiveness of our method. Meanwhile, its label-free design, which requires no training, and its millisecond-level computational cost ensure real-time feedback in large-scale scenarios. More importantly, we provide interesting insights into LLM response…
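The idea of treating per-layer hidden states as a path can be illustrated with a small sketch. This is not the authors' implementation and does not reproduce the paper's scoring formula; `trajectory_features` is a hypothetical helper, and the choice of one vector per layer (for example, a hidden state averaged over tokens) is an assumption made only for illustration. It measures how far and how sharply the path moves between adjacent layers, relative to the overall move from the first layer to the last.

```python
import math


def _norm(v):
    """Euclidean length of a vector given as a list of floats."""
    return math.sqrt(sum(x * x for x in v))


def _angle(u, v):
    """Angle in radians between two vectors; 0.0 if either has zero length."""
    denom = _norm(u) * _norm(v)
    if denom == 0.0:
        return 0.0
    cos = sum(a * b for a, b in zip(u, v)) / denom
    # Clamp to [-1, 1] to guard against floating-point drift.
    return math.acos(max(-1.0, min(1.0, cos)))


def trajectory_features(layer_states):
    """Summarize a layer-by-layer path of hidden-state vectors.

    layer_states: list of equal-length float lists, one per layer
    (hypothetical input format, assumed for this sketch).
    Returns (mean step length / first-to-last distance,
             mean turning angle / first-to-last angle).
    """
    if len(layer_states) < 2:
        raise ValueError("need at least two layer states")
    first, last = layer_states[0], layer_states[-1]
    total_len = _norm([b - a for a, b in zip(first, last)])
    total_ang = _angle(first, last)

    steps, turns = [], []
    for prev, nxt in zip(layer_states, layer_states[1:]):
        steps.append(_norm([b - a for a, b in zip(prev, nxt)]))
        turns.append(_angle(prev, nxt))

    eps = 1e-12  # avoids division by zero for degenerate paths
    rel_step = (sum(steps) / len(steps)) / (total_len + eps)
    rel_turn = (sum(turns) / len(turns)) / (total_ang + eps)
    return rel_step, rel_turn
```

Under these assumptions, features like these could be computed for each response from hidden states already produced during inference and then compared between responses known to be correct and incorrect. No labels or training are needed to compute the features themselves.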

Peer Reviews

Decision·ICLR 2025 Poster

Reviewer 01·Rating 6·Confidence 3

Strengths

It is quite interesting to discover that the hidden states of LLMs can be used to estimate LLM response correctness without any labels, which inspires me a lot. Moreover, it is also reasonable to treat the progressive hidden states as the latent thinking path of LLMs, leading to the following assumption: CoE discrepancies may arise when LLMs generate correct and incorrect responses. Overall, it is quite an inspiring paper and gives readers many insights into the thinking system of

Weaknesses

I do not think this paper has any significant weaknesses.

Reviewer 02·Rating 6·Confidence 3

Strengths

1. The literature coverage of this paper is comprehensive. 2. The exploration is structured and natural. Moving from an intuitive cognitive phenomenon to observation, the authors verify the assumed discrepancy between correct and incorrect generation. The authors then propose a metric and demonstrate its effectiveness. The idea is well-founded. 3. The authors verify their proposed metrics on various backbones and datasets. 4. The authors give a detailed analysis of many aspects.

Weaknesses

1. Critical statistical analysis of the discrepancies is missing. The proposed metrics rest on the existence of a discrepancy between correct and incorrect generation. However, only visualizing this discrepancy is not enough. 2. Why is the targeted task self-evaluation? Self-evaluation is a task that seems appealing but is quite ambiguous and questionable; even if the reviewer reads the reference paper cited by the authors for self-evaluation, there is no task definition of self-eval

Reviewer 03·Rating 6·Confidence 1

Strengths

1. CoE uses internal hidden states rather than output-based confidence, enhancing adaptability to new tasks without training data. 2. With millisecond-level computation, CoE scales effectively, offering quick feedback for large-scale deployments. 3. CoE provides insights into LLM decision-making by tracing hidden states, offering a transparent, human-like assessment of response correctness.

Weaknesses

1. My major concern is the scenario, and specifically the necessity of the label-free setting. If we limit LLM queries to only math questions or commonQA questions, then the label-free setting may not be necessary. Few-shot human labeling can achieve satisfactory performance. The key reason for going label-free is the open-world scenario, in which queries can be arbitrary and come from various domains. It is impossible for us to label across queries with different purposes. Considering the open-world assumption, th

Code & Models

Repositories

alsace08/chain-of-embedding
PyTorch·Official

Taxonomy

Topics: Natural Language Processing Techniques