Chain-of-Verification Reduces Hallucination in Large Language Models

Shehzaad Dhuliawala; Mojtaba Komeili; Jing Xu; Roberta Raileanu; Xian Li; Asli Celikyilmaz; Jason Weston

arXiv:2309.11495·cs.CL·September 26, 2023·39 cites

TL;DR

The paper introduces Chain-of-Verification (CoVe), a method in which a large language model drafts a response, plans verification questions about that draft, answers them independently, and then produces a verified final response, reducing hallucinations across several tasks.

Contribution

It presents a novel self-verification framework for language models that systematically reduces hallucinations through multi-step fact-checking.

Findings

01. CoVe reduces hallucinations in list-based questions from Wikidata.

02. It improves accuracy in closed book MultiSpanQA tasks.

03. It enhances factual consistency in longform text generation.

Abstract

Generation of plausible yet incorrect factual information, termed hallucination, is an unsolved issue in large language models. We study the ability of language models to deliberate on the responses they give in order to correct their mistakes. We develop the Chain-of-Verification (CoVe) method whereby the model first (i) drafts an initial response; then (ii) plans verification questions to fact-check its draft; (iii) answers those questions independently so the answers are not biased by other responses; and (iv) generates its final verified response. In experiments, we show CoVe decreases hallucinations across a variety of tasks, from list-based questions from Wikidata, closed book MultiSpanQA and longform text generation.
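
The four steps listed in the abstract can be sketched as a short prompting loop. This is a minimal illustration under stated assumptions, not the authors' implementation: the `llm` callable (prompt string in, completion string out), the function name `chain_of_verification`, and all prompt wording are hypothetical and introduced here only for the example.

```python
from typing import Callable, List, Tuple

# Hypothetical interface: any function mapping a prompt to a completion.
LLM = Callable[[str], str]


def chain_of_verification(query: str, llm: LLM) -> str:
    """Sketch of the draft / plan / verify / revise loop from the abstract."""
    # (i) Draft an initial response.
    draft = llm(f"Answer the question.\nQuestion: {query}\nAnswer:")

    # (ii) Plan verification questions that fact-check the draft.
    plan = llm(
        "List verification questions, one per line, that fact-check the draft.\n"
        f"Question: {query}\nDraft: {draft}\nVerification questions:"
    )
    questions: List[str] = [q.strip() for q in plan.splitlines() if q.strip()]

    # (iii) Answer each question independently. The prompt deliberately
    # omits the draft and the other answers so they cannot bias this one.
    checks: List[Tuple[str, str]] = [
        (q, llm(f"Answer concisely.\nQuestion: {q}\nAnswer:")) for q in questions
    ]

    # (iv) Generate the final response using the verification results.
    evidence = "\n".join(f"Q: {q}\nA: {a}" for q, a in checks)
    return llm(
        "Revise the draft so it is consistent with the verification results.\n"
        f"Question: {query}\nDraft: {draft}\n"
        f"Verification results:\n{evidence}\nFinal answer:"
    )
```

Any function that maps a prompt string to a completion string can be passed as `llm`. The design point the sketch tries to reflect is step (iii): each verification question is answered in a prompt that contains neither the draft nor the other answers, which is how the abstract describes keeping the answers unbiased.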

Peer Reviews

Decision·Submitted to ICLR 2024

Reviewer 01 · Rating 5 · marginally below the acceptance threshold · Confidence 4

Strengths

1. **Clarity**. Overall, the writing is clear and easy to follow. In addition, the organization of the main draft is well-established.
2. **Well motivated problem**. Reducing the hallucination and improving the factuality of LLMs is an interesting and important problem. To this end, considering the improved prompting framework is a reasonable and well-motivated direction.
3. **Simple and efficient method.** The proposed method is simple and can be applicable regardless of the types of LLMs. Als

Weaknesses

1. **Absence of necessary baselines**. As the authors point out in the Related Work section, there are many relevant prompting-based works that reduce the hallucination of LLMs [1,2,3] or improve LLMs’ reasoning [4,5]. However, these baselines are never compared against in the experiments. Therefore, it is hard to verify the effectiveness of the proposed CoVe relative to them.
2. **Difficulty of direct comparison**. Currently, only zero-shot (or CoT) results are presented for LLaMA2 and f

Reviewer 02 · Rating 3 · reject, not good enough · Confidence 4

Strengths

- Mitigating hallucinations in LLMs is a timely and important topic
- CoVe is simple and widely applicable
- Various variants are tested out

Weaknesses

- The technical contribution is thin. I had a hard time justifying the paper’s technical contribution since CoVe looks very similar to [1].
- The two Wikipedia QA datasets (Wikidata and Wiki-category list) are created from simple templates and are rather toyish. I am not sure how much they add to the paper.
- Hallucinations are especially tricky to address in generation; the only generation task considered is biography generation, which is way less challenging than most real-world applications
- Th

Reviewer 03 · Rating 5 · marginally below the acceptance threshold · Confidence 3

Strengths

1) The writing is good.
2) The idea of correcting LLM responses by asking and answering verification questions from the model itself is valuable.
3) The proposed method is effective in alleviating hallucination problems.

Weaknesses

1) There is no comparative analysis with other methods aimed at mitigating the hallucination issue. It remains to be clarified whether CoVe improves performance relative to those methods.
2) More instances of prompts are required. For example, in Section 3.3, some example prompts need to be provided to distinguish between the several verification variants.
3) More examples of the use of CoVe in different tasks are needed.

Code & Models

Repositories

lastmile-ai/aiconfig/tree/main/cookbooks/Chain-of-Verification

Videos

Chain-of-Verification Reduces Hallucination in Large Language Models

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Machine Learning in Healthcare

Methods: Sigmoid Activation · Tanh Activation · GloVe Embeddings · Long Short-Term Memory · Sequence to Sequence · Softmax · Bidirectional LSTM · Location-based Attention · Contextual Word Vectors