Chain-of-Verification Reduces Hallucination in Large Language Models
Shehzaad Dhuliawala, Mojtaba Komeili, Jing Xu, Roberta Raileanu, Xian Li, Asli Celikyilmaz, Jason Weston

TL;DR
The paper introduces Chain-of-Verification (CoVe), a method in which a large language model drafts a response, plans verification questions about the draft, answers them independently, and then writes a final verified response, reducing hallucinations across several tasks.
Contribution
It presents a novel self-verification framework for language models that systematically reduces hallucinations through multi-step fact-checking.
Findings
CoVe reduces hallucinations in list-based questions from Wikidata
CoVe improves accuracy on closed-book MultiSpanQA
CoVe enhances factual consistency in longform text generation
Abstract
Generation of plausible yet incorrect factual information, termed hallucination, is an unsolved issue in large language models. We study the ability of language models to deliberate on the responses they give in order to correct their mistakes. We develop the Chain-of-Verification (CoVe) method whereby the model first (i) drafts an initial response; then (ii) plans verification questions to fact-check its draft; (iii) answers those questions independently so the answers are not biased by other responses; and (iv) generates its final verified response. In experiments, we show CoVe decreases hallucinations across a variety of tasks, from list-based questions from Wikidata, closed book MultiSpanQA and longform text generation.
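The four steps in the abstract can be sketched as a short pipeline. This is a minimal illustration, not the authors' implementation: the `LLM` callable, the function name `chain_of_verification`, and all prompt wordings are assumptions introduced here, and the paper's actual prompts and variants may differ.

```python
from typing import Callable, List

# Hypothetical interface (an assumption): any function mapping a prompt to a completion.
LLM = Callable[[str], str]


def chain_of_verification(query: str, llm: LLM) -> str:
    """Run the four CoVe steps from the abstract with illustrative prompts."""
    # (i) Draft an initial response.
    draft = llm(f"Answer the question.\nQuestion: {query}\nAnswer:")

    # (ii) Plan verification questions that fact-check the draft.
    plan = llm(
        "List verification questions, one per line, that fact-check this answer.\n"
        f"Question: {query}\nAnswer: {draft}\nVerification questions:"
    )
    questions: List[str] = [q.strip() for q in plan.splitlines() if q.strip()]

    # (iii) Answer each question independently: the prompt holds only the
    # verification question, never the draft or the other answers.
    answers = [llm(f"Answer concisely.\nQuestion: {q}\nAnswer:") for q in questions]

    # (iv) Generate the final verified response from the draft plus the checks.
    checks = "\n".join(f"Q: {q}\nA: {a}" for q, a in zip(questions, answers))
    return llm(
        "Revise the draft so it agrees with the verification results.\n"
        f"Question: {query}\nDraft: {draft}\nVerification results:\n{checks}\n"
        "Final answer:"
    )
```

Any prompt-to-text function can be passed as `llm`, for example a thin wrapper around a model API. Keeping the draft out of the step (iii) prompts is one way to realize the abstract's requirement that verification answers are "not biased by other responses".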
Peer Reviews
Decision: Submitted to ICLR 2024
1. **Clarity**. Overall, the writing is clear and easy to follow. In addition, the organization of the main draft is well-established.
2. **Well motivated problem**. Reducing the hallucination and improving the factuality of LLMs is an interesting and important problem. To this end, considering the improved prompting framework is a reasonable and well-motivated direction.
3. **Simple and efficient method.** The proposed method is simple and can be applicable regardless of the types of LLMs. Als
1. **Absence of necessary baselines**. As the authors pointed out in the Related Work section, there are many relevant prompting-based works that reduce the hallucination of LLMs [1,2,3] or improve LLMs’ reasoning [4,5]. However, CoVe is never compared against these baselines in the current experiments. Therefore, it’s hard to verify the effectiveness of the proposed CoVe compared to them.
2. **Difficulty of direct comparison**. Currently, only zero-shot (or CoT) results are presented for LLaMA2 and f
- Mitigating hallucinations in LLMs is a timely and important topic
- CoVe is simple and widely applicable
- Various variants are tested out
- The technical contribution is thin. I had a hard time justifying the paper’s technical contribution since CoVe looks very similar to [1].
- The two Wikipedia QA datasets (Wikidata and Wiki-category list) are created from simple templates and are rather toyish. I am not sure how much they add to the paper.
- Hallucinations are especially tricky to address in generation; the only generation task considered is biography generation, which is way less challenging than most real-world applications
- Th
1) The writing is good.
2) The idea of correcting LLM responses by asking and answering verification questions from the model itself is valuable.
3) The proposed method is effective in alleviating hallucination problems.
1) There is an absence of comparative analysis with other methods aimed at mitigating the hallucination issue. It remains to be clarified whether CoVe offers an enhancement in performance relative to other methods.
2) More instances of prompts are required. For example, in Section 3.3, some examples of prompts need to be provided to distinguish between the several verification variants.
3) More examples of the use of CoVe in different tasks are needed.
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Machine Learning in Healthcare
Methods: Sigmoid Activation · Tanh Activation · GloVe Embeddings · Long Short-Term Memory · Sequence to Sequence · Softmax · Bidirectional LSTM · Location-based Attention · Contextual Word Vectors
