Teaching Language Models to Hallucinate Less with Synthetic Tasks

Erik Jones; Hamid Palangi; Clarisse Simões; Varun Chandrasekaran; Subhabrata Mukherjee; Arindam Mitra; Ahmed Awadallah; Ece Kamar

arXiv:2310.06827·cs.CL·November 8, 2023·5 cites

Open Access · 3 Reviews

TL;DR

This paper introduces SynTra, a method that reduces hallucination in large language models by optimizing the system message on a synthetic task, and shows that the optimized system message transfers to real-world abstractive summarization tasks.

Contribution

The paper presents SynTra, a novel approach that uses synthetic tasks to effectively reduce hallucinations in LLMs by optimizing system messages, not model weights.
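To make "optimizing system messages, not model weights" concrete, here is a toy sketch of gradient descent on a learnable prefix vector while the model's weights stay frozen. It is only a stand-in for prefix-tuning, not the paper's implementation: the two-dimensional logistic "model", the constants `FROZEN_W` and `FROZEN_B`, and the helper names are all assumptions made for illustration.

```python
import math

# Hypothetical frozen "model" parameters; they are never updated below.
FROZEN_W = [1.5, -0.5]
FROZEN_B = 0.2


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def p_hallucinate(prefix, x):
    """Toy probability of hallucinating, given a soft prefix and an input."""
    z = FROZEN_B + sum(w * (p + xi) for w, p, xi in zip(FROZEN_W, prefix, x))
    return sigmoid(z)


def mean_loss(prefix, inputs):
    """Average negative log-probability of NOT hallucinating."""
    return sum(-math.log(1.0 - p_hallucinate(prefix, x)) for x in inputs) / len(inputs)


def tune_prefix(inputs, steps=200, lr=0.1):
    """Gradient descent on the prefix only; the model weights stay fixed."""
    prefix = [0.0, 0.0]
    for _ in range(steps):
        # d/dz of -log(1 - sigmoid(z)) is sigmoid(z), and dz/dprefix_i is w_i.
        g = sum(p_hallucinate(prefix, x) for x in inputs) / len(inputs)
        prefix = [p - lr * g * w for p, w in zip(prefix, FROZEN_W)]
    return prefix


# Made-up "synthetic task" inputs used only for optimization.
synthetic_inputs = [[0.3, 0.1], [0.8, -0.2], [-0.1, 0.4]]
before = mean_loss([0.0, 0.0], synthetic_inputs)
tuned = tune_prefix(synthetic_inputs)
after = mean_loss(tuned, synthetic_inputs)
```

In this toy, the tuned prefix also lowers `p_hallucinate` on an input that was not used during optimization, which loosely mirrors the transfer step. A real language model offers no such guarantee, which is why the paper evaluates transfer empirically.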

Findings

01

Synthetic task optimization reduces hallucinations in real tasks.

02

Optimizing system messages is more effective than fine-tuning model weights.

03

SynTra achieves hallucination reduction with only synthetic supervision.

Abstract

Large language models (LLMs) frequently hallucinate on abstractive summarization tasks such as document-based question-answering, meeting summarization, and clinical report generation, even though all necessary information is included in context. However, optimizing LLMs to hallucinate less on these tasks is challenging, as hallucination is hard to efficiently evaluate at each optimization step. In this work, we show that reducing hallucination on a synthetic task can also reduce hallucination on real-world downstream tasks. Our method, SynTra, first designs a synthetic task where hallucinations are easy to elicit and measure. It next optimizes the LLM's system message via prefix-tuning on the synthetic task, and finally transfers the system message to realistic, hard-to-optimize tasks. Across three realistic abstractive summarization tasks, SynTra reduces hallucination for two…
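The abstract's first step, a synthetic task where hallucination is easy to measure, can be sketched as follows. The name-list retrieval setup, the prompt wording, and the helpers `make_task` and `hallucination_rate` are assumptions for illustration. They show one way an exact, cheap hallucination metric can exist when every valid answer is in the context.

```python
def make_task(names, letter, k):
    """Build a prompt and the set of answers that are grounded in the context."""
    prompt = (
        "Names: " + ", ".join(names) + "\n"
        f"List the first {k} names that start with '{letter}'."
    )
    return prompt, set(names)


def hallucination_rate(output_names, allowed):
    """Fraction of generated names that do not appear in the context list."""
    if not output_names:
        return 0.0
    invented = [n for n in output_names if n not in allowed]
    return len(invented) / len(output_names)


names = ["Alice", "Aaron", "Bella", "Amir", "Carla"]
prompt, allowed = make_task(names, "A", 2)

# Hypothetical model output: "Anna" is not in the context, so it is hallucinated.
model_output = ["Alice", "Anna"]
rate = hallucination_rate(model_output, allowed)
```

Because the metric is a set-membership check, it can be computed at every optimization step, which the abstract says is hard to do for realistic summarization tasks.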

Peer Reviews

Decision·ICLR 2024 poster

Reviewer 01 · Rating 6 · marginally above the acceptance threshold · Confidence 3

Strengths

1. The paper addresses a big open problem (hallucinations in LLMs) and stays up-to-date with state-of-the-art advancements in the field (using benchmarks such as ACI-Bench and models such as Vicuna).
2. I like the general “[just ask for generalization](https://www.notion.so/Teaching-language-models-to-hallucinate-less-with-synthetic-tasks-128f91205a1d45208d3add1298c95f18?pvs=21)” approach to reducing hallucinations that does not use human feedback or human demonstrations directly. One could hope

Weaknesses

1. The authors only experiment with one model size (13B), and it’s hard to say how well their method scales. Will the benefits of SynTra increase or decrease for finetuned LLaMA 30B or 65B? What about LLaMA 6.5B?
2. The absolute improvements in hallucination rate (Table 1) don’t strike me as very high. I’m not sure they justify the software complexity of soft system message optimization.
3. Relatedly, it would be good to compare SynTra to some simpler baselines, e.g. finetuning on gold labels for

Reviewer 02 · Rating 6 · marginally above the acceptance threshold · Confidence 4

Strengths

- The optimized model or system message can be transferred to real-world downstream tasks. Experiments show this reduces hallucination compared to the 13B parameter models Vicuna and Orca on tasks like search-and-retrieve, meeting summarization, and clinical report generation.
- Novel idea of using a synthetic task to easily isolate and optimize against hallucinations.

Weaknesses

- Lack of baselines: The proposed method was applied to existing fine-tuned LLMs like Vicuna and Orca, and the experiments only show that they become better than the original LLMs. However, we do not know how the quality of the SynTra task compares to other datasets. For example, a fairer comparison would be finetuning two LLMs starting from LLaMA, one on the Vicuna ShareGPT data, the other on the SynTra data. I wonder if the effects of the SynTra data are not as much as the Vicuna data, in t

Reviewer 03 · Rating 6 · marginally above the acceptance threshold · Confidence 3

Strengths

Originality: Though previous works have characterized hallucination using synthetic tasks, this paper goes a step further by utilizing synthetic data to actively reduce hallucination. Quality and Significance: The paper's relevance is evident, considering the persistent challenge of hallucination in LLMs. By offering a tangible method to evaluate and optimize against such hallucinations, the paper contributes a valuable tool to the domain. Clarity: The paper is generally well-written.

Weaknesses

The evaluation of SYNTRA is somewhat limited, as it focuses on only 2 models and 3 realistic tasks. This raises concerns about the method's ability to generalize across diverse models and tasks. Also, the effect on Vicuna seems marginal. This weak effect could limit the practical utility of SYNTRA, especially if similar results persist across other models. I believe the way the results are presented can be improved. There are some key takeaway messages that are hidden in the table/figure. Impo

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Text Readability and Simplification