Teaching Language Models to Hallucinate Less with Synthetic Tasks
Erik Jones, Hamid Palangi, Clarisse Simões, Varun Chandrasekaran, Subhabrata Mukherjee, Arindam Mitra, Ahmed Awadallah, Ece Kamar

TL;DR
This paper introduces SynTra, a synthetic task-based method to reduce hallucinations in large language models, demonstrating that optimizing system messages on synthetic tasks can transfer to real-world summarization tasks.
Contribution
The paper presents SynTra, a novel approach that uses synthetic tasks to effectively reduce hallucinations in LLMs by optimizing system messages, not model weights.
Findings
Synthetic task optimization reduces hallucinations in real tasks.
Optimizing system messages is more effective than fine-tuning model weights.
SynTra achieves hallucination reduction with only synthetic supervision.
Abstract
Large language models (LLMs) frequently hallucinate on abstractive summarization tasks such as document-based question-answering, meeting summarization, and clinical report generation, even though all necessary information is included in context. However, optimizing LLMs to hallucinate less on these tasks is challenging, as hallucination is hard to efficiently evaluate at each optimization step. In this work, we show that reducing hallucination on a synthetic task can also reduce hallucination on real-world downstream tasks. Our method, SynTra, first designs a synthetic task where hallucinations are easy to elicit and measure. It next optimizes the LLM's system message via prefix-tuning on the synthetic task, and finally transfers the system message to realistic, hard-to-optimize tasks. Across three realistic abstractive summarization tasks, SynTra reduces hallucination for two…
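The abstract's first step, a synthetic task where hallucinations are easy to elicit and measure, can be illustrated with a minimal sketch. Everything below is an assumption made for illustration rather than a detail taken from the text above: the name-retrieval setup, the `NAME_POOL` vocabulary, and the helpers `make_example`, `is_hallucination`, and `hallucination_rate` are all hypothetical. The point is only that when every valid answer must be copied from the context, a hallucination reduces to a set-membership check that is cheap enough to run at each optimization step.

```python
import random

# Hypothetical vocabulary (an assumption); any pool of distinct strings works.
NAME_POOL = ["Alice", "Aaron", "Bella", "Boris", "Carla", "Chen", "Dmitri", "Dana"]


def make_example(rng, list_size=5):
    """Build one synthetic prompt: a list of names plus a retrieval question."""
    names = rng.sample(NAME_POOL, list_size)
    letter = rng.choice(names)[0]  # guarantees at least one correct answer
    prompt = (
        "Names: " + ", ".join(names) + "\n"
        f"List the names above that start with '{letter}'."
    )
    return names, letter, prompt


def is_hallucination(context_names, output_names):
    """Flag an output that contains any name absent from the context."""
    allowed = set(context_names)
    return any(name not in allowed for name in output_names)


def hallucination_rate(examples, outputs):
    """Fraction of outputs flagged as hallucinated (0.0 for an empty batch)."""
    if not examples:
        return 0.0
    flagged = sum(
        is_hallucination(names, out)
        for (names, _, _), out in zip(examples, outputs)
    )
    return flagged / len(examples)
```

In a setup like this, `outputs` would come from parsing model responses to each prompt, and the resulting rate could serve as the training signal for the system message before it is transferred to the realistic tasks.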
Peer Reviews
Decision: ICLR 2024 poster
1. The paper addresses a big open problem (hallucinations in LLMs) and stays up-to-date with state-of-the-art advancements in the field (using benchmarks such as ACI-Bench and models such as Vicuna).
2. I like the general “[just ask for generalization](https://www.notion.so/Teaching-language-models-to-hallucinate-less-with-synthetic-tasks-128f91205a1d45208d3add1298c95f18?pvs=21)” approach to reducing hallucinations that does not use human feedback or human demonstrations directly. One could hope
1. The authors only experiment with one model size (13B) and it’s hard to say how well their method scales. Will the benefits of SynTra increase or decrease for finetuned LLaMA 30B or 65B? What about LLaMA 6.5B?
2. The absolute improvements in hallucination rate (Table 1) don’t strike me as very high. I’m not sure they justify the software complexity of soft system message optimization.
3. Relatedly, it would be good to compare SynTra to some simpler baselines, e.g. finetuning on gold labels for
- The optimized model or system message can be transferred to real-world downstream tasks. Experiments show this reduces hallucination compared to the 13B parameter models Vicuna and Orca on tasks like search-and-retrieve, meeting summarization, and clinical report generation.
- Novel idea of using a synthetic task to easily isolate and optimize against hallucinations.
- Lack of baselines: The proposed method was applied to existing fine-tuned LLMs like Vicuna and Orca, and the experiments only show that they become better than the original LLMs. However, we do not know how the quality of the SynTra task compares to other datasets. For example, a fairer comparison would be finetuning two LLMs starting from LLaMA, one on the Vicuna ShareGPT data, the other on the SynTra data. I wonder if the effects of the SynTra data are not as much as the Vicuna data, in t
Originality: Though previous works have characterized hallucination using synthetic tasks, this paper goes a step further by utilizing synthetic data to actively reduce hallucination. Quality and Significance: The paper's relevance is evident, considering the persistent challenge of hallucination in LLMs. By offering a tangible method to evaluate and optimize against such hallucinations, the paper contributes a valuable tool to the domain. Clarity: The paper is generally well-written.
The evaluation of SYNTRA is somewhat limited, as it focuses on only 2 models and 3 realistic tasks. This raises concerns about the method's ability to generalize across diverse models and tasks. Also, the effect on Vicuna seems marginal. This weak effect could limit the practical utility of SYNTRA, especially if similar results persist across other models. I believe the way the results are presented can be improved. There are some key takeaway messages that are hidden in the table/figure. Impo
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Text Readability and Simplification
