GRATH: Gradual Self-Truthifying for Large Language Models
Weixin Chen, Dawn Song, Bo Li

TL;DR
GRATH is a self-supervised post-processing method that iteratively improves the truthfulness of large language models by generating pairwise truthfulness training data and training on it with direct preference optimization (DPO), achieving state-of-the-art results on TruthfulQA.
Contribution
The paper introduces GRATH, a new iterative self-truthifying approach that enhances LLM truthfulness without sacrificing other capabilities, outperforming larger models on benchmarks.
Findings
GRATH significantly improves the truthfulness of 7B LLMs.
Achieves state-of-the-art accuracy on TruthfulQA.
Enhances model truthfulness without degrading other capabilities.
Abstract
Truthfulness is paramount for large language models (LLMs) as they are increasingly deployed in real-world applications. However, existing LLMs still struggle with generating truthful content, as evidenced by their modest performance on benchmarks like TruthfulQA. To address this issue, we propose GRAdual self-truTHifying (GRATH), a novel post-processing method to enhance truthfulness of LLMs. GRATH utilizes out-of-domain question prompts to generate pairwise truthfulness training data with each pair containing a question and its correct and incorrect answers, and then optimizes the model via direct preference optimization (DPO) to learn from the truthfulness difference between answer pairs. GRATH iteratively refines truthfulness data and updates the model, leading to a gradual improvement in model truthfulness in a self-supervised manner. Empirically, we evaluate GRATH using different…
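The abstract says the model learns from the truthfulness difference between answer pairs via DPO. As a rough illustration, the standard per-pair DPO objective can be sketched as below. This is a minimal sketch of the generic DPO loss, not the authors' implementation: the names `TruthfulnessPair` and `dpo_loss`, the default `beta=0.1`, and the use of plain floats for sequence log-probabilities are all assumptions made for illustration.

```python
import math
from dataclasses import dataclass


@dataclass
class TruthfulnessPair:
    """Hypothetical container: a question with one correct and one incorrect answer."""
    question: str
    correct_answer: str
    incorrect_answer: str


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_loss(
    policy_logp_correct: float,
    policy_logp_incorrect: float,
    ref_logp_correct: float,
    ref_logp_incorrect: float,
    beta: float = 0.1,  # assumed value; the source does not state one
) -> float:
    """Standard DPO loss for a single (correct, incorrect) answer pair.

    Each argument is the summed log-probability of an answer given the
    question, under either the model being trained (policy) or a frozen
    reference model.
    """
    # How much more the policy favors each answer than the reference does.
    correct_margin = policy_logp_correct - ref_logp_correct
    incorrect_margin = policy_logp_incorrect - ref_logp_incorrect
    # The loss shrinks as the policy shifts probability toward the correct answer.
    return -log_sigmoid(beta * (correct_margin - incorrect_margin))


if __name__ == "__main__":
    pair = TruthfulnessPair(
        question="Example question",
        correct_answer="Example correct answer",
        incorrect_answer="Example incorrect answer",
    )
    # Placeholder log-probabilities, not real model outputs.
    print(pair.question, dpo_loss(-1.0, -3.0, -2.0, -2.0))
```

When the policy and the reference assign identical log-probabilities, the loss equals log 2 (about 0.693); it falls below that value once the policy favors the correct answer more strongly than the reference does.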
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques
