Contrastive Error Attribution for Finetuned Language Models
Faisal Ladhak, Esin Durmus, Tatsunori Hashimoto

TL;DR
This paper introduces a contrastive error attribution method to identify and remove low-quality training data, significantly reducing hallucinations and errors in language models' outputs.
Contribution
It proposes a novel contrast-based error tracing technique that outperforms existing methods in detecting data errors affecting model faithfulness.
Findings
Achieves a mean average precision of 0.93 at detecting known data errors across synthetic tasks
Reduces entity hallucinations by 70% on NYT dataset
Reduces semantic errors by 55% on E2E dataset
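For readers unfamiliar with the metric in the first finding, the sketch below shows one common way to compute average precision for a ranking of training examples by suspicion score, and the mean over several rankings. The function names and the toy labels are illustrative assumptions, not taken from the paper, and the paper's exact evaluation protocol may differ.

```python
def average_precision(ranked_labels):
    """Average precision for one ranking.

    ranked_labels: list of 0/1 flags ordered from most to least suspicious,
    where 1 marks a known data error (hypothetical toy input).
    """
    hits = 0
    precision_sum = 0.0
    for rank, label in enumerate(ranked_labels, start=1):
        if label:
            hits += 1
            # Precision at this rank, counted only where an error is found.
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0


def mean_average_precision(rankings):
    """Mean of the per-ranking average precisions."""
    return sum(average_precision(r) for r in rankings) / len(rankings)


# Toy rankings: the second one places every known error at the top.
toy_rankings = [[1, 0, 1, 0], [1, 1, 0]]
map_score = mean_average_precision(toy_rankings)
```

A score of 1.0 means every known error is ranked above every clean example; interleaving clean examples among the errors lowers the score.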
Abstract
Recent work has identified noisy and misannotated data as a core cause of hallucinations and unfaithful outputs in Natural Language Generation (NLG) tasks. Consequently, identifying and removing these examples is a key open challenge in creating reliable NLG systems. In this work, we introduce a framework to identify and remove low-quality training instances that lead to undesirable outputs, such as faithfulness errors in text summarization. We show that existing approaches for error tracing, such as gradient-based influence measures, do not perform reliably for detecting faithfulness errors in NLG datasets. We overcome the drawbacks of existing error tracing methods through a new, contrast-based estimate that compares undesired generations to human-corrected outputs. Our proposed method can achieve a mean average precision of 0.93 at detecting known data errors across synthetic tasks…
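To make the idea of a contrast between an undesired output and its corrected version concrete, here is a minimal sketch of a first-order, gradient-similarity contrast score on a toy logistic model. This is an illustrative stand-in, not the estimator from the paper: the model, the scoring rule, and all names (`loss_grad`, `contrast_scores`, the toy data) are assumptions made for the example. The score approximates how much one gradient step on a training example would raise the likelihood of the undesired output relative to the corrected one.

```python
import math


def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))


def loss_grad(w, x, y):
    """Gradient of the logistic negative log-likelihood for one example."""
    p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
    return [(p - y) * xi for xi in x]


def contrast_scores(w, train, x_test, y_bad, y_fixed):
    """Score each training example against a (bad output, corrected output) pair.

    Assumed scoring rule: dot product of the training-example gradient with
    grad L(x_test, y_bad) - grad L(x_test, y_fixed). Higher means a step on
    that example pushes the model toward the bad output.
    """
    g_bad = loss_grad(w, x_test, y_bad)
    g_fixed = loss_grad(w, x_test, y_fixed)
    contrast = [b - f for b, f in zip(g_bad, g_fixed)]
    return [
        sum(gz * c for gz, c in zip(loss_grad(w, x, y), contrast))
        for x, y in train
    ]


# Toy setup: the first training example carries the wrong label (1) for an
# input that matches the test input; the second has the corrected label (0);
# the third is unrelated to the test input.
w = [0.0, 0.0]
train = [([1.0, 0.0], 1), ([1.0, 0.0], 0), ([0.0, 1.0], 1)]
scores = contrast_scores(w, train, x_test=[1.0, 0.0], y_bad=1, y_fixed=0)
```

In this toy case the mislabeled example receives the highest score, the correctly labeled one the lowest, and the unrelated one a score of zero, so ranking by score would surface the mislabeled example first.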
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Text Readability and Simplification
