Teaching with Lies: Curriculum DPO on Synthetic Negatives for Hallucination Detection
Shrey Pandit, Ashwin Vinod, Liu Leqi, Ying Ding

TL;DR
This paper introduces HaluCheck, a curriculum learning approach using synthetic hallucination negatives in DPO training, significantly enhancing LLM hallucination detection and robustness across benchmarks.
Contribution
It presents a novel curriculum DPO method with synthetic negatives, improving hallucination detection in LLMs beyond existing techniques.
Findings
Up to 24% improvement on challenging benchmarks.
Enhanced zero-shot robustness over larger models.
Effective use of synthetic hallucination negatives.
Abstract
Aligning large language models (LLMs) to accurately detect hallucinations remains a significant challenge due to the sophisticated nature of hallucinated text. Recognizing that hallucinated samples typically exhibit higher deceptive quality than traditional negative samples, we use these carefully engineered hallucinations as negative examples in the DPO alignment procedure. Our method incorporates a curriculum learning strategy, gradually transitioning the training from easier samples, identified based on the greatest reduction in probability scores from independent fact-checking models, to progressively harder ones. This structured difficulty scaling ensures stable and incremental learning. Experimental evaluation demonstrates that our HaluCheck models, trained with a curriculum DPO approach and high-quality negative samples, significantly improve model performance across various…
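The curriculum described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the `PreferencePair` fields, the `build_curriculum` helper, the equal-size staging, and the default `beta` are all assumptions made for illustration. The sketch treats a pair as "easier" when the fact-checker score drops more between the grounded answer and the synthetic hallucination, and pairs that ordering with the standard per-pair DPO loss.

```python
import math
from dataclasses import dataclass


@dataclass
class PreferencePair:
    """One DPO training pair (hypothetical structure, not from the paper)."""
    prompt: str
    chosen: str            # grounded answer
    rejected: str          # synthetic hallucinated answer
    score_chosen: float    # fact-checker probability for the grounded answer
    score_rejected: float  # fact-checker probability for the hallucination


def score_drop(pair: PreferencePair) -> float:
    # A larger drop means the hallucination is easier to spot.
    return pair.score_chosen - pair.score_rejected


def build_curriculum(pairs, num_stages):
    """Order pairs from easiest to hardest and split them into stages.

    Equal-size stages are an assumption; the abstract only says that
    training moves gradually from easier to harder samples.
    """
    if num_stages < 1:
        raise ValueError("num_stages must be at least 1")
    if not pairs:
        return []
    ordered = sorted(pairs, key=score_drop, reverse=True)
    stage_size = math.ceil(len(ordered) / num_stages)
    return [ordered[i:i + stage_size] for i in range(0, len(ordered), stage_size)]


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard per-pair DPO loss: -log(sigmoid(beta * margin))."""
    margin = beta * ((policy_chosen_logp - ref_chosen_logp)
                     - (policy_rejected_logp - ref_rejected_logp))
    # Numerically stable softplus(-margin), equal to -log(sigmoid(margin)).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# Usage: three toy pairs with made-up fact-checker scores.
pairs = [
    PreferencePair("q1", "a1", "h1", score_chosen=0.9, score_rejected=0.1),
    PreferencePair("q2", "a2", "h2", score_chosen=0.6, score_rejected=0.5),
    PreferencePair("q3", "a3", "h3", score_chosen=0.8, score_rejected=0.3),
]
stages = build_curriculum(pairs, num_stages=2)
stage_prompts = [[p.prompt for p in stage] for stage in stages]
```

In a training loop under these assumptions, each stage would be consumed in order, so the model sees the most obviously hallucinated negatives first and the most deceptive ones last.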
Taxonomy
Methods: Direct Preference Optimization
