Controlled Automatic Task-Specific Synthetic Data Generation for Hallucination Detection
Yong Xie, Karan Aggarwal, Aitzaz Ahmad, Stephen Lau

TL;DR
This paper introduces a two-step synthetic data generation method for hallucination detection, improving detector robustness and generalization across tasks and generators.
Contribution
It proposes a novel task-specific synthetic data generation pipeline with pattern guidance and style alignment, enhancing hallucination detector training.
Findings
Synthetic datasets improve hallucination detector accuracy.
Detectors trained on synthetic data outperform ICL-based detectors by 32%.
The method generalizes well across different tasks and generators.
Abstract
We present a novel approach to automatically generate non-trivial task-specific synthetic datasets for hallucination detection. Our approach features a two-step generation-selection pipeline that uses hallucination pattern guidance and language style alignment during generation. Hallucination pattern guidance leverages the most important task-specific hallucination patterns, while language style alignment aligns the style of the synthetic dataset with benchmark text. To obtain robust supervised detectors from synthetic datasets, we also adopt a data mixture strategy to improve performance robustness and generalization. Our results on three datasets show that our generated hallucination text is more closely aligned with non-hallucinated text than that of baselines, and can be used to train hallucination detectors with better generalization. Our hallucination detectors trained on synthetic datasets outperform…
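The generation-selection idea and the data mixture step described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the Jaccard token-overlap score, the rule of keeping the candidate closest to the non-hallucinated reference, and the helper names `jaccard`, `select_candidate`, and `mix_datasets` are all assumptions made for illustration. In practice the candidates would come from an LLM prompted with hallucination patterns and style examples; here they are hard-coded strings.

```python
import random
from typing import List, Optional, Tuple


def jaccard(a: str, b: str) -> float:
    """Token-level Jaccard similarity (an assumed stand-in for a real selection score)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def select_candidate(reference: str, candidates: List[str]) -> Optional[str]:
    """Selection step (assumed criterion): keep the candidate that differs from
    the reference but is closest to it, so the example is non-trivial."""
    distinct = [c for c in candidates if c.strip() != reference.strip()]
    if not distinct:
        return None
    return max(distinct, key=lambda c: jaccard(reference, c))


def mix_datasets(
    datasets: List[List[Tuple[str, int]]], rng: random.Random
) -> List[Tuple[str, int]]:
    """Data mixture (assumed form): pool labeled examples from several
    synthetic datasets and shuffle them into one training set."""
    mixed = [example for dataset in datasets for example in dataset]
    rng.shuffle(mixed)
    return mixed


# Hypothetical candidates standing in for LLM generations.
reference = "the meeting is on monday at noon"
candidates = [
    "the meeting is on friday at noon",   # subtle change, hard to detect
    "a cat sat on the mat",               # off-topic, trivial to detect
    "the meeting is on monday at noon",   # identical, not a hallucination
]
chosen = select_candidate(reference, candidates)

# Label 1 = hallucinated, 0 = non-hallucinated.
dataset_a = [(reference, 0), (chosen, 1)]
dataset_b = [("the invoice total is 40 dollars", 0), ("the invoice total is 45 dollars", 1)]
training_set = mix_datasets([dataset_a, dataset_b], random.Random(0))
```

A supervised detector would then be trained on `training_set`; the choice of similarity score and mixture proportions is left open here because the summary does not specify them.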
Taxonomy
Topics: Mental Health Research Topics · Functional Brain Connectivity Studies · Anomaly Detection Techniques and Applications
