Localizing and Mitigating Errors in Long-form Question Answering
Rachneet Sachdeva, Yixiao Song, Mohit Iyyer, Iryna Gurevych

TL;DR
This paper introduces HaluQuestQA, a dataset with localized error annotations for long-form question answering, and develops models and methods to detect, analyze, and reduce errors, significantly improving answer quality.
Contribution
The work presents the first hallucination dataset with span-level error annotations for LFQA, along with a feedback model and a prompt-based refinement approach to enhance answer accuracy.
Findings
Error-informed refinement reduces hallucinations in generated answers.
Humans prefer answers refined with our method over baseline answers (84%).
The dataset enables detailed analysis of LFQA errors.
Abstract
Long-form question answering (LFQA) aims to provide thorough and in-depth answers to complex questions, enhancing comprehension. However, such detailed responses are prone to hallucinations and factual inconsistencies, challenging their faithful evaluation. This work introduces HaluQuestQA, the first hallucination dataset with localized error annotations for human-written and model-generated LFQA answers. HaluQuestQA comprises 698 QA pairs with 1.8k span-level error annotations for five different error types by expert annotators, along with preference judgments. Using our collected data, we thoroughly analyze the shortcomings of long-form answers and find that they lack comprehensiveness and provide unhelpful references. We train an automatic feedback model on this dataset that predicts error spans with incomplete information and provides associated explanations. Finally, we propose a…
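To make "span-level error annotations" concrete, here is a minimal sketch of how one annotated answer might be represented in memory. It is not the HaluQuestQA release format. The class names (`ErrorSpan`, `AnnotatedAnswer`), the field layout, and the error-type strings `"completeness"` and `"references"` are all hypothetical; the abstract says only that there are five error types. The toy question and answer are illustrative, not taken from the dataset.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class ErrorSpan:
    # Character offsets into the answer text (start inclusive, end exclusive).
    start: int
    end: int
    # Hypothetical error-type label; the real dataset defines five types.
    error_type: str
    # Free-text justification from the annotator (hypothetical field).
    explanation: str = ""


@dataclass
class AnnotatedAnswer:
    question: str
    answer: str
    spans: list[ErrorSpan] = field(default_factory=list)

    def flagged_text(self) -> list[str]:
        """Return the exact substrings of the answer marked as erroneous."""
        return [self.answer[s.start:s.end] for s in self.spans]


def count_error_types(answers: list[AnnotatedAnswer]) -> Counter:
    """Tally span-level annotations per error type across a collection."""
    return Counter(s.error_type for a in answers for s in a.spans)


# Illustrative toy record (hypothetical, not from HaluQuestQA).
example = AnnotatedAnswer(
    question="Why is the sky blue?",
    answer="Sunlight scatters off air molecules. See [1].",
    spans=[
        ErrorSpan(0, 36, "completeness", "Omits why shorter wavelengths scatter more."),
        ErrorSpan(37, 45, "references", "Citation does not help the reader."),
    ],
)
```

Storing character offsets instead of copied strings keeps each annotation tied to an exact location in the answer. That location is what a span-predicting feedback model would be trained to recover.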
