Evaluating Correctness and Faithfulness of Instruction-Following Models for Question Answering
Vaibhav Adlakha, Parishad BehnamGhader, Xing Han Lu, Nicholas Meade, Siva Reddy

TL;DR
This paper evaluates instruction-following models for question answering, highlighting their strengths in correctness but also their tendency to hallucinate, and proposes improved metrics for more accurate assessment.
Contribution
It introduces new evaluation metrics for correctness and faithfulness, addressing limitations of traditional metrics, and provides a comprehensive analysis of these models' performance.
Findings
Instruction-following models are competitive with fine-tuned models in correctness.
Models often hallucinate and deviate from provided knowledge.
Proposed metrics better reflect true model performance.
Abstract
Retriever-augmented instruction-following models are attractive alternatives to fine-tuned approaches for information-seeking tasks such as question answering (QA). By simply prepending retrieved documents in their input along with an instruction, these models can be adapted to various information domains and tasks without additional fine-tuning. While the model responses tend to be natural and fluent, the additional verbosity makes traditional QA evaluation metrics such as exact match (EM) and F1 unreliable for accurately quantifying model performance. In this work, we investigate the performance of instruction-following models across three information-seeking QA tasks. We use both automatic and human evaluation to evaluate these models along two dimensions: 1) how well they satisfy the user's information need (correctness), and 2) whether they produce a response based on the provided…
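To illustrate why verbosity makes EM and F1 unreliable, the following is a minimal sketch of both metrics under a common SQuAD-style normalization (lowercasing, stripping punctuation and English articles). The normalization rules, the function names, and the example question are assumptions made for illustration; they are not taken from the paper's evaluation code.

```python
import re
import string
from collections import Counter


def normalize(text: str) -> list[str]:
    """Lowercase, drop punctuation and English articles, split on whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return text.split()


def exact_match(prediction: str, reference: str) -> float:
    """1.0 if the normalized token sequences are identical, else 0.0."""
    return float(normalize(prediction) == normalize(reference))


def token_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of token-level precision and recall."""
    pred, ref = normalize(prediction), normalize(reference)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


# Hypothetical example: both responses answer the question correctly.
reference = "Paris"
short_answer = "Paris"
verbose_answer = "The capital of France is Paris."

for name, answer in [("short", short_answer), ("verbose", verbose_answer)]:
    em = exact_match(answer, reference)
    f1 = token_f1(answer, reference)
    print(f"{name}: EM={em:.2f} F1={f1:.2f}")
```

In this toy case the verbose response contains the reference answer, yet it scores an EM of 0 and an F1 of roughly 0.33, because the extra tokens lower token-level precision. The short response scores 1.0 on both.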
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Information Retrieval and Search Behavior
