LFQA-E: Carefully Benchmarking Long-form QA Evaluation

Yuchen Fan; Chen Lin; Xin Zhong; Shuo Zhang; Heng Zhou; Yuchen Zhang; Mingyu Liang; Chengxing Xie; Ermo Hua; Gang Chen; Zhizhou He; Cheng Huang; Ning Ding; Bowen Zhou

arXiv:2410.01945·cs.CL·February 3, 2026

Open Access · 3 Reviews

TL;DR

LFQA-E is a comprehensive multilingual benchmark for evaluating automatic metrics in long-form question answering, revealing current metrics' limitations in matching human judgment and guiding future improvements.

Contribution

Introduces LFQA-E, a large, diverse, reference-based benchmark for assessing LFQA evaluation metrics, filling a critical gap in the field.

Findings

01 Existing metrics do not match human judgments well.

02 Most metrics fail to capture dense information in long responses.

03 LFQA-E enables thorough evaluation of evaluation metrics.

Abstract

Long-Form Question Answering (LFQA) involves generating comprehensive, paragraph-level responses to open-ended questions, which poses a significant challenge for evaluation due to the richness of information and flexible response format. Existing LFQA-evaluation benchmarks often lack reference answers and are limited in size and topic coverage, reducing their reliability. To address this gap, we introduce LFQA-E, a well-constructed, multilingual, and reference-based benchmark designed to rigorously evaluate automatic metrics for LFQA. LFQA-E comprises 1618 questions and 7323 pairwise comparisons across 15 topics, drawn from diverse sources such as online queries and examination questions, thereby enabling a comprehensive assessment of evaluation metrics. We examine five categories of metrics, encompassing 17 specific methods, using LFQA-E. The results demonstrate that none of the…
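To illustrate what it means to test a metric against pairwise comparisons, here is a minimal sketch. It assumes each comparison holds a reference answer, two candidate answers, and the human preference. The `Comparison` schema, the toy `token_f1` metric, and the example items are all hypothetical; they are not the dataset's actual format, and `token_f1` is not claimed to be one of the 17 methods the paper examines.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Comparison:
    """One pairwise item (hypothetical schema, not the dataset's real format)."""
    reference: str
    answer_a: str
    answer_b: str
    human_label: str  # "a" or "b": the answer the human annotators preferred


def token_f1(reference: str, answer: str) -> float:
    """Toy reference-based metric: unigram-overlap F1 on whitespace tokens."""
    ref_tokens = reference.lower().split()
    ans_tokens = answer.lower().split()
    if not ref_tokens or not ans_tokens:
        return 0.0
    # Multiset intersection counts each shared token at most as often
    # as it appears in both texts.
    overlap = sum((Counter(ref_tokens) & Counter(ans_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(ans_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


def pairwise_accuracy(
    items: List[Comparison], metric: Callable[[str, str], float]
) -> float:
    """Fraction of comparisons where the metric prefers the same answer as
    the human annotators. Ties are counted as disagreements."""
    if not items:
        raise ValueError("need at least one comparison")
    correct = 0
    for item in items:
        score_a = metric(item.reference, item.answer_a)
        score_b = metric(item.reference, item.answer_b)
        if score_a == score_b:
            continue  # the metric expresses no preference
        predicted = "a" if score_a > score_b else "b"
        correct += predicted == item.human_label
    return correct / len(items)


# Made-up examples for illustration only.
items = [
    Comparison(
        reference="the sky is blue because air scatters short wavelengths",
        answer_a="air scatters short blue wavelengths",
        answer_b="the ocean reflects onto the sky",
        human_label="a",
    ),
    Comparison(
        reference="vaccines train the immune system to recognise a pathogen",
        answer_a="vaccines are medicine",
        answer_b="vaccines train the immune system to recognise a pathogen early",
        human_label="b",
    ),
    Comparison(
        reference="water boils at a lower temperature at altitude because air pressure is lower",
        answer_a="lower air pressure means less energy is needed",
        answer_b="water boils at a lower temperature because altitude is magic",
        human_label="a",
    ),
]

accuracy = pairwise_accuracy(items, token_f1)
```

In this toy set the overlap metric agrees with the human label on the first two items but prefers the lexically similar yet wrong answer in the third, which is the kind of mismatch with human judgment that a benchmark of this type is meant to surface.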

Peer Reviews

Decision · ICLR 2026 Poster

Reviewer 01 · Rating 4 · Confidence 3

Strengths

S1) The paper addresses an important, underexplored issue: reliable automatic evaluation of paragraph-length, information-dense LFQA answers. The motivation and the relation to prior small or weak benchmarks are convincing.

S2) The paper offers a substantial and diverse new dataset for LFQA evaluation, with 1,625 questions and 7.6k comparisons, multilinguality (EN/ZH), and topic breadth. Statistics and careful sourcing are provided.

S3) The annotation recipe is thorough. Using experts, double

Weaknesses

O1) The English references use the “top-ranked ELI5 answer” as candidate references. Top Reddit answers are not guaranteed to be expert or correct; relying on them (even after expert review) risks variance in reference quality. Although the authors state that experts reviewed the references, more detail is needed: how often did experts modify Reddit content, and are there quantitative measures of reference quality across sources?

O2) Annotators are paid $2 per question (4–6 comparisons). This low pay risks rushed judg

Reviewer 02 · Rating 8 · Confidence 4

Strengths

The main strengths of the paper are as follows:

1. The paper identifies important limitations of existing LFQA benchmarks, such as the lack of authorized references and limited diversity, and proposes a new large-scale benchmark for better LFQA evaluation.

2. The benchmark is very diverse and spans 15 topics and domains.

3. Validation by human experts makes the benchmark trustworthy.

4. Detailed comparison of standard evaluation metrics on a variety of reasoning models using frontier models li

Weaknesses

The main weaknesses of the paper are as follows:

1. Comparisons with evaluation results on previous LFQA benchmarks using frontier models are missing.

2. The writing and presentation of the paper can be improved.

Reviewer 03 · Rating 4 · Confidence 4

Strengths

The goal of effectively evaluating long responses is clearly an important one for AI/NLP.

Weaknesses

Assuming that Table 1 reports tokens rather than characters (is that right?), 180-200 tokens is not much compared to what modern models can generate. For your LLM-based evaluation metrics, have you tried using a prompt that involves generating intermediate thoughts before concluding with a final answer? For reasoning models, can you please report the quality metrics as a function of the thinking budget? There have been various benchmarks for reward modeling (input Q, and two responses with prefer

Taxonomy

Topics: Pharmacy and Medical Practices · Fuzzy Logic and Control Systems