Deep Research, Shallow Evaluation: A Case Study in Meta-Evaluation for Long-Form QA Benchmarks

Jena D. Hwang; Varsha Kishore; Amanpreet Singh; Dany Haddad; Aakanksha Naik; Malachi Hamada; Jonathan Bragg; Mike D'Arcy; Daniel S. Weld; Lucy Lu Wang; Doug Downey; Sergey Feldman

arXiv:2603.06942·cs.CL·March 10, 2026

Open Access

TL;DR

This paper critically examines the effectiveness of human pairwise preferences in meta-evaluating long-form QA benchmarks, highlighting limitations and proposing guidelines for improved evaluation practices in scientific report generation systems.

Contribution

It provides a comprehensive case study on meta-evaluation methods for scientific QA benchmarks, revealing the strengths and weaknesses of pairwise preferences and offering practical guidelines for future evaluations.

Findings

01 Pairwise preferences are effective at system-level evaluation.

02 Explicit metric annotations are crucial for metric-level assessment.

03 Subjectivity remains a key challenge in human evaluations.

Abstract

Recent advances have made long-form report-generating systems widely available. This has prompted evaluation frameworks that use LLM-as-judge protocols and claim verification, along with meta-evaluation frameworks that seek to validate these methods. Many of these meta-evaluations estimate an evaluation method's quality by comparing its assessments against human pairwise preferences. Prior work, however, suggests that human pairwise preferences may be overly simplistic and can fail to capture the nuances of expert expectations. We conduct a case study in meta-evaluation for long-form QA benchmarks using ScholarQA-CS2, a benchmark designed for assessing retrieval-augmented deep-research QA in the scientific domain. We comprehensively validate the benchmark through human pairwise preference judgments, then critically examine the strengths, weaknesses, and confounders of this approach. We show that…
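The abstract describes meta-evaluation as comparing an evaluation method's assessments against human pairwise preferences. The Python sketch below shows one simple way such a comparison could be computed, as the fraction of human preferences that a metric's scores reproduce. The data layout, the function name `pairwise_agreement`, the example scores, and the half-credit rule for ties are all assumptions made for illustration; they are not the paper's protocol or the ScholarQA-CS2 scoring.

```python
from typing import Dict, List, Tuple

# Hypothetical layout: (question_id, system_name) -> score from the metric under test.
Scores = Dict[Tuple[str, str], float]
# Hypothetical layout: (question_id, system_a, system_b, system the human preferred).
Preference = Tuple[str, str, str, str]


def pairwise_agreement(scores: Scores, preferences: List[Preference]) -> float:
    """Return the fraction of human pairwise preferences the metric reproduces."""
    if not preferences:
        return float("nan")  # no judgments, so agreement is undefined
    agree = 0.0
    for question, system_a, system_b, preferred in preferences:
        score_a = scores[(question, system_a)]
        score_b = scores[(question, system_b)]
        if score_a == score_b:
            # Assumption: a metric tie earns half credit.
            agree += 0.5
        else:
            metric_pick = system_a if score_a > score_b else system_b
            if metric_pick == preferred:
                agree += 1.0
    return agree / len(preferences)


# Made-up example data, for illustration only.
example_scores: Scores = {
    ("q1", "A"): 0.8, ("q1", "B"): 0.6,
    ("q2", "A"): 0.4, ("q2", "B"): 0.7,
    ("q3", "A"): 0.5, ("q3", "B"): 0.5,
}
example_preferences: List[Preference] = [
    ("q1", "A", "B", "A"),  # metric and human both favor A
    ("q2", "A", "B", "A"),  # metric favors B, human favors A
    ("q3", "A", "B", "B"),  # metric ties
]
example_agreement = pairwise_agreement(example_scores, example_preferences)
```

A single agreement number of this kind collapses each judgment into a win or a loss, which is consistent with the concern raised in the abstract that pairwise preferences may miss nuances of expert expectations.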

Taxonomy

Topics: Expert finding and Q&A systems · Topic Modeling · Mobile Crowdsensing and Crowdsourcing