TL;DR
LongSumEval introduces a question-answering-based framework for evaluating and improving long document summaries, providing interpretable scores and actionable feedback aligned with human judgments.
Contribution
LongSumEval presents a novel QA-based evaluation method that correlates more strongly with human judgments than existing metrics and enables self-refinement without retraining.
Findings
QA-based evaluation outperforms existing metrics in agreement with human judgments.
Structured feedback facilitates significant quality improvements through self-refinement.
Evaluation feedback can be used as executable instructions to guide generation.
Abstract
Evaluating long document summaries remains the primary bottleneck in summarization research. Existing metrics correlate weakly with human judgments and produce aggregate scores without explaining deficiencies or guiding improvement, preventing effective refinement in applications requiring verifiable accuracy. We introduce LongSumEval, a unified framework bridging evaluation and generation through structured question-answering feedback. The framework operationalizes summary quality as answerability and factual alignment of question-answer pairs, generating interpretable scores and actionable feedback that identifies coverage gaps and factual inconsistencies. This resolves the misalignment where evaluation operates independently of generation objectives. Meta-evaluation of our QA-based evaluation module across seven benchmarks demonstrates substantially stronger agreement with human…
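To make the idea of scoring a summary by answerability and factual alignment concrete, here is a minimal sketch. It is not the paper's implementation: the names QAResult and score_summary are hypothetical, the question-answer pairs are assumed to have been produced elsewhere (for example by a language model), and exact string matching stands in for whatever answer-comparison method the authors actually use.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class QAResult:
    """One question with its answers (hypothetical structure, for illustration)."""
    question: str
    reference_answer: str          # answer obtained from the source document
    summary_answer: Optional[str]  # answer obtained from the summary; None if unanswerable


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial differences do not count."""
    return " ".join(text.lower().split())


def score_summary(results: list[QAResult]) -> dict:
    """Compute two illustrative scores plus textual feedback.

    answerability:     share of questions the summary can answer at all
    factual_alignment: share of answered questions whose answer matches the source
    Exact match after normalization is an assumption made for this sketch.
    """
    answered = [r for r in results if r.summary_answer is not None]
    consistent = [
        r for r in answered
        if normalize(r.summary_answer) == normalize(r.reference_answer)
    ]

    feedback = []
    for r in results:
        if r.summary_answer is None:
            # The summary omits information needed to answer the question.
            feedback.append(f"Coverage gap: the summary does not answer: {r.question}")
        elif normalize(r.summary_answer) != normalize(r.reference_answer):
            # The summary answers the question but disagrees with the source.
            feedback.append(
                f"Factual inconsistency on: {r.question} "
                f"(source: {r.reference_answer}; summary: {r.summary_answer})"
            )

    return {
        "answerability": len(answered) / len(results) if results else 0.0,
        "factual_alignment": len(consistent) / len(answered) if answered else 0.0,
        "feedback": feedback,
    }
```

In this sketch the feedback list is what a refinement step would consume: each entry names either a missing piece of content or a specific disagreement with the source, which is more actionable than a single aggregate score.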
