On the Evaluation of Machine-Generated Reports
James Mayfield, Eugene Yang, Dawn Lawrie, Sean MacAvaney, Paul McNamee, Douglas W. Oard, Luca Soldaini, Ian Soboroff, Orion Weller, Efsun Kayi, Kate Sanders, Marc Mason, Noah Hibbler

TL;DR
This paper discusses the challenges of automatically generating long, accurate, and verifiable reports using LLMs, proposing a flexible evaluation framework based on information nuggets and citation verification.
Contribution
It proposes an evaluation framework for report generation that emphasizes completeness, accuracy, and verifiability, the qualities that current LLMs still struggle to deliver in long-form output.
Findings
Framework uses question-answer nuggets to assess completeness and accuracy.
Citation evaluation ensures claims are verifiable against sources.
Highlights need for new evaluation methods for high-quality report generation.
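The findings above can be made concrete with a small sketch. It is not the authors' implementation: the names `Nugget`, `Sentence`, `nugget_recall`, `citation_precision`, and `toy_supports` are hypothetical, the bridge data is invented for illustration, and plain substring matching stands in for the human or model judgments a real evaluation would need. It shows only the general shape: nuggets measure how much required information a report covers, and citation checks measure whether cited sources back the sentences that cite them.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Nugget:
    """A question the report should answer, with acceptable answer strings."""
    question: str
    answers: List[str]


@dataclass
class Sentence:
    """One report sentence and the IDs of the documents it cites."""
    text: str
    cited_doc_ids: List[str]


def nugget_recall(report: List[Sentence], nuggets: List[Nugget]) -> float:
    """Fraction of nuggets with at least one answer string found in the report."""
    if not nuggets:
        return 0.0
    full_text = " ".join(s.text.lower() for s in report)
    found = sum(
        1 for n in nuggets if any(a.lower() in full_text for a in n.answers)
    )
    return found / len(nuggets)


def toy_supports(doc_text: str, sentence_text: str) -> bool:
    """Assumed stand-in judge: the sentence (minus its final period) occurs in the document."""
    return sentence_text.lower().rstrip(".") in doc_text.lower()


def citation_precision(
    report: List[Sentence],
    collection: Dict[str, str],
    supports: Callable[[str, str], bool],
) -> float:
    """Fraction of citing sentences backed by at least one of their cited documents."""
    cited = [s for s in report if s.cited_doc_ids]
    if not cited:
        return 0.0
    backed = sum(
        1
        for s in cited
        if any(supports(collection.get(d, ""), s.text) for d in s.cited_doc_ids)
    )
    return backed / len(cited)


# Invented example data, for illustration only.
collection = {
    "d1": "The bridge opened in 1932 and spans 503 metres.",
    "d2": "Local officials praised the design.",
}
report = [
    Sentence("The bridge opened in 1932.", ["d1"]),
    Sentence("It cost ten million dollars.", ["d2"]),
    Sentence("Officials liked it.", []),
]
nuggets = [
    Nugget("When did the bridge open?", ["1932"]),
    Nugget("How long is the bridge?", ["503 metres"]),
]

print(nugget_recall(report, nuggets))
print(citation_precision(report, collection, toy_supports))
```

In this toy case the report answers one of the two nuggets, and one of its two citing sentences is backed by the cited document. Passing the judge in as a function keeps the scoring logic separate from the matching method, so a stricter judge could be swapped in without changing the metrics.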
Abstract
Large Language Models (LLMs) have enabled new ways to satisfy information needs. Although great strides have been made in applying them to settings like document ranking and short-form text generation, they still struggle to compose complete, accurate, and verifiable long-form reports. Reports with these qualities are necessary to satisfy the complex, nuanced, or multi-faceted information needs of users. In this perspective paper, we draw together opinions from industry and academia, and from a variety of related research areas, to present our vision for automatic report generation, and -- critically -- a flexible framework by which such reports can be evaluated. In contrast with other summarization tasks, automatic report generation starts with a detailed description of an information need, stating the necessary background, requirements, and scope of the report. Further, the generated…
