Long-Span Question-Answering: Automatic Question Generation and QA-System Ranking via Side-by-Side Evaluation

Bernd Bohnet; Kevin Swersky; Rosanne Liu; Pranjal Awasthi; Azade Nova; Javier Snaider; Hanie Sedghi; Aaron T Parisi; Michael Collins; Angeliki Lazaridou; Orhan Firat; Noah Fiedel

arXiv:2406.00179 · cs.CL · June 4, 2024

Open Access

TL;DR

This paper leverages large language models' long-context capabilities to automatically generate and evaluate reading comprehension questions from entire books, demonstrating improved performance over baseline methods.

Contribution

It introduces a holistic pipeline for automatic question generation, answering, and model evaluation using pairwise comparison and a Bradley-Terry ranking model, advancing long-span comprehension assessment.
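As a rough illustration of the ranking step, the sketch below fits a Bradley-Terry model to pairwise outcomes, where the probability that system i beats system j is s_i / (s_i + s_j). It is not the authors' implementation: the system names `A`, `B`, `C`, the win counts, and the function name `bradley_terry` are hypothetical, ties are ignored, and the fit uses the standard minorization-maximization update rather than whatever optimizer the paper used.

```python
from collections import defaultdict


def bradley_terry(comparisons, n_iter=500, tol=1e-10):
    """Fit Bradley-Terry strengths from (winner, loser) pairs.

    Uses the classic MM update: s_i <- wins_i / sum_j n_ij / (s_i + s_j),
    then rescales so the strengths sum to the number of systems.
    """
    wins = defaultdict(float)
    pair_counts = defaultdict(float)
    systems = set()
    for winner, loser in comparisons:
        wins[winner] += 1.0
        pair_counts[frozenset((winner, loser))] += 1.0
        systems.update((winner, loser))
    systems = sorted(systems)
    if not systems:
        return {}

    strength = {s: 1.0 for s in systems}
    for _ in range(n_iter):
        updated = {}
        for i in systems:
            denom = 0.0
            for j in systems:
                if i == j:
                    continue
                n_ij = pair_counts.get(frozenset((i, j)), 0.0)
                if n_ij:
                    denom += n_ij / (strength[i] + strength[j])
            updated[i] = wins.get(i, 0.0) / denom if denom > 0 else strength[i]
        # Strengths are only identified up to scale, so normalize each round.
        total = sum(updated.values())
        updated = {s: v * len(systems) / total for s, v in updated.items()}
        delta = max(abs(updated[s] - strength[s]) for s in systems)
        strength = updated
        if delta < tol:
            break
    return strength


# Hypothetical side-by-side outcomes: each tuple is (preferred, other).
comparisons = (
    [("A", "B")] * 8 + [("B", "A")] * 2
    + [("A", "C")] * 9 + [("C", "A")] * 1
    + [("B", "C")] * 7 + [("C", "B")] * 3
)
strengths = bradley_terry(comparisons)
ranking = sorted(strengths, key=strengths.get, reverse=True)
print(ranking)
```

The fitted strengths give a single ordering of systems even when not every pair has been compared equally often, which is the usual reason to prefer this over raw win counts.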

Findings

01

Pairwise model comparison yields more consistent scores.

02

LLMs show moderate agreement in answer ratings.

03

Using entire books as context improves comprehension performance.

Abstract

We explore the use of long-context capabilities in large language models to create synthetic reading comprehension data from entire books. Previous efforts to construct such datasets relied on crowd-sourcing, but the emergence of transformers with a context size of 1 million or more tokens now enables entirely automatic approaches. Our objective is to test the capabilities of LLMs to analyze, understand, and reason over problems that require a detailed comprehension of long spans of text, such as questions involving character arcs, broader themes, or the consequences of early actions later in the story. We propose a holistic pipeline for automatic data generation including question generation, answering, and model scoring using an "Evaluator". We find that a relative approach, comparing answers between models in a pairwise fashion and ranking with a Bradley-Terry model, provides a…

Taxonomy

Topics: Topic Modeling · Advanced Text Analysis Techniques · Natural Language Processing Techniques