ConQRet: Benchmarking Fine-Grained Evaluation of Retrieval Augmented Argumentation with LLM Judges

Kaustubh D. Dhole; Kai Shu; Eugene Agichtein

arXiv:2412.05206·cs.CL·December 9, 2024

TL;DR

This paper introduces ConQRet, a benchmark for evaluating retrieval-augmented argumentation using multiple fine-grained LLM judges, addressing the limitations of existing datasets and evaluation methods for complex, evidence-based arguments.

Contribution

The paper proposes a new benchmark and an automated evaluation framework that uses multiple LLM judges to assess retrieval effectiveness and argument quality in complex, real-world scenarios.

Findings

01 LLM judges outperform traditional metrics and crowdsourcing in evaluation accuracy.

02 ConQRet provides a comprehensive dataset of long, complex arguments grounded in real-world evidence.

03 Automated evaluation methods show promise for advancing computational argumentation research.

Abstract

Computational argumentation, which involves generating answers or summaries for controversial topics like abortion bans and vaccination, has become increasingly important in today's polarized environment. Sophisticated LLM capabilities offer the potential to provide nuanced, evidence-based answers to such questions through Retrieval-Augmented Argumentation (RAArg), leveraging real-world evidence for high-quality, grounded arguments. However, evaluating RAArg remains challenging, as human evaluation is costly and difficult for complex, lengthy answers on complicated topics. At the same time, re-using existing argumentation datasets is no longer sufficient, as they lack long, complex arguments and realistic evidence from potentially misleading sources, limiting holistic evaluation of retrieval effectiveness and argument quality. To address these gaps, we investigate automated evaluation…

Taxonomy

Topics: Artificial Intelligence in Law · Legal Education and Practice Innovations · Law, Economics, and Judicial Systems