Parameterized Argumentation-based Reasoning Tasks for Benchmarking Generative Language Models
Cor Steging, Silja Renooij, Bart Verheij

TL;DR
This paper introduces a scalable, unambiguous benchmark for evaluating the reasoning capabilities of large language models in legal contexts, revealing their brittleness and limitations even at low complexity levels.
Contribution
It presents a novel approach to create dynamic, scalable reasoning benchmarks based on argument attack graphs, specifically applied to witness testimony in legal scenarios.
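As an illustration of how a linear attack graph could become a witness puzzle, here is a minimal sketch. The function name `linear_puzzle`, the sentence templates, and the claim about the suspect are all hypothetical and are not taken from the paper. It assumes a chain in which each witness attacks only the testimony of the previous one.

```python
def linear_puzzle(n_witnesses: int) -> tuple[list[str], bool]:
    """Build a chain puzzle in which each witness attacks the previous one.

    Returns the puzzle sentences and whether the first witness's
    testimony is ultimately accepted. Templates are illustrative only.
    """
    if n_witnesses < 1:
        raise ValueError("need at least one witness")
    names = [f"Witness {i + 1}" for i in range(n_witnesses)]
    sentences = [f"{names[0]} says that the suspect was at the scene."]
    # Pair each later witness with the witness whose testimony they attack.
    for attacker, target in zip(names[1:], names):
        sentences.append(f"{attacker} says that {target} is lying.")
    # The last witness is unattacked and therefore accepted; acceptance
    # then alternates back along the chain, so the first witness is
    # accepted exactly when the chain has odd length.
    first_accepted = n_witnesses % 2 == 1
    return sentences, first_accepted
```

Increasing `n_witnesses` lengthens the chain, which is one simple way to scale complexity while keeping the correct answer unambiguous.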
Findings
State-of-the-art models often fail at low-complexity reasoning puzzles.
Models exhibit inconsistent performance and make obvious mistakes.
Higher-complexity puzzles further expose the brittleness of current models.
Abstract
Generative large language models as tools in the legal domain have the potential to improve the justice system. However, the reasoning behavior of current generative models is brittle and poorly understood, and hence these models cannot be responsibly applied in the domains of law and evidence. In this paper, we introduce an approach for creating benchmarks that can be used to evaluate the reasoning capabilities of generative language models. These benchmarks are dynamically varied, scalable in their complexity, and have formally unambiguous interpretations. In this study, we illustrate the approach on the basis of witness testimony, focusing on the underlying argument attack structure. We dynamically generate both linear and non-linear argument attack graphs of varying complexity and translate these into reasoning puzzles about witness testimony expressed in natural language. We show that…
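One standard way to give an argument attack graph an unambiguous interpretation is grounded semantics from abstract argumentation. The sketch below assumes that semantics; the abstract does not say which evaluation procedure the authors use, and the function name `grounded_labelling` and the example graph are hypothetical. An argument is labelled "in" when all of its attackers are "out", and "out" when at least one attacker is "in". On acyclic graphs, linear or non-linear, every argument ends up "in" or "out".

```python
def grounded_labelling(arguments, attacks):
    """Compute the grounded labelling of an attack graph.

    arguments: iterable of hashable argument identifiers.
    attacks:   iterable of (attacker, target) pairs.
    Returns a dict mapping each argument to "in", "out", or "undec".
    """
    arguments = list(arguments)
    attackers = {a: set() for a in arguments}
    for src, dst in attacks:
        attackers[dst].add(src)

    label = {}
    changed = True
    while changed:
        changed = False
        for a in arguments:
            if a in label:
                continue
            if all(label.get(b) == "out" for b in attackers[a]):
                # Unattacked, or every attacker is already defeated.
                label[a] = "in"
                changed = True
            elif any(label.get(b) == "in" for b in attackers[a]):
                # Defeated by at least one accepted attacker.
                label[a] = "out"
                changed = True
    # Anything left unlabelled sits on an unresolved cycle.
    return {a: label.get(a, "undec") for a in arguments}


# Non-linear example: B and C both attack A, and D attacks B.
labels = grounded_labelling(
    ["A", "B", "C", "D"],
    [("B", "A"), ("C", "A"), ("D", "B")],
)
```

In this example D and C are unattacked and so accepted, B is defeated by D, and A is still defeated because C remains accepted. A benchmark generator could compare a model's answer against such a labelling.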
Taxonomy
Topics: Multi-Agent Systems and Negotiation · Model-Driven Software Engineering Techniques
