Parameterized Argumentation-based Reasoning Tasks for Benchmarking Generative Language Models

Cor Steging; Silja Renooij; Bart Verheij

arXiv:2505.01539·cs.AI·May 6, 2025

TL;DR

This paper introduces a scalable, unambiguous benchmark for evaluating the reasoning capabilities of large language models in legal contexts, revealing their brittleness and limitations even at low complexity levels.

Contribution

The paper presents a novel approach for creating dynamic, scalable reasoning benchmarks based on argument attack graphs, illustrated with witness testimony in legal scenarios.

Findings

01. State-of-the-art models often fail at low-complexity reasoning puzzles.

02. Models perform inconsistently and make obvious mistakes.

03. Higher-complexity puzzles further expose the brittleness of current models.

Abstract

Generative large language models as tools in the legal domain have the potential to improve the justice system. However, the reasoning behavior of current generative models is brittle and poorly understood, hence cannot be responsibly applied in the domains of law and evidence. In this paper, we introduce an approach for creating benchmarks that can be used to evaluate the reasoning capabilities of generative language models. These benchmarks are dynamically varied, scalable in their complexity, and have formally unambiguous interpretations. In this study, we illustrate the approach on the basis of witness testimony, focusing on the underlying argument attack structure. We dynamically generate both linear and non-linear argument attack graphs of varying complexity and translate these into reasoning puzzles about witness testimony expressed in natural language. We show that…
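
The pipeline the abstract describes (generate an attack graph, fix its formal interpretation, translate it into a witness puzzle) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not taken from the paper or its repository: the function names, the witness labels `W1`, `W2`, ..., the sentence template, and the choice of grounded semantics as the unambiguous reading of the graph.

```python
def linear_attack_graph(n):
    """Hypothetical generator: a chain in which witness i+1 attacks witness i."""
    names = [f"W{i}" for i in range(1, n + 1)]
    attacks = [(names[i + 1], names[i]) for i in range(n - 1)]
    return names, attacks


def grounded_extension(arguments, attacks):
    """Accepted arguments under grounded semantics (assumed interpretation).

    An argument is accepted once all of its attackers are rejected, and
    rejected once at least one of its attackers is accepted. Arguments that
    never reach either state (for example in a mutual attack) stay undecided.
    """
    attackers = {a: set() for a in arguments}
    for src, dst in attacks:
        attackers[dst].add(src)

    accepted, rejected = set(), set()
    changed = True
    while changed:
        changed = False
        for a in arguments:
            if a in accepted or a in rejected:
                continue
            if attackers[a] <= rejected:      # vacuously true if unattacked
                accepted.add(a)
                changed = True
            elif attackers[a] & accepted:
                rejected.add(a)
                changed = True
    return accepted


def to_puzzle(arguments, attacks):
    """Render the graph as a natural-language puzzle (illustrative template)."""
    lines = [f"{src} says that {dst} is lying." for src, dst in attacks]
    lines.append(f"Can we accept what {arguments[0]} says?")
    return " ".join(lines)


names, attacks = linear_attack_graph(3)
puzzle = to_puzzle(names, attacks)
answer = names[0] in grounded_extension(names, attacks)
```

For the three-witness chain, `W3` is unattacked and therefore accepted, which rejects `W2` and reinstates `W1`, so `answer` is `True`. Complexity scales with `n`, and a non-linear graph can be passed to `grounded_extension` as any list of `(attacker, target)` pairs.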

Code & Models

Repositories

corsteging/parameterizedargumentationbasedreasoningtasks (Official)

Taxonomy

Topics: Multi-Agent Systems and Negotiation · Model-Driven Software Engineering Techniques