ACCORD: Closing the Commonsense Measurability Gap
François Roewer-Després, Jinyue Feng, Zining Zhu, Frank Rudzicz

TL;DR
ACCORD introduces a scalable benchmark suite for evaluating and disentangling the reasoning and grounding abilities of large language models using controlled, multi-hop counterfactuals, revealing significant performance gaps.
Contribution
It provides a novel framework and benchmark suite that explicitly control reasoning complexity and automatically generate tests, enabling scalable evaluation of LLM reasoning capabilities.
Findings
State-of-the-art LLMs' performance degrades to random chance with only moderate increases in reasoning complexity.
ACCORD can automatically generate benchmarks of arbitrary reasoning complexity, so it scales with future LLM improvements.
Substantial room for improvement remains in LLM reasoning abilities.
Abstract
We present ACCORD, a framework and benchmark suite for disentangling the commonsense grounding and reasoning abilities of large language models (LLMs) through controlled, multi-hop counterfactuals. ACCORD introduces formal elements to commonsense reasoning to explicitly control and quantify reasoning complexity beyond the typical 1 or 2 hops. Uniquely, ACCORD can automatically generate benchmarks of arbitrary reasoning complexity, and so it scales with future LLM improvements. Benchmarking state-of-the-art LLMs -- including GPT-4o (2024-05-13), Llama-3-70B-Instruct, and Mixtral-8x22B-Instruct-v0.1 -- shows performance degrading to random chance with only moderate scaling, leaving substantial headroom for improvement. We release a leaderboard of the benchmark suite tested in this work, as well as code for automatically generating more complex benchmarks.
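To make the idea of a benchmark with controllable hop count concrete, here is a toy sketch in Python. It is not ACCORD's generator: the function names (`make_chain_item`, `solve`), the nonsense entity names, and the single-chain structure are all assumptions made for illustration. It shows only how a generator can fix the number of reasoning hops an item requires while keeping the content detached from real-world knowledge.

```python
import random


def make_chain_item(n_hops, n_distractors=2, seed=0):
    """Build one toy question whose answer needs exactly n_hops lookups.

    Hypothetical helper for illustration; not part of the ACCORD code release.
    """
    rng = random.Random(seed)
    # Made-up entity names, so no prior knowledge can shortcut the chain.
    names = [f"E{i}" for i in range(n_hops + 1 + 2 * n_distractors)]
    rng.shuffle(names)
    chain = names[: n_hops + 1]
    rest = names[n_hops + 1:]
    # Facts on the reasoning path: chain[0] -> chain[1] -> ... -> chain[n_hops].
    facts = [(chain[i], chain[i + 1]) for i in range(n_hops)]
    # Distractor facts use only leftover entities, so they never touch the path.
    distractors = [(rest[2 * i], rest[2 * i + 1]) for i in range(n_distractors)]
    statements = facts + distractors
    rng.shuffle(statements)
    return {
        "hops": n_hops,
        "statements": statements,
        "start": chain[0],
        "answer": chain[-1],
    }


def solve(item):
    """Follow the facts from the start entity and count the hops taken."""
    lookup = dict(item["statements"])
    node, steps = item["start"], 0
    while node in lookup:
        node = lookup[node]
        steps += 1
    return node, steps


if __name__ == "__main__":
    for hops in range(1, 6):
        item = make_chain_item(hops, seed=hops)
        answer, steps = solve(item)
        print(hops, steps, answer == item["answer"])
```

Because difficulty is set by a single parameter (`n_hops` here), a generator of this kind can emit harder items on demand, which is the property the abstract describes as scaling with future LLM improvements.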
Taxonomy
Topics: Scientific Computing and Data Management · Semantic Web and Ontologies
