ACCORD: Closing the Commonsense Measurability Gap

François Roewer-Després; Jinyue Feng; Zining Zhu; Frank Rudzicz

arXiv:2406.02804·cs.AI·February 10, 2025

TL;DR

ACCORD introduces a scalable benchmark suite for evaluating and disentangling the commonsense grounding and reasoning abilities of large language models using controlled, multi-hop counterfactuals. State-of-the-art models degrade to random chance under only moderate increases in reasoning complexity.

Contribution

ACCORD provides a framework and benchmark suite that explicitly control reasoning complexity and automatically generate tests, enabling scalable evaluation of LLM reasoning capabilities.

Findings

01

State-of-the-art LLMs' performance drops to random chance with only moderate increases in reasoning complexity.

02

ACCORD's benchmarks can be scaled to arbitrary reasoning levels.

03

Substantial room for improvement remains in LLM reasoning abilities.

Abstract

We present ACCORD, a framework and benchmark suite for disentangling the commonsense grounding and reasoning abilities of large language models (LLMs) through controlled, multi-hop counterfactuals. ACCORD introduces formal elements to commonsense reasoning to explicitly control and quantify reasoning complexity beyond the typical 1 or 2 hops. Uniquely, ACCORD can automatically generate benchmarks of arbitrary reasoning complexity, and so it scales with future LLM improvements. Benchmarking state-of-the-art LLMs -- including GPT-4o (2024-05-13), Llama-3-70B-Instruct, and Mixtral-8x22B-Instruct-v0.1 -- shows performance degrading to random chance with only moderate scaling, leaving substantial headroom for improvement. We release a leaderboard of the benchmark suite tested in this work, as well as code for automatically generating more complex benchmarks.
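To make the "degrading to random chance" comparison concrete, the following is a minimal sketch of how per-complexity accuracy might be tabulated against a chance baseline. It is not the authors' evaluation code: the function names, the record format, the assumption of a multiple-choice setting with `num_choices` options, and the toy records are all hypothetical placeholders, not results from the paper.

```python
from collections import defaultdict


def accuracy_by_hops(records):
    """Group (hops, is_correct) pairs by hop count and return accuracy per group."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for hops, is_correct in records:
        totals[hops] += 1
        correct[hops] += int(is_correct)
    return {h: correct[h] / totals[h] for h in sorted(totals)}


def chance_accuracy(num_choices):
    """Expected accuracy of uniform random guessing among num_choices options."""
    return 1.0 / num_choices


# Toy placeholder records (hypothetical, NOT results from the paper):
# each entry is (number of reasoning hops, whether the model answered correctly).
toy_records = [
    (1, True), (1, True), (1, False), (1, True),
    (3, True), (3, False), (3, False), (3, False),
]

by_hops = accuracy_by_hops(toy_records)
baseline = chance_accuracy(num_choices=4)  # assumed number of answer options

for hops, acc in by_hops.items():
    print(f"hops={hops} accuracy={acc:.2f} margin_over_chance={acc - baseline:+.2f}")
```

A margin near zero at a given hop count would indicate that accuracy is indistinguishable from guessing at that level of reasoning complexity.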

Code & Models

Repositories

francois-rd/accord (official)

Taxonomy

Topics: Scientific Computing and Data Management · Semantic Web and Ontologies