Evaluating Relational Reasoning in LLMs with REL
Lukas Fesser, Yasha Ektefaie, Ada Fang, Sham M. Kakade, Marinka Zitnik

TL;DR
This paper introduces REL, a benchmark framework for evaluating relational reasoning in large language models across algebra, chemistry, and biology. It shows that model performance declines as relational complexity increases.
Contribution
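The paper defines Relational Complexity (RC), the minimum number of independent entities or operands that must be simultaneously bound to apply a relation, as a measure of reasoning difficulty. It builds REL on this definition as a benchmark that systematically assesses LLMs' higher-arity relational reasoning.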
Findings
LLMs' performance decreases monotonically with increasing relational complexity.
The decline persists even with more test-time compute and in-context learning.
Current models struggle with higher-arity reasoning, highlighting a key limitation.
Abstract
Relational reasoning is the ability to infer relations that jointly bind multiple entities, attributes, or variables. This ability is central to scientific reasoning, but existing evaluations of relational reasoning in large language models often focus on structured inputs such as tables, graphs, or synthetic tasks, and do not isolate the difficulty introduced by higher-arity relational binding. We study this problem through the lens of Relational Complexity (RC), which we define as the minimum number of independent entities or operands that must be simultaneously bound to apply a relation. RC provides a principled way to vary reasoning difficulty while controlling for confounders such as input size, vocabulary, and representational choices. Building on RC, we introduce REL, a generative benchmark framework spanning algebra, chemistry, and biology that varies RC within each domain.…
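To make the idea of varying RC at fixed input size concrete, here is a minimal toy sketch in Python. It is not taken from REL: the parity relation, the distractor padding, and the names `make_task` and `solve` are all illustrative assumptions. The relation's answer depends jointly on every bound operand, which loosely mirrors the "simultaneously bound" requirement in the definition, although it is not claimed to be a faithful RC measure.

```python
import random


def make_task(rc, total_items=8, seed=0):
    """Build a toy task whose relation binds exactly `rc` operands.

    Hypothetical construction, not REL's generator: the input always has
    `total_items` values, so input size stays fixed while rc varies.
    """
    if not 1 <= rc <= total_items:
        raise ValueError("rc must be between 1 and total_items")
    rng = random.Random(seed)
    values = [rng.randint(0, 9) for _ in range(total_items)]
    # Indices of the operands the relation binds; the rest are distractors.
    bound = sorted(rng.sample(range(total_items), rc))
    return {"values": values, "bound": bound}


def solve(task):
    """Return True if the bound operands sum to an even number.

    Changing any single bound value by 1 flips the answer, so the result
    depends on all bound operands together.
    """
    return sum(task["values"][i] for i in task["bound"]) % 2 == 0


if __name__ == "__main__":
    for rc in (2, 3, 5):
        task = make_task(rc)
        print(rc, len(task["values"]), task["bound"], solve(task))
```

Padding with distractors is one simple way to keep input length constant across RC levels; a real benchmark would also need to control vocabulary and representation, as the abstract notes.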
