MARS: Benchmarking the Metaphysical Reasoning Abilities of Language Models with a Multi-task Evaluation Dataset

Weiqi Wang; Yangqiu Song

arXiv:2406.02106 · cs.CL · May 22, 2025

TL;DR

This paper introduces MARS, a comprehensive benchmark dataset designed to evaluate large language models' abilities to reason about situational changes and transitions, addressing a critical gap in metaphysical reasoning evaluation.

Contribution

The paper formulates reasoning about distributional changes as a three-step discriminative process and provides MARS, the first benchmark to systematically assess LLMs' metaphysical reasoning abilities, with one task for each step.

Findings

01

All tested LLMs struggle with the tasks, even after fine-tuning.

02

Pre-training on conceptual taxonomies can improve reasoning performance.

03

The benchmark reveals significant challenges in metaphysical reasoning for current models.

Abstract

To enable Large Language Models (LLMs) to function as conscious agents with generalizable reasoning capabilities, it is crucial that they possess the reasoning ability to comprehend situational changes (transitions) in distribution triggered by environmental factors or actions from other agents. Despite its fundamental significance, this ability remains underexplored due to the complexity of modeling infinite possible changes in an event and their associated distributions, coupled with the lack of benchmark data with situational transitions. Addressing these gaps, we propose a novel formulation of reasoning with distributional changes as a three-step discriminative process, termed as MetAphysical ReaSoning. We then introduce the first-ever benchmark, MARS, comprising three tasks corresponding to each step. These tasks systematically assess LLMs' capabilities in reasoning the…

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Explainable Artificial Intelligence (XAI)