METER: Evaluating Multi-Level Contextual Causal Reasoning in Large Language Models

Pengfeng Li; Chen Huang; Chaoqun Hao; Hongyao Chen; Xiao-Yong Wei; Wenqiang Lei; See-Kiong Ng

arXiv:2604.11502 · cs.CL · April 17, 2026

TL;DR

This paper introduces METER, a comprehensive benchmark for evaluating large language models' ability to perform multi-level contextual causal reasoning across the entire causal hierarchy.

Contribution

METER systematically assesses LLMs at every level of the causal hierarchy within a unified context, exposing how performance degrades and diagnosing the underlying failure modes.
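A per-level breakdown of this kind can be sketched as follows. This is a minimal illustration, not the authors' evaluation code: the record fields (`context_id`, `level`, `gold`, `pred`) and the function name `accuracy_by_level` are hypothetical, the level names follow Pearl's ladder of causation (association, intervention, counterfactual), and the toy records are invented for the example rather than taken from METER.

```python
from collections import defaultdict

# The three rungs of Pearl's causal ladder, ordered from lowest to highest.
LEVELS = ("association", "intervention", "counterfactual")


def accuracy_by_level(records):
    """Return {level: accuracy} for records carrying 'level', 'gold', 'pred'.

    The record layout is an assumption made for this sketch; it is not the
    METER dataset schema.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for record in records:
        total[record["level"]] += 1
        correct[record["level"]] += int(record["pred"] == record["gold"])
    # Report only levels that actually occur, in ladder order.
    return {lvl: correct[lvl] / total[lvl] for lvl in LEVELS if total[lvl]}


# Invented toy records: each context is queried once per causal level,
# mimicking a "unified context" setup.
records = [
    {"context_id": "c1", "level": "association", "gold": "A", "pred": "A"},
    {"context_id": "c1", "level": "intervention", "gold": "B", "pred": "B"},
    {"context_id": "c1", "level": "counterfactual", "gold": "C", "pred": "A"},
    {"context_id": "c2", "level": "association", "gold": "B", "pred": "B"},
    {"context_id": "c2", "level": "intervention", "gold": "A", "pred": "C"},
    {"context_id": "c2", "level": "counterfactual", "gold": "D", "pred": "B"},
]

scores = accuracy_by_level(records)
for level, acc in scores.items():
    print(f"{level:15s} {acc:.2f}")
```

Because every context contributes one question per level, differences between the per-level scores reflect the reasoning level rather than a change of context.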

Findings

01

LLM performance declines significantly as tasks ascend the causal hierarchy.

02

Two main failure modes are identified: distraction by causally irrelevant but factually correct information, and reduced context faithfulness.

03

Code and dataset are publicly available in the SCUNLP/METER repository on GitHub.

Abstract

Contextual causal reasoning is a critical yet challenging capability for Large Language Models (LLMs). Existing benchmarks, however, often evaluate this skill in fragmented settings, failing to ensure context consistency or cover the full causal hierarchy. To address this, we pioneer METER to systematically benchmark LLMs across all three levels of the causal ladder under a unified context setting. Our extensive evaluation of various LLMs reveals a significant decline in proficiency as tasks ascend the causal hierarchy. To diagnose this degradation, we conduct a deep mechanistic analysis via both error pattern identification and internal information flow tracing. Our analysis reveals two primary failure modes: (1) LLMs are susceptible to distraction by causally irrelevant but factually correct information at lower levels of causality; and (2) as tasks ascend the causal hierarchy,…

Code & Models

Repositories

SCUNLP/METER (GitHub)
