InterveneBench: Benchmarking LLMs for Intervention Reasoning and Causal Study Design in Real Social Systems

Shaojie Shi, Zhengyu Shi, Lingran Zheng, Xinyu Su, Anna Xie, Bohao Lv, Rui Xu, Zijian Chen, Zhichao Chen, Guolei Liu, Naifu Zhang, Mingjian Dong, Zhuo Quan, Bohao Chen, Teqi Hao, Yuan Qi, Yinghui Xu, and Libo Wu

arXiv:2603.15542 · cs.CY · March 17, 2026

Open Access

TL;DR

InterveneBench is a benchmark for evaluating how well large language models reason about policy interventions and causal study design in realistic social science settings. It shows that current state-of-the-art models struggle with this task, and the authors propose a multi-agent framework, STRIDES, to address the gap.

Contribution

The paper introduces InterveneBench, a benchmark based on real social science studies, and proposes STRIDES, a multi-agent framework that improves LLM reasoning in causal intervention tasks.

Findings

01. State-of-the-art LLMs perform poorly on InterveneBench.

02. STRIDES significantly outperforms existing reasoning models.

03. InterveneBench covers 744 diverse social science studies.

Abstract

Causal inference in social science relies on end-to-end, intervention-centered research-design reasoning grounded in real-world policy interventions, but current benchmarks fail to evaluate this capability of large language models (LLMs). We present InterveneBench, a benchmark designed to assess such reasoning in realistic social settings. Each instance in InterveneBench is derived from an empirical social science study and requires models to reason about policy interventions and identification assumptions without access to predefined causal graphs or structural equations. InterveneBench comprises 744 peer-reviewed studies across diverse policy domains. Experimental results show that state-of-the-art LLMs struggle under this setting. To address this limitation, we further propose a multi-agent framework, STRIDES. It achieves significant performance improvements over state-of-the-art…

Taxonomy

Topics: Bayesian Modeling and Causal Inference · Advanced Causal Inference Techniques · Explainable Artificial Intelligence (XAI)