ReplaySCM: A Benchmark for Executable Causal Mechanism Induction from Interventions
Serafim Batzoglou

TL;DR
ReplaySCM is a benchmark that tests whether models can induce executable causal mechanisms from intervention data, scoring replay behavior rather than syntactic similarity.
Contribution
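The paper introduces a 1,300-item benchmark that varies how much structural information is disclosed (Ordered, Block-order, Hidden-order, and Hidden-roots settings), adds Alternative-SCM tasks, and evaluates the robustness of large language models at causal mechanism induction.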
Findings
Held-out replay performance drops sharply when order or root structure is hidden.
Support-audit methods significantly improve local predecessor-pattern coverage.
No semantic alternative remains consistent with training worlds under stronger evidence.
Abstract
Most causal benchmarks for language models score local answers or graph structure. We introduce ReplaySCM, a 1,300-item benchmark for executable causal mechanism induction from finite interventional evidence. Each item contains binary worlds generated by a latent fully observed acyclic Boolean structural causal model (SCM). A system must output a mechanism map in a restricted Boolean DSL; the submission is parsed, checked for legality and acyclicity, and replayed on training and held-out intervention worlds. Scoring uses replay behavior rather than formula strings, so syntactically different mechanisms receive credit when they behave correctly. ReplaySCM varies the structural information disclosed to the model through Ordered, Block-order, Hidden-order, and Hidden-roots settings, and includes Alternative-SCM tasks that supply a valid reference SCM and ask for a semantically distinct…
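The check-and-replay loop described in the abstract can be illustrated with a minimal Python sketch. It is not the benchmark's actual DSL or scorer: the nested-tuple expression format, the operator set (`var`, `const`, `not`, `and`, `or`), the function names `parents`, `evaluate`, `replay`, and `replay_accuracy`, and the world layout (`roots`, `do`, `observed`) are all assumptions made for illustration.

```python
from graphlib import TopologicalSorter, CycleError

# Hypothetical expression format (an assumption, not the paper's DSL):
# ("var", name), ("const", 0 or 1), ("not", e), ("and", e1, e2, ...), ("or", e1, e2, ...)

def parents(expr):
    """Collect the variable names referenced by an expression."""
    op = expr[0]
    if op == "var":
        return {expr[1]}
    if op == "const":
        return set()
    return set().union(*(parents(e) for e in expr[1:]))

def evaluate(expr, values):
    """Evaluate an expression to 0 or 1 given already-computed variable values."""
    op = expr[0]
    if op == "var":
        return values[expr[1]]
    if op == "const":
        return expr[1]
    if op == "not":
        return 1 - evaluate(expr[1], values)
    args = [evaluate(e, values) for e in expr[1:]]
    if op == "and":
        return int(all(args))
    if op == "or":
        return int(any(args))
    raise ValueError(f"illegal operator: {op}")

def replay(mechanisms, roots, interventions):
    """Replay one world: root values are given, intervened variables are forced,
    and every other variable is computed from its mechanism in topological order.
    Raises graphlib.CycleError if the mechanism map is cyclic."""
    graph = {v: parents(e) for v, e in mechanisms.items()}
    order = list(TopologicalSorter(graph).static_order())
    values = dict(roots)
    for v in order:
        if v in interventions:
            values[v] = interventions[v]
        elif v in mechanisms:
            values[v] = evaluate(mechanisms[v], values)
    return values

def replay_accuracy(mechanisms, worlds):
    """Fraction of worlds whose observed values are all reproduced by replay."""
    hits = 0
    for w in worlds:
        predicted = replay(mechanisms, w["roots"], w["do"])
        hits += all(predicted[v] == w["observed"][v] for v in w["observed"])
    return hits / len(worlds)
```

Under this sketch, `{"C": ("and", ("var", "A"), ("var", "B"))}` and its De Morgan rewrite `{"C": ("not", ("or", ("not", ("var", "A")), ("not", ("var", "B"))))}` obtain the same replay accuracy on any set of worlds, which is the sense in which behavior rather than formula strings is scored.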
