Uncovering Hidden Correctness in LLM Causal Reasoning via Symbolic Verification
Paul He, Yinya Huang, Mrinmaya Sachan, Zhijing Jin

TL;DR
This paper introduces DoVerifier, a symbolic verification method that assesses the semantic correctness of LLM-generated causal reasoning expressions, improving evaluation accuracy over surface-level metrics.
Contribution
We propose DoVerifier, a symbolic verifier that checks the formal validity of causal reasoning in LLM outputs, addressing limitations of existing benchmarks.
Findings
DoVerifier recovers correct causal answers missed by traditional metrics.
It evaluates the semantic correctness of causal reasoning more accurately than string matching or other surface-level metrics.
It improves evaluation on both synthetic data and causal QA benchmarks.
Abstract
Large language models (LLMs) are increasingly being applied to tasks that involve causal reasoning. However, current benchmarks often rely on string matching or surface-level metrics that do not capture whether the output of a model is formally valid under the semantics of causal reasoning. To address this, we propose DoVerifier, a simple symbolic verifier that checks whether LLM-generated causal expressions are derivable from a given causal graph using rules from do-calculus and probability theory. This allows us to recover correct answers to causal queries that would otherwise be marked incorrect due to superficial differences in their causal semantics. Our evaluations on synthetic data and causal QA benchmarks show that DoVerifier more accurately captures semantic correctness of causal reasoning traces, offering a more rigorous and informative way to evaluate LLMs on causal reasoning.
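To make "derivable from a given causal graph using rules from do-calculus" more concrete, here is a minimal sketch of checking one such rule against a graph. It is not DoVerifier's implementation: the function names (`ancestors`, `d_separated`, `rule2_applies`) and the three-node confounded graph are hypothetical and chosen only for illustration. The sketch tests Rule 2 of do-calculus (action/observation exchange) in its simplest form: P(y | do(x), w) = P(y | x, w) holds when Y and X are d-separated given W in the graph with all edges out of X removed.

```python
from itertools import combinations

def ancestors(edges, nodes):
    """Return `nodes` together with all of their ancestors in the DAG."""
    result = set(nodes)
    frontier = list(nodes)
    while frontier:
        child = frontier.pop()
        for parent, c in edges:
            if c == child and parent not in result:
                result.add(parent)
                frontier.append(parent)
    return result

def d_separated(edges, xs, ys, zs):
    """Test whether xs and ys are d-separated given zs.

    Uses the standard ancestral moral graph construction: restrict to
    ancestors, marry co-parents, drop directions, delete zs, then check
    whether any node in ys is still reachable from xs.
    """
    keep = ancestors(edges, set(xs) | set(ys) | set(zs))
    sub = [(p, c) for p, c in edges if p in keep and c in keep]
    undirected = {n: set() for n in keep}
    for p, c in sub:
        undirected[p].add(c)
        undirected[c].add(p)
    for node in keep:
        parents = [p for p, c in sub if c == node]
        for a, b in combinations(parents, 2):  # marry co-parents
            undirected[a].add(b)
            undirected[b].add(a)
    blocked = set(zs)
    seen = set(xs) - blocked
    stack = list(seen)
    while stack:
        n = stack.pop()
        for m in undirected[n]:
            if m not in blocked and m not in seen:
                seen.add(m)
                stack.append(m)
    return not (seen & set(ys))

def rule2_applies(edges, y, x, w):
    """Hypothetical helper: can do(x) be replaced by observing x, given w?"""
    pruned = [(p, c) for p, c in edges if p != x]  # remove edges out of x
    return d_separated(pruned, {x}, {y}, set(w))

# Illustrative graph (an assumption): Z confounds X and Y, and X causes Y.
edges = [("Z", "X"), ("Z", "Y"), ("X", "Y")]
adjusted = rule2_applies(edges, "Y", "X", {"Z"})
unadjusted = rule2_applies(edges, "Y", "X", set())
```

In this example graph, `adjusted` is `True` and `unadjusted` is `False`: conditioning on the confounder Z licenses rewriting P(y | do(x), z) as P(y | x, z), while P(y | do(x)) cannot be rewritten as P(y | x). A verifier of the kind described in the abstract would presumably chain such rule checks together with probability identities to decide whether two differently written expressions are equivalent.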
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Advanced Graph Neural Networks
