Can Post-Training Transform LLMs into Causal Reasoners?
Junqi Chen, Sirui Chen, Chaochao Lu

TL;DR
This paper demonstrates that targeted post-training significantly enhances large language models' ability to perform causal inference, achieving high accuracy and robustness across various benchmarks and real-world scenarios.
Contribution
The paper introduces CauGym, a dataset of seven core causal tasks for training plus five test sets, and systematically evaluates five post-training methods (SFT, DPO, KTO, PPO, and GRPO), showing that they improve LLM causal reasoning.
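To make "causal tasks" concrete, here is a minimal sketch of one standard causal quantity, the average treatment effect (ATE) computed by backdoor adjustment over a single binary confounder. This summary does not list CauGym's tasks, so treating ATE as representative is an assumption; the function name `backdoor_ate` and all probabilities are invented for illustration.

```python
# Hypothetical illustration: ATE via backdoor adjustment.
# ATE = sum_z P(z) * (P(Y=1 | X=1, z) - P(Y=1 | X=0, z))

def backdoor_ate(p_z, p_y_given_xz):
    """Return the ATE of X on Y, adjusting for a discrete confounder Z.

    p_z:          dict mapping z -> P(Z=z)
    p_y_given_xz: dict mapping (x, z) -> P(Y=1 | X=x, Z=z)
    """
    return sum(
        p_z[z] * (p_y_given_xz[(1, z)] - p_y_given_xz[(0, z)])
        for z in p_z
    )

# Made-up numbers, not taken from the paper or its dataset.
p_z = {0: 0.6, 1: 0.4}
p_y_given_xz = {
    (1, 0): 0.7, (0, 0): 0.5,  # stratum Z=0
    (1, 1): 0.9, (0, 1): 0.4,  # stratum Z=1
}

ate = backdoor_ate(p_z, p_y_given_xz)
print(f"ATE = {ate:.2f}")
```

A question of this kind asks for a specific number rather than a qualitative judgment, which is why precision matters when a model's answer is scored.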
Findings
Smaller LLMs can outperform much larger models when given appropriate post-training.
A 14B-parameter model achieves 93.5% accuracy on the CaLM benchmark.
Post-trained LLMs show strong generalization and robustness.
Abstract
Causal inference is essential for decision-making but remains challenging for non-experts. While large language models (LLMs) show promise in this domain, their precise causal estimation capabilities are still limited, and the impact of post-training on these abilities is insufficiently explored. This paper examines the extent to which post-training can enhance LLMs' capacity for causal inference. We introduce CauGym, a comprehensive dataset comprising seven core causal tasks for training and five diverse test sets. Using this dataset, we systematically evaluate five post-training approaches: SFT, DPO, KTO, PPO, and GRPO. Across five in-domain and four existing benchmarks, our experiments demonstrate that appropriate post-training enables smaller LLMs to perform causal inference competitively, often surpassing much larger models. Our 14B parameter model achieves 93.5% accuracy on the…
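Of the five post-training approaches named in the abstract, DPO has a compact closed-form objective that is easy to sketch. The block below computes the standard per-example DPO loss from sequence log-probabilities. It is an illustration of the general technique, not the authors' training code; the function name, the default `beta`, and the example log-probabilities are assumptions.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-example DPO loss: -log(sigmoid(beta * margin)).

    The margin measures how much more the policy favors the chosen answer
    over the rejected one, relative to a frozen reference model.
    """
    margin = beta * (
        (policy_chosen_logp - ref_chosen_logp)
        - (policy_rejected_logp - ref_rejected_logp)
    )
    # Numerically stable form of -log(sigmoid(margin)).
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

# Made-up log-probabilities, for illustration only.
loss_neutral = dpo_loss(-1.0, -1.0, -1.0, -1.0)   # policy equals reference
loss_improved = dpo_loss(-0.5, -2.0, -1.0, -1.0)  # policy favors the chosen answer
print(loss_neutral, loss_improved)
```

When the policy matches the reference, the loss equals log 2; it falls below that as the policy shifts probability toward the preferred answer.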
Taxonomy
Topics: Bayesian Modeling and Causal Inference · Explainable Artificial Intelligence (XAI) · Topic Modeling
