Using large language models for sensitivity analysis in causal inference: case studies on Cornfield inequality and E-value
Qingyan Xiang, Jiahao Zhang, Bojian Feng

TL;DR
This study evaluates the effectiveness of large language models in conducting sensitivity analyses for observational studies, focusing on Cornfield inequalities and E-values, and finds that some models can accurately assist in these assessments.
Contribution
First investigation into using LLMs for sensitivity analysis, demonstrating their potential to accurately reproduce E-values and identify plausible unmeasured confounders.
Findings
ChatGPT, Claude, and Gemini accurately reproduce E-values.
DeepSeek shows small biases in E-value calculation.
All models identify plausible unmeasured confounders.
Abstract
Sensitivity analysis methods such as the Cornfield inequality and the E-value were developed to assess the robustness of observed associations against unmeasured confounding, a major challenge in observational studies. However, calculating and interpreting these methods can be difficult for clinicians and interdisciplinary researchers. Recent advances in large language models (LLMs) offer accessible tools that could assist sensitivity analyses, but their reliability in this context has not been studied. We assess four widely used LLMs (ChatGPT, Claude, DeepSeek, and Gemini) on their ability to conduct sensitivity analyses using Cornfield inequalities and E-values. We first extract study-specific information (exposures, outcomes, measured confounders, and effect estimates) from four published observational studies in different fields. Using such information, we develop…
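For readers unfamiliar with the quantity the models are asked to reproduce, the following is a minimal sketch of the standard E-value formula for a risk ratio, E = RR + sqrt(RR × (RR − 1)), with estimates below 1 inverted first. It is not taken from the paper; the function names `e_value` and `e_value_ci` are hypothetical, and the sketch assumes the effect estimate is already on the risk-ratio scale.

```python
import math


def e_value(rr: float) -> float:
    """E-value for a point estimate on the risk-ratio scale (illustrative helper)."""
    if rr <= 0:
        raise ValueError("risk ratio must be positive")
    # Protective estimates (RR < 1) are inverted so the formula applies.
    if rr < 1:
        rr = 1.0 / rr
    return rr + math.sqrt(rr * (rr - 1.0))


def e_value_ci(lower: float, upper: float) -> float:
    """E-value for the confidence limit closest to the null (illustrative helper).

    Returns 1.0 when the interval already contains the null value of 1.
    """
    if lower <= 0 or upper < lower:
        raise ValueError("expected 0 < lower <= upper")
    if lower <= 1.0 <= upper:
        return 1.0
    # Use the limit nearer to 1: the lower limit if RR > 1, else the upper limit.
    closest = lower if lower > 1.0 else upper
    return e_value(closest)


if __name__ == "__main__":
    # Example with made-up numbers: RR = 2.0, 95% CI (1.5, 3.0).
    print(round(e_value(2.0), 3))
    print(round(e_value_ci(1.5, 3.0), 3))
```

An E-value of this kind is read as the minimum strength of association, on the risk-ratio scale, that an unmeasured confounder would need with both exposure and outcome to fully explain away the observed estimate.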
