Internal Causal Mechanisms Robustly Predict Language Model Out-of-Distribution Behaviors

Jing Huang; Junyi Tao; Thomas Icard; Diyi Yang; Christopher Potts

arXiv:2505.11770 · cs.LG · November 12, 2025

TL;DR

This paper shows that the features playing a distinctive causal role in a language model's behavior are the most robust predictors of whether its outputs are correct, including on out-of-distribution examples.

Contribution

It introduces two methods that use causal mechanisms to predict the correctness of model outputs: counterfactual simulation, which checks whether key causal variables are realized, and value probing, which uses the values of those variables to make predictions.
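To make the idea of counterfactual simulation concrete, here is a minimal toy sketch, not the authors' implementation. It assumes a hypothetical task (output the last digit of a + b), a hypothetical stand-in model with one inspectable intermediate variable, and a hypothetical score defined as the fraction of interventions whose output agrees with the high-level causal model. All function and variable names are invented for illustration.

```python
# Toy illustration only: every name here is hypothetical, not from the paper's code.

def high_level_output(s):
    """High-level causal model: the answer is the last digit of the sum S."""
    return s % 10

def toy_model(inputs, patched_sum=None):
    """Stand-in for a language model with one inspectable intermediate variable.

    For a < 50 the model reads its internal sum (the intended mechanism).
    For a >= 50 it falls back to a shortcut that ignores the sum entirely.
    """
    a, b = inputs
    internal_sum = a + b if patched_sum is None else patched_sum
    if a < 50:
        return internal_sum % 10
    return b % 10

def simulation_score(base, sources):
    """Fraction of interventions whose output matches the high-level prediction."""
    hits = 0
    for src in sources:
        src_sum = src[0] + src[1]                # value of the variable on the source input
        expected = high_level_output(src_sum)    # what the causal model says should happen
        actual = toy_model(base, patched_sum=src_sum)  # patch the variable into the base run
        hits += int(actual == expected)
    return hits / len(sources)

SOURCES = [(1, 2), (3, 4), (5, 6)]
print(simulation_score((2, 5), SOURCES))   # base input where the mechanism is used
print(simulation_score((73, 5), SOURCES))  # base input where the shortcut takes over
```

In this toy, the base input on which the patched variable controls the output is also the one the stand-in model answers correctly, while the shortcut input both ignores the patch and gives a wrong answer. That is the intuition behind using such a score as a correctness predictor. A real language model would need interventions on hidden activations rather than on a Python variable.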

Findings

01. Causal features are highly predictive of model correctness.

02. The proposed methods outperform causal-agnostic baselines in OOD settings.

03. Internal causal mechanisms can reliably forecast model behavior beyond the training distribution.

Abstract

Interpretability research now offers a variety of techniques for identifying abstract internal mechanisms in neural networks. Can such techniques be used to predict how models will behave on out-of-distribution examples? In this work, we provide a positive answer to this question. Through a diverse set of language modeling tasks--including symbol manipulation, knowledge retrieval, and instruction following--we show that the most robust features for correctness prediction are those that play a distinctive causal role in the model's behavior. Specifically, we propose two methods that leverage causal mechanisms to predict the correctness of model outputs: counterfactual simulation (checking whether key causal variables are realized) and value probing (using the values of those variables to make predictions). Both achieve high AUC-ROC in distribution and outperform methods that rely on…
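The abstract reports results as AUC-ROC for correctness prediction. As a minimal sketch of what that metric measures, the snippet below computes AUC-ROC from scratch as the probability that a randomly chosen correct example receives a higher predictor score than a randomly chosen incorrect one. The scores and labels are made-up illustrative values, not results from the paper, and the function name `auc_roc` is an assumption.

```python
def auc_roc(scores, labels):
    """AUC-ROC via pairwise comparison: P(score of a correct example > score
    of an incorrect example), with ties counted as one half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one correct and one incorrect example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Hypothetical predictor scores and correctness labels (1 = model output was correct).
scores = [0.9, 0.8, 0.35, 0.6, 0.2, 0.1]
labels = [1, 1, 1, 0, 0, 0]
print(auc_roc(scores, labels))  # 8 of the 9 correct/incorrect pairs are ranked properly
```

A value of 1.0 means the predictor ranks every correct example above every incorrect one, and 0.5 corresponds to chance-level ranking. The pairwise loop is quadratic and is meant for clarity. For larger evaluations a library routine such as scikit-learn's `roc_auc_score` would be the usual choice.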

Code & Models

Repositories

explanare/ood-prediction (Official)

Videos

Internal Causal Mechanisms Robustly Predict Language Model Out-of-Distribution Behaviors · SlidesLive

Taxonomy

Topics: Explainable Artificial Intelligence (XAI) · Topic Modeling · Adversarial Robustness in Machine Learning

Methods: Sparse Evolutionary Training