Abstract Counterfactuals for Language Model Agents
Edoardo Pona, Milad Kazemi, Yali Du, David Watson, Nicola Paoletti

TL;DR
This paper introduces Abstract Counterfactuals, a framework for high-level counterfactual reasoning in language model agents, addressing limitations of token-level methods by focusing on meaningful, environment-relevant features.
Contribution
The paper presents a novel framework for counterfactual inference in LM agents that emphasizes high-level features, improving interpretability and relevance over token-level approaches.
Findings
Produces consistent and meaningful counterfactuals
Reduces side effects compared to token-level methods
Effective in text-based games and counterfactual text generation
Abstract
Counterfactual inference is a powerful tool for analysing and evaluating autonomous agents, but its application to language model (LM) agents remains challenging. Existing work on counterfactuals in LMs has primarily focused on token-level counterfactuals, which are often inadequate for LM agents due to their open-ended action spaces. Unlike traditional agents with fixed, clearly defined action spaces, the actions of LM agents are often implicit in the strings they output, making their action spaces difficult to define and interpret. Furthermore, the meanings of individual tokens can shift depending on the context, adding complexity to token-level reasoning and sometimes leading to biased or meaningless counterfactuals. We introduce Abstract Counterfactuals, a framework that emphasises high-level characteristics of actions and interactions within an environment, enabling…
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Semantic Web and Ontologies
