Counterfactual Token Generation in Large Language Models
Ivi Chatzi, Nina Corvelo Benz, Eleni Straitouri, Stratis Tsirtsis, Manuel Gomez-Rodriguez

TL;DR
This paper introduces a simple, efficient method for enabling large language models to generate counterfactual tokens, allowing for reasoning about alternative scenarios without additional training or fine-tuning.
Contribution
We propose a causal model of token generation, built on the Gumbel-Max structural causal model, that enables counterfactual token generation in large language models without fine-tuning.
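The idea behind Gumbel-Max sampling can be illustrated with a minimal sketch. It is not the authors' implementation: `toy_logits` is a hypothetical stand-in for a language model, and the names `sample_gumbel`, `gumbel_max_sample`, and `generate` are illustrative assumptions. The sketch draws each token as the argmax of logits plus Gumbel noise, stores the noise, and replays the same noise after changing the first token to obtain a counterfactual sequence.

```python
import math
import random


def sample_gumbel(rng, n):
    # Standard Gumbel(0, 1) noise via inverse transform: -log(-log(U)).
    # U is clamped away from 0 so the logarithm is always defined.
    return [-math.log(-math.log(max(rng.random(), 1e-12))) for _ in range(n)]


def toy_logits(prev_token, vocab_size):
    # Hypothetical stand-in for an LLM: deterministic logits that depend
    # only on the previous token. A real model would condition on the
    # whole prefix.
    return [math.sin((prev_token + 1) * (k + 1)) for k in range(vocab_size)]


def gumbel_max_sample(logits, noise):
    # Gumbel-Max trick: argmax(logits + Gumbel noise) is a sample from
    # softmax(logits).
    return max(range(len(logits)), key=lambda k: logits[k] + noise[k])


def generate(first_token, noise_per_step, vocab_size):
    # Autoregressive rollout in which all randomness comes from the
    # supplied noise, so the same noise can be replayed later.
    tokens = [first_token]
    for noise in noise_per_step:
        logits = toy_logits(tokens[-1], vocab_size)
        tokens.append(gumbel_max_sample(logits, noise))
    return tokens


vocab_size, length = 5, 6
rng = random.Random(0)
noise_per_step = [sample_gumbel(rng, vocab_size) for _ in range(length)]

# Factual sequence, starting from token 0.
factual = generate(0, noise_per_step, vocab_size)

# Counterfactual sequence: intervene on the first token, reuse the noise.
counterfactual = generate(1, noise_per_step, vocab_size)

print("factual:       ", factual)
print("counterfactual:", counterfactual)
```

Because the noise is fixed, replaying it without any intervention reproduces the factual sequence exactly. Any difference between the two printed sequences is therefore attributable to the changed token rather than to fresh randomness.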
Findings
Counterfactual token generation is feasible with minimal computational overhead.
The method works effectively on Llama 3 8B-Instruct and Ministral-8B-Instruct models.
Counterfactual analysis reveals biases and world models in language models.
Abstract
"Sure, I am happy to generate a story for you: Captain Lyra stood at the helm of her trusty ship, the Maelstrom's Fury, gazing out at the endless sea. [...] Lyra's eyes welled up with tears as she realized the bitter truth - she had sacrificed everything for fleeting riches, and lost the love of her crew, her family, and herself." Although this story, generated by a large language model, is captivating, one may wonder -- how would the story have unfolded if the model had chosen "Captain Maeve" as the protagonist instead? We cannot know. State-of-the-art large language models are stateless -- they maintain no internal memory or state. Given a prompt, they generate a sequence of tokens as an output using an autoregressive process. As a consequence, they cannot reason about counterfactual alternatives to tokens they have generated in the past. In this work, our goal is to enhance them with…
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Digital and Cyber Forensics · Topic Modeling
Methods: Attention Model · LLaMA
