Counterfactual Token Generation in Large Language Models

Ivi Chatzi; Nina Corvelo Benz; Eleni Straitouri; Stratis Tsirtsis; Manuel Gomez-Rodriguez

arXiv:2409.17027·cs.LG·March 26, 2025

TL;DR

This paper introduces a simple, efficient method for enabling large language models to generate counterfactual tokens, allowing for reasoning about alternative scenarios without additional training or fine-tuning.

Contribution

We propose a causal model based on the Gumbel-Max structural causal model that enables counterfactual token generation in large language models without fine-tuning.
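
The idea behind a Gumbel-Max structural causal model can be illustrated with a minimal sketch. Sampling a token as the argmax of `logit + Gumbel noise` is equivalent to sampling from the softmax distribution, so if the noise is stored, generation can be replayed after changing an earlier token while holding the noise fixed. Everything below is an assumption made for illustration: `toy_logits` is a hypothetical stand-in for a real language model, and the function names are not taken from the authors' repository.

```python
import math
import random


def sample_gumbel_noise(rng, vocab_size):
    """Draw one standard Gumbel(0, 1) value per vocabulary entry."""
    noise = []
    for _ in range(vocab_size):
        u = rng.random()
        while u == 0.0:  # avoid log(0)
            u = rng.random()
        noise.append(-math.log(-math.log(u)))
    return noise


def toy_logits(tokens, vocab_size=6):
    """Hypothetical stand-in for an LLM: deterministic logits for a prefix."""
    prefix_rng = random.Random(repr(tuple(tokens)))
    return [prefix_rng.uniform(-2.0, 2.0) for _ in range(vocab_size)]


def gumbel_max_token(logits, noise):
    """Gumbel-Max trick: argmax of logits plus noise is a softmax sample."""
    scores = [l + g for l, g in zip(logits, noise)]
    return max(range(len(scores)), key=scores.__getitem__)


def generate(prefix, noises):
    """Autoregressive generation with one stored noise vector per step."""
    tokens = list(prefix)
    for noise in noises:
        tokens.append(gumbel_max_token(toy_logits(tokens), noise))
    return tokens


def counterfactual(prefix, noises, position, new_token):
    """Force `new_token` at generated step `position`, then replay the
    remaining steps with the SAME noise used in the factual run."""
    factual = generate(prefix, noises)
    cut = len(prefix) + position
    tokens = factual[:cut] + [new_token]
    for noise in noises[position + 1:]:
        tokens.append(gumbel_max_token(toy_logits(tokens), noise))
    return tokens


rng = random.Random(0)
noises = [sample_gumbel_noise(rng, 6) for _ in range(8)]
factual = generate([0], noises)
# Intervene on the third generated token (index 3 overall) with a different token.
alternative = counterfactual([0], noises, position=2, new_token=(factual[3] + 1) % 6)
```

Because the noise is shared, an intervention that re-selects the factual token reproduces the factual sequence exactly, whereas an ordinary second sample from the model would generally differ everywhere.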

Findings

01 Counterfactual token generation is feasible with minimal computational overhead.

02 The method works effectively on Llama 3 8B-Instruct and Ministral-8B-Instruct models.

03 Counterfactual analysis reveals biases and world models in language models.

Abstract

"Sure, I am happy to generate a story for you: Captain Lyra stood at the helm of her trusty ship, the Maelstrom's Fury, gazing out at the endless sea. [...] Lyra's eyes welled up with tears as she realized the bitter truth - she had sacrificed everything for fleeting riches, and lost the love of her crew, her family, and herself." Although this story, generated by a large language model, is captivating, one may wonder -- how would the story have unfolded if the model had chosen "Captain Maeve" as the protagonist instead? We cannot know. State-of-the-art large language models are stateless -- they maintain no internal memory or state. Given a prompt, they generate a sequence of tokens as an output using an autoregressive process. As a consequence, they cannot reason about counterfactual alternatives to tokens they have generated in the past. In this work, our goal is to enhance them with…

Code & Models

Repositories

networks-learning/counterfactual-llms (PyTorch, official)

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Digital and Cyber Forensics · Topic Modeling

Methods: Attention Model · LLaMA