Unveiling and Manipulating Prompt Influence in Large Language Models
Zijian Feng, Hanzhang Zhou, Zixiao Zhu, Junlang Qian, Kezhi Mao

TL;DR
This paper introduces Token Distribution Dynamics (TDD), a novel method for interpreting and manipulating prompt influence in large language models, improving understanding and control over generated outputs.
Contribution
We propose TDD, a new approach leveraging the language model head to accurately interpret and manipulate prompt influence in LLMs, surpassing existing saliency methods.
Findings
TDD outperforms state-of-the-art baselines in elucidating prompt-output relationships.
TDD effectively identifies toxic and sentimental cues for controlled text generation.
Empirical results demonstrate TDD's success in reducing toxicity and steering sentiment in outputs.
Abstract
Prompts play a crucial role in guiding the responses of Large Language Models (LLMs). However, the intricate role of individual tokens in prompts, known as input saliency, in shaping the responses remains largely underexplored. Existing saliency methods either misalign with LLM generation objectives or rely heavily on linearity assumptions, leading to potential inaccuracies. To address this, we propose Token Distribution Dynamics (TDD), a simple yet effective approach to unveil and manipulate the role of prompts in generating LLM outputs. TDD leverages the robust interpreting capabilities of the language model head (LM head) to assess input saliency. It projects input tokens into the embedding space and then estimates their significance based on distribution dynamics over the vocabulary. We introduce three TDD variants: forward, backward, and bidirectional, each…
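The projection step described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: it assumes a "forward"-style reading in which each prefix's hidden state is projected through the LM head, and a token's saliency is the change it causes in the probability margin between a target token and an alternative token. The names `forward_saliency_sketch`, `toy_head`, and `toy_hidden`, the random weights, and the exact margin formula are all hypothetical stand-ins chosen for illustration.

```python
import math
import random

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def project(hidden, lm_head):
    """Project one hidden state (d floats) through an LM head (d x vocab)
    and return a probability distribution over the vocabulary."""
    vocab = len(lm_head[0])
    logits = [sum(h * row[v] for h, row in zip(hidden, lm_head))
              for v in range(vocab)]
    return softmax(logits)

def forward_saliency_sketch(hidden_states, lm_head, target_id, alt_id):
    """Assumed scoring rule: saliency of token i is the change in
    p(target) - p(alternative) when token i is appended to the prefix."""
    saliency, prev_margin = [], 0.0
    for hidden in hidden_states:
        probs = project(hidden, lm_head)
        margin = probs[target_id] - probs[alt_id]
        saliency.append(margin - prev_margin)
        prev_margin = margin
    return saliency

# Toy data standing in for a real model's hidden states and LM head.
random.seed(0)
D, VOCAB, N_TOKENS = 8, 20, 5
toy_head = [[random.gauss(0, 1) for _ in range(VOCAB)] for _ in range(D)]
toy_hidden = [[random.gauss(0, 1) for _ in range(D)] for _ in range(N_TOKENS)]

scores = forward_saliency_sketch(toy_hidden, toy_head, target_id=3, alt_id=7)
for position, score in enumerate(scores):
    print(f"token {position}: saliency {score:+.4f}")
```

Under this assumed rule the per-token scores telescope, so they sum to the final target-versus-alternative margin. With a real model, the hidden states would come from the final layer at each prompt position rather than from random draws.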
Peer Reviews
Decision: ICLR 2024 poster
- The idea of analyzing the token distributions throughout the progression of prediction is quite interesting. The idea is simple but seems to be quite useful in unveiling the importance of input tokens when providing contrastive explanations.
- The authors did experiments over a fairly comprehensive set of language models including GPT-2/J, BLOOM, and LLaMA.
- The applications on toxic language suppression and sentiment steering further demonstrate the usefulness of the proposed TDD method.
- One simple way to control LLMs' generations is via prompting for style transfer, e.g., asking the model to transform the outputs into "less toxic" content or "positive/negative" sentiment. How would this baseline compare to the proposed TDD?
- The models used in the experiments are relatively small (maximum size 7B). How would the proposed approach work on larger models (e.g., LLaMA-2 13B, 70B)? What is the computation cost (efficiency, memory) of running TDD over larger models?
- The pr
- The proposed method is simple and efficient.
- In general, the paper is well-written, with a clear introduction, method, and description of experiments.
- Evaluation of multiple autoregressive models.
- The authors seem to provide all the necessary details for reproducible experiments.
- Additional showcasing on interesting and important use cases.
- The difference between the TDD variants is not well discussed. While the applications are very interesting, the authors could have used the space to elaborate on the differences between the introduced variants.
- The captions of Table 1 and Table 3 are too sparse.
- While not a contrastive XAI method, important related work on explainability for autoregressive LMs is missing: ATMAN: Understanding Transformer Predictions Through Memory Efficient Attention Manipulation. Björn Deiseroth, Mayukh Deb
- The method is simple and efficient.
- The analysis towards understanding causal mechanisms of token probability distributions is a timely and important research topic.
- The TDD methods heavily depend on the alternative word $w_a$. The datasets chosen in this paper (the BLiMP datasets) provide sentence pairs that differ by exactly one word. Other datasets may not have such well-defined alternative words. This greatly limits the potential applicability of the proposed TDD methods.
- A related point: it is unclear to me how the $w_a$ in the Section 5 experiments are identified.
- The analyses presented in this paper are not really causal analyses, despite
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Speech and dialogue systems
