Unveiling and Manipulating Prompt Influence in Large Language Models

Zijian Feng; Hanzhang Zhou; Zixiao Zhu; Junlang Qian; Kezhi Mao

arXiv:2405.11891 · cs.CL · May 21, 2024 · 2 cites

TL;DR

This paper introduces Token Distribution Dynamics (TDD), a novel method for interpreting and manipulating prompt influence in large language models, improving understanding and control over generated outputs.

Contribution

We propose TDD, a new approach leveraging the language model head to accurately interpret and manipulate prompt influence in LLMs, surpassing existing saliency methods.

Findings

01. TDD outperforms state-of-the-art baselines in elucidating prompt-output relationships.

02. TDD effectively identifies toxic and sentimental cues for controlled text generation.

03. Empirical results demonstrate TDD's success in reducing toxicity and steering sentiment in outputs.

Abstract

Prompts play a crucial role in guiding the responses of Large Language Models (LLMs). However, the intricate role of individual tokens in prompts, known as input saliency, in shaping the responses remains largely underexplored. Existing saliency methods either misalign with LLM generation objectives or rely heavily on linearity assumptions, leading to potential inaccuracies. To address this, we propose Token Distribution Dynamics (TDD), a simple yet effective approach to unveil and manipulate the role of prompts in generating LLM outputs. TDD leverages the robust interpreting capabilities of the language model head (LM head) to assess input saliency. It projects input tokens into the embedding space and then estimates their significance based on distribution dynamics over the vocabulary. We introduce three TDD variants: forward, backward, and bidirectional, each…
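The projection step described in the abstract can be illustrated with a toy example. The following is a minimal sketch under stated assumptions, not the authors' implementation: it assumes the forward variant scores each prompt position by how much the LM-head probability gap between a target token and an alternative token changes once that position has been read, and that the gap before the first token is zero. The function names (`lm_head`, `forward_saliency`), the hand-picked hidden states, and the tiny weight matrix are all hypothetical stand-ins for a real model.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def lm_head(hidden, weights):
    """Project one hidden state onto the vocabulary (one weight row per token)."""
    return [sum(w * h for w, h in zip(row, hidden)) for row in weights]

def forward_saliency(hidden_states, weights, target_id, alt_id):
    """Score each position by the change in the target-vs-alternative
    probability gap after that position is read.
    Assumption: the gap before the first token is taken to be 0."""
    scores = []
    prev_gap = 0.0
    for hidden in hidden_states:
        probs = softmax(lm_head(hidden, weights))
        gap = probs[target_id] - probs[alt_id]
        scores.append(gap - prev_gap)
        prev_gap = gap
    return scores

# Toy data (hypothetical): 3 prompt positions, hidden size 2, vocabulary of 3 tokens.
hidden_states = [[0.1, 0.0], [1.0, -0.5], [1.2, -0.4]]
weights = [[2.0, 0.0], [0.0, 2.0], [-1.0, -1.0]]

saliency = forward_saliency(hidden_states, weights, target_id=0, alt_id=1)
final_probs = softmax(lm_head(hidden_states[-1], weights))
final_gap = final_probs[0] - final_probs[1]
print(saliency)
```

Under these assumptions the per-position scores telescope, so they sum to the final probability gap between the target and the alternative token. In a real setting the hidden states and LM-head weights would come from the model itself, and the choice of alternative token matters, a point Reviewer 03 raises below.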

Peer Reviews

Decision · ICLR 2024 poster

Reviewer 01 · Rating 6 (marginally above the acceptance threshold) · Confidence 3

Strengths

- The idea of analyzing the token distributions throughout the progression of prediction is quite interesting. The idea is simple but seems to be quite useful in unveiling the importance of input tokens when providing contrastive explanations.
- The authors did experiments over a fairly comprehensive set of language models including GPT-2/J, BLOOM, and LLaMA.
- The applications on toxic language suppression and sentiment steering further demonstrate the usefulness of the proposed TDD method.

Weaknesses

- One simple way to control LLMs' generations is via prompting for style transfer, e.g., ask the model to transform the outputs into "less toxic" content, or "positive/negative" sentiment. How would this baseline compare to the proposed TDD?
- The models used in experiments are relatively smaller models (maximum size 7B), how would the proposed approach work on larger models (e.g., llama-2 13B, 70B)? What is the computation cost (efficiency, memory) for running TDD over larger models?
- The pr

Reviewer 02 · Rating 8 (accept, good paper) · Confidence 4

Strengths

- The proposed method is a simple and efficient method
- In general, the paper is well-written, with a clear introduction method and description of experiments
- Evaluation of multiple autoregressive models
- The authors seem to provide all the necessary details for reproducible experiments
- Additional showcasing on interesting and important use cases

Weaknesses

- The difference between the TDD variants is not well discussed. While the applications are very interesting, the authors could have used the space to elaborate on the differences between the introduced variants.
- Captions Table 1 and Table 3 are too sparse.
- While not being a contrastive XAI method, important related work on explainability for autoregressive LMs missing: ATMAN: Understanding Transformer Predictions Through Memory Efficient Attention Manipulation. Björn Deiseroth, Mayukh Deb

Reviewer 03 · Rating 6 (marginally above the acceptance threshold) · Confidence 4

Strengths

- The method is simple and efficient.
- The analysis towards understanding causal mechanisms of token probability distributions is a timely and important research topic.

Weaknesses

- The TDD methods heavily depend on the alternative word $w_a$. The datasets chosen in this paper (those BLiMP datasets) provide sentence pairs with exactly one-word differences. Other datasets may not have such well-defined alternative words. This greatly limits the potential applicability of the proposed TDD methods.
- A related point: it is unclear to me how the $w_a$ in the Section 5 experiments are identified.
- The analyses presented in this paper are not really causal analyses, despite

Code & Models

Repositories

zijian678/tdd · PyTorch · Official

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Speech and dialogue systems