Word Importance Explains How Prompts Affect Language Model Outputs

Stefan Hackmann; Haniyeh Mahmoudian; Mark Steadman; Michael Schmidt

arXiv:2403.03028·cs.AI·March 6, 2024·1 cites

Open Access · 3 Reviews

TL;DR

This paper introduces a method to explain how individual words in prompts influence large language model outputs by measuring their statistical impact, enhancing transparency and interpretability of LLMs.

Contribution

The study proposes a novel word importance measure based on permutation importance, applicable even without attention weights, to analyze prompt effects on LLM outputs.

Findings

01

Word importance scores correlate with suffix importance across models.

02

The method works with various scoring functions.

03

It improves understanding of prompt influence on LLM behavior.
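As an illustration of what a "scoring function" can look like here, the following is a minimal sketch of two text scores that the reviews on this page mention (word count and Flesch reading ease). The function names and the vowel-group syllable heuristic are assumptions made for this example, not the authors' code. The Flesch formula itself is the standard one: 206.835 - 1.015 × (words per sentence) - 84.6 × (syllables per word).

```python
import re


def count_syllables(word: str) -> int:
    """Rough heuristic (an assumption): one syllable per run of vowels, minimum 1."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))


def flesch_reading_ease(text: str) -> float:
    """Standard Flesch reading ease formula with a heuristic syllable count."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        return 0.0
    syllables = sum(count_syllables(w) for w in words)
    return (
        206.835
        - 1.015 * (len(words) / len(sentences))
        - 84.6 * (syllables / len(words))
    )


def word_count(text: str) -> float:
    """Number of whitespace-separated tokens, returned as a float score."""
    return float(len(text.split()))
```

Any function that maps an output text to a number could be used in the same way, which is what makes the approach independent of a particular metric.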

Abstract

The emergence of large language models (LLMs) has revolutionized numerous applications across industries. However, their "black box" nature often hinders the understanding of how they make specific decisions, raising concerns about their transparency, reliability, and ethical use. This study presents a method to improve the explainability of LLMs by varying individual words in prompts to uncover their statistical impact on the model outputs. This approach, inspired by permutation importance for tabular data, masks each word in the system prompt and evaluates its effect on the outputs based on the available text scores aggregated over multiple user inputs. Unlike classical attention, word importance measures the impact of prompt words on arbitrarily-defined text scores, which enables decomposing the importance of words into the specific measures of interest--including bias, reading…
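The masking procedure described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: `word_importance`, `toy_generate`, the `"_"` mask token, and the choice of mean absolute score difference as the aggregate are all hypothetical choices made for this example, and `toy_generate` is a stand-in for a real LLM call.

```python
from statistics import mean
from typing import Callable, Sequence


def word_importance(
    system_prompt: str,
    user_inputs: Sequence[str],
    generate: Callable[[str, str], str],
    score: Callable[[str], float],
    mask_token: str = "_",
) -> list[tuple[str, float]]:
    """Mask each system-prompt word in turn and measure the change in score.

    Importance here is the mean absolute difference between the score of the
    unmasked output and the masked output, averaged over all user inputs
    (an assumed aggregation, chosen for illustration).
    """
    words = system_prompt.split()
    baseline = [score(generate(system_prompt, u)) for u in user_inputs]
    results = []
    for i, word in enumerate(words):
        masked_prompt = " ".join(words[:i] + [mask_token] + words[i + 1:])
        masked = [score(generate(masked_prompt, u)) for u in user_inputs]
        delta = mean(abs(b - m) for b, m in zip(baseline, masked))
        results.append((word, delta))
    return results


def toy_generate(system_prompt: str, user_input: str) -> str:
    """Hypothetical stand-in for an LLM: answers briefly only when told to."""
    if "briefly" in system_prompt.split():
        return "Short answer."
    return "This is a much longer answer to: " + user_input


def word_count(text: str) -> float:
    """Example text score: number of whitespace-separated tokens."""
    return float(len(text.split()))


scores = word_importance(
    "Answer briefly please",
    ["What is AI?", "Define bias."],
    toy_generate,
    word_count,
)
```

With this toy model, only masking "briefly" changes the output length, so it is the only word that receives a non-zero importance for the word-count score. With a real model, each `generate` call would be a (possibly stochastic) LLM request, so the number of calls grows with the number of prompt words times the number of user inputs.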

Peer Reviews

Decision·Submitted to ICLR 2024

Reviewer 01 · Rating 1 (strong reject) · Confidence 3

Strengths

1. This paper presents a method that masks each word in the system prompt and evaluates its effect on the outputs based on the available text scores aggregated over multiple user inputs.

Weaknesses

1. The contribution of the paper is limited. Similar topics have been investigated before, and this paper does not offer any more valuable conclusions.
2. The experiment section is terribly organized. No quantitative results are provided. The experiment design is very confusing and too specific.
3. The presentation is really bad.
   a. All the figures are poorly illustrated. There is even an untitled algorithm diagram before Section 4.
   b. All the tables are also hasty and careless.

Reviewer 02 · Rating 3 (reject, not good enough) · Confidence 3

Strengths

* The paper utilizes a common technique in NLP (word saliencies) and applies the concept of word importances to a recent LLM. Doing so can lead to informative insights into model interpretability as pointed out in the paper.

Weaknesses

* The dataset used for the experiment has been generated with an LLM. This is problematic since the dataset is biased towards generations from another LLM and does not necessarily reflect a distribution of human inputs. As such, the reported results do not necessarily hold true for human inputs. It would therefore be important to conduct experiments on a human-written dataset as well.
* The paper focuses substantially on an importance comparison between individual words and an instruction suffix

Reviewer 03 · Rating 3 (reject, not good enough) · Confidence 5

Strengths

The paper explores an interesting concept of word importance, which I find essential to further understanding how large language models like ChatGPT work. The proposed method has some potential, provided that it carefully addresses some of the very obvious limitations discussed below and further improves its algorithmic features to consider scale, flexibility, and efficiency.

Weaknesses

The depth of the experiments conducted in the study is extremely limited, as only three metrics have been explored: Flesch Reading Ease, word count, and topic similarity (cosine embedding). The model variation is also very limited, with only one model used for experimentation, GPT-3.5-Turbo (ChatGPT), despite the diverse publicly available models on Hugging Face such as Llama, FlanT5, and BLOOMZ. This implies that the study essentially optimizes for OpenAI products instead of prioritizing diverse re

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques