Inference-Time Intervention: Eliciting Truthful Answers from a Language Model

Kenneth Li; Oam Patel; Fernanda Viégas; Hanspeter Pfister; Martin Wattenberg

arXiv:2306.03341·cs.LG·June 27, 2024·39 cites



TL;DR

This paper presents Inference-Time Intervention (ITI), a minimally invasive method that shifts model activations during inference to significantly improve the truthfulness of large language models, with minimal data and computational costs.

Contribution

The paper introduces ITI, a novel technique for enhancing LLM truthfulness during inference by adjusting activations, requiring minimal data and computational resources.

Findings

01

On Alpaca, an instruction-finetuned LLaMA, ITI improves truthfulness on TruthfulQA from 32.5% to 65.1%.

02

ITI is minimally invasive and computationally inexpensive.

03

A tradeoff exists between truthfulness and helpfulness, tunable via intervention strength.

Abstract

We introduce Inference-Time Intervention (ITI), a technique designed to enhance the "truthfulness" of large language models (LLMs). ITI operates by shifting model activations during inference, following a set of directions across a limited number of attention heads. This intervention significantly improves the performance of LLaMA models on the TruthfulQA benchmark. On an instruction-finetuned LLaMA called Alpaca, ITI improves its truthfulness from 32.5% to 65.1%. We identify a tradeoff between truthfulness and helpfulness and demonstrate how to balance it by tuning the intervention strength. ITI is minimally invasive and computationally inexpensive. Moreover, the technique is data efficient: while approaches like RLHF require extensive annotations, ITI locates truthful directions using only a few hundred examples. Our findings suggest that LLMs may have an internal representation of the…
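The mechanism the abstract describes (shift the outputs of a few attention heads along fixed directions, with a tunable strength) can be illustrated with a small NumPy sketch. This is not the authors' implementation. The function names `estimate_direction` and `intervene`, the use of a difference of class means to pick the direction, and the plain `alpha * direction` shift are all assumptions made for illustration; the official code lives in the likenneth/honest_llama repository.

```python
import numpy as np


def estimate_direction(acts, labels):
    """Hypothetical helper: unit vector pointing from the mean activation of
    'false' examples to the mean activation of 'true' examples.
    (Assumption: a difference of class means is one simple way to pick a
    direction; the abstract does not say how directions are found.)"""
    acts = np.asarray(acts, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    diff = acts[labels].mean(axis=0) - acts[~labels].mean(axis=0)
    return diff / np.linalg.norm(diff)


def intervene(head_outputs, directions, alpha):
    """Hypothetical helper: shift only the selected heads.

    head_outputs: array of shape (num_heads, head_dim) for one token.
    directions:   dict mapping head index -> unit vector of shape (head_dim,).
    alpha:        intervention strength; 0 leaves the activations unchanged.
    """
    shifted = np.array(head_outputs, dtype=float, copy=True)
    for head, direction in directions.items():
        shifted[head] += alpha * direction
    return shifted


# Toy data: four labelled activations for a single 2-dimensional head.
acts = [[1.0, 0.0], [3.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
labels = [True, True, False, False]
direction = estimate_direction(acts, labels)

# Intervene on head 1 out of 3 heads; the other heads are left untouched.
head_outputs = np.zeros((3, 2))
shifted = intervene(head_outputs, {1: direction}, alpha=5.0)
```

Raising `alpha` pushes the selected heads further along their directions, which is the knob the abstract refers to when it talks about balancing truthfulness against helpfulness.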


Code & Models

Repositories

likenneth/honest_llama
PyTorch · Official


Videos

Inference-Time Intervention: Eliciting Truthful Answers from a Language Model · SlidesLive

Taxonomy

Topics: Topic Modeling · Multimodal Machine Learning Applications · Natural Language Processing Techniques