Analysing the Residual Stream of Language Models Under Knowledge Conflicts

Yu Zhao; Xiaotang Du; Giwon Hong; Aryo Pradipta Gema; Alessio Devoto; Hongru Wang; Xuanli He; Kam-Fai Wong; Pasquale Minervini

arXiv:2410.16090·cs.CL·February 11, 2025

TL;DR

This paper investigates how large language models internally detect and manage conflicts between stored knowledge and contextual information, enabling better understanding and control of their decision-making process.

Contribution

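The paper shows that LLMs internally register knowledge conflicts in the residual stream, so that a conflict can be detected by probing intermediate activations without modifying the model. It also shows that the residual stream exhibits different patterns depending on whether the model relies on contextual or parametric knowledge.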

Findings

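01. LLMs internally register knowledge conflicts in the residual stream, where probes on intermediate activations can detect them.

02. Residual-stream patterns differ depending on whether the model relies on contextual or parametric knowledge.

03. Conflicts can be detected before the answer is generated, without modifying the input or the model parameters.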

Abstract

Large language models (LLMs) can store a significant amount of factual knowledge in their parameters. However, their parametric knowledge may conflict with the information provided in the context. Such conflicts can lead to undesirable model behaviour, such as reliance on outdated or incorrect information. In this work, we investigate whether LLMs can identify knowledge conflicts and whether it is possible to know which source of knowledge the model will rely on by analysing the residual stream of the LLM. Through probing tasks, we find that LLMs can internally register the signal of knowledge conflict in the residual stream, which can be accurately detected by probing the intermediate model activations. This allows us to detect conflicts within the residual stream before generating the answers without modifying the input or model parameters. Moreover, we find that the residual stream…
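The probing idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the activations below are synthetic stand-ins for residual-stream vectors at a single layer, and the function names (`make_synthetic_activations`, `train_linear_probe`, `probe_accuracy`), the hidden size, and the assumption that the conflict signal lies along a single direction are all hypothetical choices made for illustration. The sketch only shows how a linear (logistic-regression) probe could be trained to separate "conflict" from "no conflict" activations.

```python
import numpy as np


def make_synthetic_activations(n=400, d=64, seed=0):
    """Hypothetical stand-in for residual-stream activations at one layer.

    Assumption: conflict examples are shifted along one random direction.
    Real activations would be read from the model, not sampled like this.
    """
    rng = np.random.default_rng(seed)
    labels = rng.integers(0, 2, size=n)  # 1 = conflict, 0 = no conflict
    direction = rng.normal(size=d)
    direction /= np.linalg.norm(direction)
    acts = rng.normal(size=(n, d)) + 6.0 * labels[:, None] * direction
    return acts, labels


def train_linear_probe(x, y, lr=0.1, steps=500):
    """Fit a logistic-regression probe with plain gradient descent."""
    w = np.zeros(x.shape[1])
    b = 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(x @ w + b)))  # predicted conflict probability
        grad = p - y                            # gradient of the log-loss w.r.t. logits
        w -= lr * (x.T @ grad) / len(y)
        b -= lr * grad.mean()
    return w, b


def probe_accuracy(x, y, w, b):
    """Fraction of examples whose thresholded logit matches the label."""
    preds = (x @ w + b) > 0
    return float((preds == y).mean())


acts, labels = make_synthetic_activations()
train_x, test_x = acts[:300], acts[300:]
train_y, test_y = labels[:300], labels[300:]

w, b = train_linear_probe(train_x, train_y)
test_acc = probe_accuracy(test_x, test_y, w, b)
print(f"held-out probe accuracy: {test_acc:.2f}")
```

In a real setting the rows of `acts` would be activations collected from a frozen model on prompts with and without a knowledge conflict, which is why this kind of probe requires no change to the input or the model parameters.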

Code & Models

Repositories

yuzhaouoe/sae-based-representation-engineering
jax

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling