On the Robustness of Transformers against Context Hijacking for Linear Classification

Tianle Li; Chenyang Zhang; Xingwu Chen; Yuan Cao; Difan Zou

arXiv:2502.15609 · cs.CL · February 24, 2025

TL;DR

This paper investigates the robustness of transformer models against context hijacking in linear classification, providing theoretical analysis and empirical evidence that deeper transformers are more resistant to interference from factually correct but misleading context tokens.

Contribution

It offers a theoretical framework explaining how transformer depth influences robustness to context hijacking, supported by numerical experiments and analysis of linear transformers.

Findings

1. Deeper transformers exhibit higher robustness to context hijacking.
2. Greater depth lets the model take more fine-grained optimization steps, which reduces interference from hijacking context tokens.
3. The theoretical analysis agrees with the empirical robustness improvements.
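One common way to read "more fine-grained optimization" comes from prior work on linear transformers, which has shown that a linear self-attention layer can emulate one gradient-descent step on an in-context least-squares loss. The sketch below assumes that reading and uses the number of GD steps as a stand-in for depth. The function `gd_depth_predict`, the fixed step size `lr`, and the zero initialization are illustrative assumptions. This is not the paper's trained model, its learned step sizes, or its exact loss.

```python
def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def gd_depth_predict(xs, ys, x_q, n_steps, lr=0.1):
    """Hypothetical depth proxy: n_steps of full-batch gradient descent on the
    in-context least-squares loss (1/2n) * sum_i (y_i - w . x_i)^2, starting
    from w = 0. Each step is assumed to stand in for one linear-attention layer.

    Returns (score, w), where score = w . x_q and sign(score) is the predicted label.
    """
    n, d = len(xs), len(x_q)
    w = [0.0] * d
    for _ in range(n_steps):
        # Residual of every context pair under the current weights.
        residuals = [y - dot(w, x) for x, y in zip(xs, ys)]
        # Negative gradient of the loss: (1/n) * sum_i r_i * x_i.
        neg_grad = [sum(r * x[j] for r, x in zip(residuals, xs)) / n for j in range(d)]
        w = [wj + lr * gj for wj, gj in zip(w, neg_grad)]
    return dot(w, x_q), w
```

With `n_steps=1` the score reduces to `lr / n * sum_i y_i * (x_i . x_q)`, a similarity-weighted vote in which context inputs that closely resemble the query carry the most weight. Each additional step refits the residuals, so for a small enough `lr` the weights approach a least-squares fit of the whole context rather than relying on raw similarity to the query.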

Abstract

Transformer-based Large Language Models (LLMs) have demonstrated powerful in-context learning capabilities. However, their predictions can be disrupted by factually correct context, a phenomenon known as context hijacking, revealing a significant robustness issue. To understand this phenomenon theoretically, we explore an in-context linear classification problem based on recent advances in linear transformers. In our setup, context tokens are designed as factually correct query-answer pairs, where the queries are similar to the final query but have opposite labels. Then, we develop a general theoretical analysis on the robustness of the linear transformers, which is formulated as a function of the model depth, training context lengths, and number of hijacking context tokens. A key finding is that a well-trained deeper transformer can achieve higher robustness, which aligns with…
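To make the setup concrete, here is a minimal sketch of a prompt in which every context pair is factually correct but the hijacking pairs resemble the query and carry the opposite label. The construction is an assumption for illustration, not the paper's data distribution. Clean inputs are standard Gaussian and labeled by `sign(w_star . x)`. Each hijacking input keeps the query's component parallel to the decision boundary, sits a small `margin` on the other side of it, and receives its true label. The names `make_hijacked_context`, `margin`, and `noise` are hypothetical.

```python
import math
import random


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def sign(v):
    """Label in {-1, +1}; ties (v == 0) map to +1."""
    return 1 if v >= 0 else -1


def make_hijacked_context(w_star, n_clean, n_hijack, margin=0.1, noise=0.05, seed=0):
    """Hypothetical generator for one in-context linear classification prompt.

    Returns (xs, ys, x_q, y_q). The first n_clean pairs are Gaussian inputs
    labelled by sign(w_star . x). The last n_hijack pairs are the assumed
    hijacking tokens: they share the query's component orthogonal to w_star,
    lie `margin` on the opposite side of the boundary, and carry their true
    label, which is therefore -y_q.
    """
    rng = random.Random(seed)
    d = len(w_star)
    norm = math.sqrt(dot(w_star, w_star))
    u = [w / norm for w in w_star]  # unit normal of the decision boundary

    def gaussian():
        return [rng.gauss(0.0, 1.0) for _ in range(d)]

    def drop_u(v):
        # Remove the component of v along u (project onto the boundary plane).
        c = dot(v, u)
        return [vi - c * ui for vi, ui in zip(v, u)]

    x_q = gaussian()
    y_q = sign(dot(w_star, x_q))
    x_perp = drop_u(x_q)  # the part of the query the hijacking tokens copy

    xs, ys = [], []
    for _ in range(n_clean):
        x = gaussian()
        xs.append(x)
        ys.append(sign(dot(w_star, x)))

    for _ in range(n_hijack):
        # Jitter stays in the boundary plane, so it cannot flip the label.
        jitter = drop_u([noise * g for g in gaussian()])
        x_h = [p + j - y_q * margin * ui for p, j, ui in zip(x_perp, jitter, u)]
        xs.append(x_h)
        ys.append(sign(dot(w_star, x_h)))  # factually correct label, equal to -y_q
    return xs, ys, x_q, y_q
```

For example, `make_hijacked_context([1.0, -2.0, 0.5, 0.0], n_clean=20, n_hijack=5)` yields 25 correctly labeled context pairs, the last five of which all carry the label opposite to the query's. Projecting the jitter onto the boundary plane is a design choice that keeps every hijacking label fixed at `-y_q` regardless of the noise level.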
