Multi-Layer Attention is the Amplifier of Demonstration Effectiveness

Dingzirui Wang; Xuangliang Zhang; Keyan Xu; Qingfu Zhu; Wanxiang Che; Yang Deng

arXiv:2508.00385·cs.CL·August 4, 2025

TL;DR

This paper investigates why some demonstrations in in-context learning are ineffective, reveals that multi-layer models amplify differences in demonstration effectiveness, and introduces GradS, a gradient-based demonstration selection method that improves performance.

Contribution

The paper provides a theoretical analysis of demonstration effectiveness, shows how multi-layer models amplify effectiveness disparities, and proposes GradS for better demonstration selection based on gradient flow.

Findings

01. The disparity in effectiveness among demonstrations grows as the number of model layers increases.

02. GradS improves demonstration selection by leveraging gradient flow.

03. Experimental validation shows GradS outperforms baselines by 6.8% on average.

Abstract

Numerous studies have investigated the underlying mechanisms of in-context learning (ICL) effectiveness to inspire the design of related methods. However, existing work predominantly assumes that the demonstrations provided within ICL are effective, while much research indicates that not all demonstrations are effective: some fail to yield any performance improvement during ICL. Therefore, in this paper, we investigate the reasons behind demonstration ineffectiveness. Our analysis is based on gradient flow and linear self-attention models. By setting the gradient flow to zero, we deduce that a demonstration becomes ineffective if its information has either already been learned by the model or is irrelevant to the user query. Furthermore, we demonstrate that in multi-layer models, the disparity in effectiveness among demonstrations is amplified as the number of layers increases, causing the model to…
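The two conditions for ineffectiveness stated in the abstract can be illustrated with a toy example. The sketch below is not the paper's derivation and is not GradS. It assumes a linear regression setting with squared loss, in which a single gradient step on one demonstration stands in for the update that a linear self-attention layer would contribute. The function names, the weights, and the three demonstrations are hypothetical and chosen only for illustration.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def demo_contribution(w, x, y, x_query, lr=1.0):
    """Change in the query prediction after one gradient step on demo (x, y).

    Assumption: squared loss 0.5 * (y - w.x)^2, so the step is
    w += lr * (y - w.x) * x, and the query prediction w.x_query
    moves by lr * (y - w.x) * (x . x_query).
    """
    residual = y - dot(w, x)            # zero if the demo is already "learned"
    relevance = dot(x, x_query)         # zero if the demo is orthogonal to the query
    return lr * residual * relevance


# Hypothetical current weights and user query.
w = [1.0, 0.0]
x_query = [1.0, 0.0]

# Three hypothetical demonstrations (input, label).
demos = {
    "already_learned": ([2.0, 0.0], 2.0),  # model already predicts the label
    "irrelevant": ([0.0, 3.0], 5.0),       # input is orthogonal to the query
    "effective": ([1.0, 1.0], 3.0),        # nonzero residual and nonzero overlap
}

contributions = {
    name: demo_contribution(w, x, y, x_query) for name, (x, y) in demos.items()
}

for name, value in contributions.items():
    print(f"{name}: {value}")
```

In this toy setting, the contribution is a product of a residual term and a relevance term, so it vanishes when either factor is zero. Only the third demonstration changes the prediction for the query.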

Taxonomy

Topics: Domain Adaptation and Few-Shot Learning · Mobile Crowdsensing and Crowdsourcing · Information Retrieval and Search Behavior