Mitigate Position Bias in Large Language Models via Scaling a Single Dimension

Yijiong Yu; Huiqiang Jiang; Xufang Luo; Qianhui Wu; Chin-Yew Lin; Dongsheng Li; Yuqing Yang; Yongfeng Huang; Lili Qiu

arXiv:2406.02536·cs.CL·May 26, 2025·2 cites

TL;DR

This paper analyzes position bias in large language models, which is especially pronounced in long-context scenarios, and proposes mitigating it by scaling the positional hidden states, improving performance across multiple tasks.

Contribution

The paper introduces an approach that reduces position bias in LLMs by scaling a single dimension of the positional hidden states, with reported performance gains of up to 15.2%.
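To make the idea of "scaling a single dimension" concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the function name `scale_hidden_dimension`, the chosen dimension index, the scale factor, and the choice of which tokens to modify are all illustrative assumptions, since this page does not state how the paper selects them.

```python
def scale_hidden_dimension(hidden_states, dim_index, scale, token_indices=None):
    """Return a copy of hidden_states with one dimension multiplied by `scale`.

    hidden_states: list of per-token vectors (list of lists of floats).
    dim_index:     index of the single dimension to scale (assumed, hypothetical).
    scale:         multiplicative factor (assumed, hypothetical).
    token_indices: tokens to modify; None means every token.
    """
    if token_indices is None:
        targets = set(range(len(hidden_states)))
    else:
        targets = set(token_indices)

    scaled = []
    for t, vec in enumerate(hidden_states):
        new_vec = list(vec)  # copy so the input is left untouched
        if t in targets:
            new_vec[dim_index] = new_vec[dim_index] * scale
        scaled.append(new_vec)
    return scaled


# Toy example: 3 tokens, 3 hidden dimensions; scale dimension 1 of the last token only.
hidden = [[0.5, -1.0, 2.0], [0.1, -2.0, 1.5], [0.3, -3.0, 1.0]]
scaled = scale_hidden_dimension(hidden, dim_index=1, scale=0.5, token_indices=[2])
```

Only one coordinate of the selected token vectors changes; every other value is passed through unchanged, which is why such an intervention can be cheap at inference time.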

Findings

01. Improves model performance by up to 15.2%

02. Attention weights reflect position bias at the micro-level

03. Method is effective across various models and tasks

Abstract

Large Language Models (LLMs) are increasingly applied in various real-world scenarios due to their excellent generalization capabilities and robust generative abilities. However, they exhibit position bias, also known as "lost in the middle", a phenomenon that is especially pronounced in long-context scenarios: the position of key information within a prompt can significantly affect accuracy. This paper first explores the micro-level manifestations of position bias, concluding that attention weights are a micro-level expression of position bias. It further identifies that, in addition to position embeddings, the causal attention mask also contributes to position bias by creating position-specific hidden states. Based on these insights, we propose a method to mitigate position bias by scaling these positional hidden states. Experiments on the…
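The claim that a causal mask alone can create position-specific hidden states can be illustrated with a toy example. The sketch below is an assumption-laden illustration, not the paper's experiment: all attention scores are set equal (so no position embedding or content preference is involved), and the helper name `attention_weights` and the toy values are hypothetical.

```python
import math


def attention_weights(scores, causal):
    """Row-wise softmax over attention scores.

    With causal=True, position i may only attend to positions 0..i;
    masked positions receive weight 0.0.
    """
    n = len(scores)
    weights = []
    for i in range(n):
        visible_len = i + 1 if causal else n
        visible = scores[i][:visible_len]
        m = max(visible)  # subtract the max for numerical stability
        exps = [math.exp(s - m) for s in visible]
        total = sum(exps)
        weights.append([e / total for e in exps] + [0.0] * (n - visible_len))
    return weights


def attend(weights, values):
    """Weighted sum of scalar values for each query position."""
    return [sum(w * v for w, v in zip(row, values)) for row in weights]


n = 4
scores = [[0.0] * n for _ in range(n)]  # identical scores: no positional signal in the inputs
values = [1.0, 2.0, 3.0, 4.0]           # toy scalar "value vectors"

causal_out = attend(attention_weights(scores, causal=True), values)
full_out = attend(attention_weights(scores, causal=False), values)
```

Without the mask, every position averages over the same four values and produces the same output. With the mask, position `i` averages over only the first `i + 1` values, so the output depends on the position even though no positional information was supplied.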

Peer Reviews

Decision·Submitted to ICLR 2025

Reviewer 01 · Rating 3 · Confidence 3

Strengths

1. This paper studies position bias from the angle of hidden states, which I find to be interesting.
2. The experiment setup seems comprehensive, covering a wide range of models and tasks.
3. The analysis and insights could contribute to the understanding of position bias in LLMs.

Weaknesses

1. The finding in Section 2.2 (Causal mask also contributes to position bias) seems trivial to me. The attention mechanism is a core component of Transformer models and naturally plays a significant role in model behaviors. Modifying the attention mask so that the target token sees only itself naturally leads to drastically increased attention weight and KV retrieval performance. The link between Section 2.2 and the main hypothesis on position bias in hidden states seems very weak.
2. The auth

Reviewer 02 · Rating 5 · Confidence 4

Strengths

1. The method shows improvements across various models and tasks, indicating broad applicability.
2. By leveraging FlashAttention and modifying only one dimension, the method remains efficient, with minimal latency impact.
3. The method shows up to 15.2% improvement on position-sensitive benchmarks, suggesting it addresses bias effectively.

Weaknesses

1. Only one hidden state dimension is scaled, which may not capture more nuanced, layer-specific positional dependencies.
2. The impact of position scaling seems to vary by task, suggesting that tuning may be required for optimal results in different contexts.
3. By focusing on single-dimension scaling, the model may become overly specialized to certain bias patterns rather than general long-context processing needs.

Reviewer 03 · Rating 5 · Confidence 3

Strengths

- The work is well-motivated and important. It is important for end-users that LLMs perform consistently regardless of the order in which information is presented to them. It is therefore very important to address the lost-in-the-middle problem. In addition, the authors provide a thorough exploration of how attention and causal masks are implicated in creating this order-biasing effect, which they then attempt to use to motivate their methodology.
- The experimental setting is quite thorough,

Weaknesses

- This paper suffers from a lack of clarity. Sections 1 and 2 are quite challenging to read, with too many references to the Appendices. Although I believe I now understand their method, it was a challenging process. I highly recommend editing the Figure 4 caption and the figure itself to make the process clearer, i.e. make it explicit that only the last token is scaled and this is why there are multiple colors in the figure.
- The experiments shown in Section 2.2 are a bit problematic, changing

Code & Models

Repositories

PositionalHidden/PositionalHidden
PyTorch · Official

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Speech and dialogue systems

Methods: Rotary Position Embedding · Attention with Linear Biases