Mitigate Position Bias in Large Language Models via Scaling a Single Dimension
Yijiong Yu, Huiqiang Jiang, Xufang Luo, Qianhui Wu, Chin-Yew Lin, Dongsheng Li, Yuqing Yang, Yongfeng Huang, Lili Qiu

TL;DR
This paper identifies position bias in large language models, especially in long-context scenarios, and proposes a simple scaling method of positional hidden states to mitigate this bias, improving performance across multiple tasks.
Contribution
The paper introduces a novel approach to reduce position bias in LLMs by scaling a single dimension of positional hidden states, demonstrating significant performance gains.
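As a rough illustration of what "scaling a single dimension" of hidden states could look like, here is a minimal sketch in plain Python. The function name `scale_dimension`, the dimension index, the scaling factor, and the toy values are all hypothetical; this summary does not say which dimension, factor, tokens, or layers the authors actually use.

```python
def scale_dimension(hidden_states, dim, factor):
    """Return a copy of hidden_states with one feature dimension multiplied by factor.

    hidden_states: list of per-token feature vectors (lists of floats).
    dim, factor: illustrative assumptions, not values taken from the paper.
    """
    return [
        [value * factor if d == dim else value for d, value in enumerate(token)]
        for token in hidden_states
    ]


# Toy example: two tokens with three features each; dimension 1 is treated
# as the hypothetical "positional" dimension and is halved.
hidden = [[0.5, -2.0, 1.0], [0.1, -4.0, 0.3]]
scaled = scale_dimension(hidden, dim=1, factor=0.5)
```

Only the chosen dimension changes and the input is left untouched, which is the sense in which such an intervention would be cheap compared with retraining or modifying the position embeddings.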
Findings
Improves model performance by up to 15.2%
Attention weights reflect position bias at the micro-level
Method is effective across various models and tasks
Abstract
Large Language Models (LLMs) are increasingly applied in various real-world scenarios due to their excellent generalization capabilities and robust generative abilities. However, they exhibit position bias, also known as "lost in the middle", a phenomenon that is especially pronounced in long-context scenarios: the placement of key information at different positions in a prompt can significantly affect accuracy. This paper first explores the micro-level manifestations of position bias, concluding that attention weights are a micro-level expression of position bias. It further identifies that, in addition to position embeddings, the causal attention mask also contributes to position bias by creating position-specific hidden states. Based on these insights, we propose a method to mitigate position bias by scaling these positional hidden states. Experiments on the…
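The claim that a causal mask alone can produce position-specific hidden states can be illustrated with a toy calculation. The sketch below is not the paper's experiment: it assumes a single attention head with all-zero logits (uniform weights over the visible prefix), no position embeddings, and scalar "values" in which only the first token is distinct. The function name and inputs are hypothetical.

```python
def causal_uniform_attention(values):
    """Average each position's visible prefix.

    Under a causal mask with equal attention logits, position i attends
    uniformly to positions 0..i, so its output is the prefix mean.
    """
    outputs = []
    for i in range(len(values)):
        visible = values[: i + 1]  # causal mask: no access to later tokens
        outputs.append(sum(visible) / len(visible))
    return outputs


# Hypothetical input: the first token carries value 1.0, all others 0.0.
values = [1.0, 0.0, 0.0, 0.0, 0.0]
outputs = causal_uniform_attention(values)
# Position i (0-indexed) receives 1 / (i + 1), so the output shrinks
# monotonically with position even though no position embedding was used.
```

Because each output depends on how many tokens are visible, later layers could in principle read position off this signal, which is consistent with the abstract's statement that the mask contributes to position bias.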
Peer Reviews
Decision·Submitted to ICLR 2025
1. This paper studies position bias from the angle of hidden states, which I find to be interesting. 2. The experiment setup seems comprehensive, covering a wide range of models and tasks. 3. The analysis and insights could contribute to the understanding of position bias in LLMs.
1. The finding in Section 2.2 (the causal mask also contributes to position bias) seems trivial to me. The attention mechanism is a core component of Transformer models and naturally plays a significant role in model behavior. Modifying the attention mask so that the target token only sees itself naturally leads to drastically increased attention weight and KV retrieval performance. The link between Section 2.2 and the main hypothesis on position bias in hidden states seems very weak. 2. The auth
1. The method shows improvements across various models and tasks, indicating broad applicability. 2. By leveraging FlashAttention and modifying only one dimension, the method remains efficient, with minimal latency impact. 3. The method shows up to 15.2% improvement on position-sensitive benchmarks, suggesting it addresses bias effectively.
1. Only one hidden state dimension is scaled, which may not capture more nuanced, layer-specific positional dependencies. 2. The impact of position scaling seems to vary by task, suggesting that tuning may be required for optimal results in different contexts. 3. By focusing on single-dimension scaling, the model may become overly specialized to certain bias patterns rather than general long-context processing needs.
- The work is well-motivated and important. It matters to end-users that LLMs perform consistently regardless of the order in which information is presented. It is therefore very important to address the lost-in-the-middle problem. In addition, the authors provide a thorough exploration of how attention and causal masks are implicated in creating this order-biasing effect, which they then attempt to use to motivate their methodology. - The experimental setting is quite thorough,
- This paper suffers from a lack of clarity. Sections 1 and 2 are quite challenging to read, with too many references to the Appendices. Although I believe I now understand their method, it was a challenging process. I highly recommend editing the Figure 4 caption and the figure itself to make the process clearer, i.e. make it explicit that only the last token is scaled and this is why there are multiple colors in the figure. - The experiments shown in Section 2.2 are a bit problematic, changing
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Speech and dialogue systems
Methods: Rotary Position Embedding · Attention with Linear Biases
