PaTH Attention: Position Encoding via Accumulating Householder Transformations
Songlin Yang, Yikang Shen, Kaiyue Wen, Shawn Tan, Mayank Mishra, Liliang Ren, Rameswar Panda, Yoon Kim

TL;DR
PaTH is a data-dependent position encoding method built on Householder transformations. It is more expressive than traditional RoPE and improves language-model performance, comes with efficient training algorithms, and can be adopted by pretrained models.
Contribution
The paper proposes PaTH, a novel position encoding scheme based on accumulated Householder transformations that is data-dependent and more expressive than existing methods like RoPE.
Findings
PaTH outperforms RoPE and recent baselines in language modeling benchmarks.
Efficient parallel algorithms enable practical training of PaTH.
Pretrained RoPE models can be converted to PaTH with continued pretraining.
Abstract
The attention mechanism is a core primitive in modern large language models (LLMs) and AI more broadly. Since attention by itself is permutation-invariant, position encoding is essential for modeling structured domains such as language. Rotary position encoding (RoPE) has emerged as the de facto standard approach for position encoding and is part of many modern LLMs. However, in RoPE the key/query transformation between two elements in a sequence is only a function of their relative position and otherwise independent of the actual input. This limits the expressivity of RoPE-based transformers. This paper describes PaTH, a flexible data-dependent position encoding scheme based on accumulated products of Householder(like) transformations, where each transformation is data-dependent, i.e., a function of the input. We derive an efficient parallel algorithm for training through exploiting…
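To make the contrast with RoPE concrete, the following is a minimal sketch of attention logits under an accumulated product of Householder-like matrices. It is not the authors' parallel training algorithm: it is a naive reference loop, and several details are assumptions not stated in this summary. These include the form `H_t = I - beta_t * w_t w_t^T` with `w_t` normalized to unit length, the product order, and the function names `householder_like` and `path_logits`. The vectors `ws` and scalars `betas` are passed in directly here; in a real model they would be computed from the input at each position, which is what makes the encoding data-dependent.

```python
import math


def householder_like(w, beta):
    """Build H = I - beta * u u^T, where u is w scaled to unit length (assumed form)."""
    norm = math.sqrt(sum(x * x for x in w))
    u = [x / norm for x in w]
    d = len(u)
    return [[(1.0 if r == c else 0.0) - beta * u[r] * u[c] for c in range(d)]
            for r in range(d)]


def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def path_logits(queries, keys, ws, betas):
    """Naive causal logits: logits[i][j] = q_i^T H_i H_{i-1} ... H_{j+1} k_j for j <= i.

    Entries with j > i are left as None (causal masking).
    """
    n = len(queries)
    hs = [householder_like(w, b) for w, b in zip(ws, betas)]
    logits = [[None] * n for _ in range(n)]
    for j in range(n):
        v = list(keys[j])  # running product applied to k_j; empty product when i == j
        for i in range(j, n):
            if i > j:
                v = matvec(hs[i], v)  # left-multiply by H_i
            logits[i][j] = dot(queries[i], v)
    return logits


# Same queries and keys, same relative distance (2), different intermediate inputs.
q = [[1.0, 0.0]] * 3
k = [[1.0, 0.0]] * 3
betas = [2.0, 2.0, 2.0]  # beta = 2 with a unit vector gives an exact reflection
same_ws = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]
mixed_ws = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
logits_same = path_logits(q, k, same_ws, betas)
logits_mixed = path_logits(q, k, mixed_ws, betas)
```

In this sketch, `logits_same[2][0]` and `logits_mixed[2][0]` differ even though both pairs sit two positions apart, because the matrices between the two positions depend on the `ws` at those positions. A RoPE-style transformation would give the same value for both, since it depends only on the relative offset. Setting every `beta` to 0 makes each matrix the identity and recovers plain dot-product logits.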
Peer Reviews
Decision: NeurIPS 2025 poster
Strengths:
- I'm impressed by the overall integration of conceptual results with real-world impact, showing benefits on relevant benchmarks in language modeling and long-context tasks at scale, while also showing clock-time speed gains.
- Elegant idea with the identity plus rank-1 structures
- Convincing mix of evaluations
Weaknesses:
- (minor) A bit unclear on the practical implications of NC^1 vs. TC^0 (though I do appreciate the synthetic task and proofs)
- (minor) NoPE is not included as an

# Strength
- Their proposed PaTH attention is data-dependent, solves problems of existing softmax attention with RoPE.
- They also showed a hardware-efficient kernel for inference and training the PaTH attention, which is a very impressive part.
# Weakness
- This method is not directly compatible with an ordinary RoPE-based attention model, so we cannot apply PaTH attention in a plug-and-play manner. I think at least we should try to transfer training (like this: https://arxiv.org/html/2310.017

Strengths:
1. The authors introduce a novel data-dependent position encoding method and provide a hardware-efficient implementation. This represents an innovation in the field.
2. The authors offer theoretical proof that the newly designed method can effectively address the state tracking issue. This has enlightening implications for enhancing the capabilities of existing models.
3. Through multiple experiments, the authors have demonstrated the effectiveness of the proposed method.
Weaknesses

Strengths
1. The paper addresses the important problem of improving positional encoding in transformer models. Instead of using the fixed RoPE encoding, it introduces a novel learnable multiplicative positional encoding scheme that operates between each token pair.
2. To ensure both expressiveness and computational efficiency, the method employs a Householder-like matrix with an identity-plus-rank-one structure, along with an efficient training pipeline supported by custom Triton kernels.
3. T
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications
Methods: Softmax · Attention Is All You Need
