CoCA: Fusing Position Embedding with Collinear Constrained Attention in Transformers for Long Context Window Extending
Shiyi Zhu, Jing Ye, Wei Jiang, Siqiao Xue, Qi Zhang, Yifan Wu, Jianguo Li

TL;DR
This paper introduces CoCA, a novel attention mechanism that enhances long context window extension in transformers by integrating position embedding with self-attention through a collinear constraint, achieving significant extrapolation improvements.
Contribution
The paper proposes CoCA, a new attention method that seamlessly combines RoPE and self-attention with minimal complexity, enabling transformers to extend context windows up to 32K without fine-tuning.
Findings
CoCA enables GPT models to extend context from 512 to 32K without fine-tuning.
Dropping CoCA into LLaMA-7B extends the context window to 32K with a training length of only 2K.
CoCA performs well as a drop-in replacement for existing transformer models.
Abstract
Self-attention and position embedding are two key modules in transformer-based Large Language Models (LLMs). However, the potential relationship between them is far from well studied, especially for long context window extending. In fact, anomalous behaviors harming long context extrapolation exist between Rotary Position Embedding (RoPE) and vanilla self-attention unveiled by our work. To address this issue, we propose a novel attention mechanism, CoCA (Collinear Constrained Attention). Specifically, we enforce a collinear constraint between Q and K to seamlessly integrate RoPE and self-attention. While only adding minimal computational and spatial complexity, this integration significantly enhances long context window extrapolation ability. We provide an optimized implementation, making it a drop-in replacement for any existing transformer-based models. Extensive experiments show…
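As a rough illustration of what a collinear constraint means under RoPE, the sketch below (pure Python, not the authors' implementation) builds a key whose 2D slices are non-negative multiples of the query's slices, applies the standard RoPE rotation to both, and checks that the attention score then reduces to a sum of cosine terms in the relative position with no extra initial angle. The helper names (`rope_rotate`, `collinear_key`, `closed_form_score`), the `scales` values, and the base of 10000 are assumptions made for this example; how the paper actually parameterizes and optimizes the constraint is not shown here.

```python
import math

def rope_rotate(vec, position, base=10000.0):
    """Rotate each consecutive 2D pair of `vec` by position * theta (standard RoPE)."""
    dim = len(vec)
    out = []
    for i in range(0, dim, 2):
        theta = base ** (-i / dim)  # frequency assigned to this 2D pair
        angle = position * theta
        x, y = vec[i], vec[i + 1]
        out.append(x * math.cos(angle) - y * math.sin(angle))
        out.append(x * math.sin(angle) + y * math.cos(angle))
    return out

def collinear_key(query, scales):
    """Hypothetical helper: each 2D slice of the key is t >= 0 times the query's slice."""
    key = []
    for pair_index, t in enumerate(scales):
        key.append(t * query[2 * pair_index])
        key.append(t * query[2 * pair_index + 1])
    return key

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def closed_form_score(query, scales, m, n, base=10000.0):
    """Score predicted when q and k are collinear: sum of t * |q_pair|^2 * cos((m - n) * theta)."""
    dim = len(query)
    total = 0.0
    for pair_index, t in enumerate(scales):
        i = 2 * pair_index
        theta = base ** (-i / dim)
        norm_sq = query[i] ** 2 + query[i + 1] ** 2
        total += t * norm_sq * math.cos((m - n) * theta)
    return total

# Illustrative values only (assumptions, not taken from the paper).
q = [1.0, 2.0, -0.5, 0.3]
scales = [0.7, 1.5]
k = collinear_key(q, scales)
m, n = 5, 12  # query and key positions

rope_score = dot(rope_rotate(q, m), rope_rotate(k, n))
predicted = closed_form_score(q, scales, m, n)
```

With collinear slices, `rope_score` and `predicted` agree up to floating-point error, so each 2D pair contributes a term that peaks at relative position zero. Without the constraint, each pair would carry an additional initial angle between its query and key slices.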
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Speech Recognition and Synthesis
Methods: Attention Is All You Need · Cosine Annealing · Discriminative Fine-Tuning · Linear Warmup With Cosine Annealing · Weight Decay · Attention Dropout · GPT · Softmax · Dense Connections
