CoCA: Fusing Position Embedding with Collinear Constrained Attention in Transformers for Long Context Window Extending

Shiyi Zhu; Jing Ye; Wei Jiang; Siqiao Xue; Qi Zhang; Yifan Wu; Jianguo Li

arXiv:2309.08646 · cs.LG · February 29, 2024

TL;DR

This paper introduces CoCA, a novel attention mechanism that enhances long context window extension in transformers by integrating position embedding with self-attention through a collinear constraint, achieving significant extrapolation improvements.

Contribution

The paper proposes CoCA, a new attention method that seamlessly combines RoPE and self-attention with minimal complexity, enabling transformers to extend context windows up to 32K without fine-tuning.

Findings

1. CoCA enables GPT models to extend their context window from 512 to 32K without fine-tuning.

2. Dropping CoCA into LLaMA-7B achieves context extension up to 32K with a training length of only 2K.

3. CoCA performs well as a drop-in replacement for existing transformer models.

Abstract

Self-attention and position embedding are two key modules in transformer-based Large Language Models (LLMs). However, the potential relationship between them is far from well studied, especially for long context window extending. In fact, our work reveals anomalous behaviors between Rotary Position Embedding (RoPE) and vanilla self-attention that harm long context extrapolation. To address this issue, we propose a novel attention mechanism, CoCA (Collinear Constrained Attention). Specifically, we enforce a collinear constraint between $Q$ and $K$ to seamlessly integrate RoPE and self-attention. While adding only minimal computational and spatial complexity, this integration significantly enhances long context window extrapolation ability. We provide an optimized implementation, making it a drop-in replacement for any existing transformer-based models. Extensive experiments show…
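
To make the collinear constraint concrete, the sketch below checks one geometric consequence in NumPy. It is an illustration under stated assumptions, not the authors' optimized implementation: the helper names `rope_rotate` and `collinear_key` are hypothetical, the frequency schedule is the usual RoPE one with base 10000, and "collinear" is taken to mean that each 2D key pair is a non-negative multiple of the matching query pair. Under those assumptions, the RoPE score in each 2D subspace reduces to $|q||k|\cos((m-n)\omega)$, with no initial angle offset between query and key.

```python
import numpy as np

def rope_rotate(pairs, position, freqs):
    """Rotate each 2D pair (rows of shape [d/2, 2]) by the angle position * freqs[i]."""
    angles = position * freqs
    cos, sin = np.cos(angles), np.sin(angles)
    x, y = pairs[:, 0], pairs[:, 1]
    return np.stack([x * cos - y * sin, x * sin + y * cos], axis=1)

def collinear_key(q_pairs, scales):
    """Hypothetical helper: build key pairs as non-negative multiples of the query pairs."""
    return q_pairs * np.maximum(scales, 0.0)[:, None]

rng = np.random.default_rng(0)
half_dim = 4  # number of 2D rotary subspaces (assumed toy size)
freqs = 10000.0 ** (-np.arange(half_dim) / half_dim)  # usual RoPE frequencies

q = rng.normal(size=(half_dim, 2))
k = collinear_key(q, rng.normal(size=half_dim))

m, n = 7, 3  # query and key positions
score = np.sum(rope_rotate(q, m, freqs) * rope_rotate(k, n, freqs))

# With zero initial angle, each subspace contributes |q||k| cos((m - n) * freq).
expected = np.sum(
    np.linalg.norm(q, axis=1) * np.linalg.norm(k, axis=1) * np.cos((m - n) * freqs)
)
print(np.isclose(score, expected))
```

With an unconstrained key, each subspace would carry an extra phase term equal to the initial angle between the query and key pairs; the constraint sketched here removes that term.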

Code & Models

Repositories

codefuse-ai/Collinear-Constrained-Attention
PyTorch · Official

Videos

CoCA: Fusing Position Embedding with Collinear Constrained Attention in Transformers for Long Context Window Extending

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Speech Recognition and Synthesis

Methods: Attention Is All You Need · Cosine Annealing · Discriminative Fine-Tuning · Linear Warmup With Cosine Annealing · Weight Decay · Attention Dropout · GPT · Softmax · Dense Connections