Don't Pay Attention
Mohammad Hammoud, Devang Acharya

TL;DR
Avey is a novel neural architecture that efficiently processes arbitrarily long sequences by selecting relevant tokens, outperforming Transformers especially on long-range dependency tasks, while maintaining competitive short-range performance.
Contribution
Avey introduces a new architecture that decouples sequence length from context, enabling efficient long-range processing without attention or recurrence.
Findings
- Avey outperforms Transformers on long-range dependency tasks.
- Avey matches Transformer performance on short-range NLP benchmarks.
- Avey processes arbitrarily long sequences efficiently.
Abstract
The Transformer has become the de facto standard for modern language models owing to its parallelizable training and effective autoregressive decoding. However, its fixed context window and the quadratic time and memory costs of its self-attention mechanism remain central bottlenecks. These constraints have revived interest in recurrent architectures that scale linearly with sequence length, but at the cost of reduced parallelism. In this paper, we introduce Avey, a new foundational architecture that breaks away from both attention and recurrence. Avey pairs a ranker with an autoregressive neural processor to select and contextualize only the most relevant tokens for any given token. Specifically, it decouples sequence length from context width, thus enabling effective and efficient processing of arbitrarily long sequences. Results show that Avey compares favorably to the Transformer…
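The idea of selecting only the most relevant tokens for a given token can be illustrated with a toy example. The sketch below is not Avey's ranker, whose scoring function this excerpt does not describe. It assumes cosine similarity over hypothetical 2-dimensional embeddings, and the names `cosine` and `select_context` are invented for illustration. It shows only how a fixed budget `k` keeps the context width constant no matter how many tokens precede the current one.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def select_context(embeddings, position, k):
    """Return the indices of up to k earlier tokens most similar to the token
    at `position`, in their original order.

    Assumption: relevance is plain cosine similarity. This is a stand-in
    scoring rule, not the one used by Avey.
    """
    query = embeddings[position]
    scored = [(cosine(query, embeddings[i]), i) for i in range(position)]
    # Highest score first; ties go to the more recent token.
    top = sorted(scored, key=lambda s: (-s[0], -s[1]))[:k]
    return sorted(i for _, i in top)


# Hypothetical toy embeddings for a five-token sequence.
embeddings = [
    [1.0, 0.0],
    [0.0, 1.0],
    [0.9, 0.1],
    [0.1, 0.9],
    [1.0, 0.1],
]

# Context for the last token, limited to the 2 most similar earlier tokens.
selected = select_context(embeddings, position=4, k=2)
print(selected)  # [0, 2]
```

Whatever processes the selected tokens then sees at most `k` context tokens plus the current one, which is one way to read the abstract's claim that sequence length is decoupled from context width. Note that this toy ranker still scans every earlier token to compute scores.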
Code & Models
- 🤗 avey-ai/avey1-dpa-0.1B-100BT (model)
- 🤗 avey-ai/avey1-dpa-0.1B-90BT (model)
- 🤗 avey-ai/avey1-dpa-0.1B-95BT (model)
- 🤗 avey-ai/avey1-dpa-0.5B-100BT (model)
- 🤗 avey-ai/avey1-dpa-0.5B-90BT (model, 1 download)
- 🤗 avey-ai/avey1-dpa-1.5B-100BT (model)
- 🤗 avey-ai/avey1-dpa-1.5B-90BT (model, 4 downloads)
- 🤗 avey-ai/avey1-dpa-1.5B-95BT (model)
- 🤗 avey-ai/mamba-dpa-0.1B-100BT (model, 1 download)
- 🤗 avey-ai/mamba-dpa-0.1B-95BT (model)
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Speech Recognition and Synthesis
Methods: Dropout · Dense Connections · Absolute Position Encodings · Layer Normalization · Byte Pair Encoding · Label Smoothing · Softmax · Transformer
