Efficient Context Propagating Perceiver Architectures for Auto-Regressive Language Modeling

Kaleel Mahmood; Shaoyi Huang

arXiv:2412.06106·cs.CL·February 24, 2026


Open Access

TL;DR

This paper introduces the Efficient Context propagating Perceiver (ECP), a novel architecture that improves long-sequence language modeling by reducing attention complexity while maintaining high performance, outperforming existing models on multiple benchmarks.

Contribution

The paper develops four new Perceiver-based architectures. The best performing of these, ECP, overcomes key limitations of prior models by efficiently combining the context and latent sequences.

Findings

01. ECP significantly outperforms state-of-the-art Transformer models on Wikitext-103.

02. ECP operates with the same attention complexity as LongLoRA, ensuring computational efficiency.

03. ECP achieves better language modeling performance through pairwise segment attention.

Abstract

One of the key challenges in Transformer architectures is the quadratic complexity of the attention mechanism, which limits the efficient processing of long sequences. Many recent works have attempted to reduce the $O(n^{2})$ time complexity of attention to semi-linear complexity. However, maintaining high performance while reducing complexity remains an unsolved problem. One important line of work in this respect is the Perceiver class of architectures, which have demonstrated excellent performance while reducing computational complexity. In this paper, we use the PerceiverAR as a basis and explore the design space of different trade-offs between preserving context and reducing attention complexity. To this end, we develop four new architectural paradigms, the best performing of which we denote as the Efficient Context propagating Perceiver…
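To make the complexity argument concrete, the sketch below counts query-key score entries for full self-attention and for a simple segment-local scheme in which each token attends only within its own fixed-length segment. This is an illustration of why segmenting reduces the quadratic cost, not the ECP design itself; the function names, the segment length of 256, and the divisibility requirement are assumptions made for the example.

```python
def full_attention_scores(n: int) -> int:
    """Number of query-key score entries for full self-attention over n tokens."""
    return n * n


def segmented_attention_scores(n: int, segment_len: int) -> int:
    """Score entries when each token attends only within its own segment.

    Assumption for this sketch: n is divisible by segment_len.
    """
    if n % segment_len != 0:
        raise ValueError("n must be divisible by segment_len in this sketch")
    num_segments = n // segment_len
    # Each segment computes a segment_len x segment_len score matrix,
    # so the total is n * segment_len rather than n * n.
    return num_segments * segment_len * segment_len


if __name__ == "__main__":
    for n in (1024, 4096, 16384):
        full = full_attention_scores(n)
        seg = segmented_attention_scores(n, segment_len=256)
        print(f"n={n:>6}  full={full:>12,}  segmented={seg:>12,}  ratio={full // seg}")
```

With a fixed segment length, the segmented count grows linearly in the sequence length, so the saving over full attention is the factor n / segment_len. The trade-off the abstract describes is that restricting attention in this way can discard context, which is what the architectures in the paper aim to preserve.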


Taxonomy

Topics: Topic Modeling · Speech Recognition and Synthesis

Methods: Attention Is All You Need · Adam · Dropout · Position-Wise Feed-Forward Layer · Softmax · Dense Connections · Byte Pair Encoding · Linear Layer · Multi-Head Attention · Label Smoothing