Modality Agnostic Efficient Long Range Encoder
Toufiq Parag, Ahmed Elgammal

TL;DR
This paper introduces MAELRE, a modality-agnostic transformer architecture for processing long-range sequences on a single device. It reduces the quadratic cost of attention by combining token merging with attention approximation, and the authors report higher accuracy at lower computational cost than existing long-context models.
Contribution
MAELRE is a novel unified transformer architecture that combines token merging with attention approximation for efficient long-range encoding across multiple modalities.
Findings
MAELRE outperforms existing models in accuracy on diverse modality tasks.
It significantly reduces computational cost compared to traditional long-context models.
MAELRE maintains high accuracy while handling longer sequences efficiently.
Abstract
The long-context capability of recent large transformer models can be surmised to rely on techniques such as attention/model parallelism, as well as hardware-level optimizations. While these strategies allow input lengths to scale to millions of tokens, they do not fundamentally mitigate the quadratic computational and memory complexity of the core attention mechanism. In this paper, we address the challenge of long-context processing on a single device using generic implementations by reducing the quadratic memory footprint and inference cost. Existing approaches to extend the context length for generic single device implementations -- such as token merging and modified attentions -- are often modality specific and attain a suboptimal tradeoff between accuracy and efficiency. To overcome these limitations, we propose MAELRE (Modality Agnostic Efficient Long Range Encoder), a unified…
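To make the quadratic cost mentioned in the abstract concrete, here is a minimal sketch of why shortening the token sequence before attention helps. It is not MAELRE's method: the merging rule (averaging every r adjacent tokens), the helper names, and the sizes are all assumptions chosen for illustration, and the attention has no learned projections.

```python
import numpy as np

def merge_adjacent_tokens(x, r):
    """Average every r consecutive tokens (an illustrative stand-in for token merging).

    x has shape (n, d); n must be divisible by r. Returns shape (n // r, d).
    """
    n, d = x.shape
    if n % r != 0:
        raise ValueError("sequence length must be divisible by r")
    return x.reshape(n // r, r, d).mean(axis=1)

def attention(x):
    """Plain single-head self-attention without learned projections."""
    d = x.shape[1]
    scores = x @ x.T / np.sqrt(d)                 # (n, n): the quadratic term
    scores -= scores.max(axis=1, keepdims=True)   # numerical stability for softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ x, weights

rng = np.random.default_rng(0)
tokens = rng.standard_normal((1024, 64))          # hypothetical sequence: 1024 tokens, dim 64
merged = merge_adjacent_tokens(tokens, r=4)       # 256 tokens after merging

_, w_full = attention(tokens)
_, w_merged = attention(merged)
print(w_full.size, w_merged.size)                 # 1048576 65536
```

Merging by a factor of r shrinks the attention score matrix by a factor of r squared (16x here), which is the kind of saving in memory and compute that token-merging approaches aim for. How MAELRE selects tokens to merge and which attention approximation it uses is not specified in this summary.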
Taxonomy
Topics: Advanced Neural Network Applications · Parallel Computing and Optimization Techniques · Big Data and Digital Economy
