Attention and Compression is all you need for Controllably Efficient Language Models
Jatin Prakash, Aahlad Puli, Rajesh Ranganath

TL;DR
The paper introduces the Compress & Attend Transformer (CAT), a simple yet flexible architecture that employs dense attention and compression to improve efficiency and controllability in language models without sacrificing quality.
Contribution
It proposes a novel adaptive transformer architecture that allows dynamic quality-compute trade-offs at test time using only dense attention and compression, outperforming existing methods.
Findings
Outperforms existing efficient baselines across tasks.
Matches dense transformer performance while being faster and more memory-efficient.
Enables control of quality and efficiency trade-offs without retraining.
Abstract
The quadratic cost of attention in transformers motivated the development of efficient approaches: namely sparse and sliding window attention, convolutions and linear attention. Although these approaches result in impressive reductions in compute and memory, they often trade off quality, specifically in-context recall performance. Moreover, fixing this quality-compute tradeoff a priori means being suboptimal from the get-go: some downstream applications require more memory for in-context recall, while others require lower latency and memory. Further, these approaches rely on heuristic choices that artificially restrict attention, or require handcrafted and complex recurrent state update rules, or they must be carefully composed with attention at specific layers to form a hybrid architecture that complicates the design process, especially at scale. To address the above issues, we propose…
Peer Reviews
Decision·Submitted to ICLR 2026
1. The core concept of a single, adaptive model that can be adjusted at test time to trade quality for compute is a novel and potentially useful contribution. This provides a flexibility that is absent in most existing efficient architectures, which are fixed into a specific configuration at training time.
2. The paper is clearly written. The central mechanism of the CAT architecture is explained well and is easy to grasp, particularly with the help of Figure 2, which provides a simple visual o
1. The primary weakness is the invalid comparison in the main results tables (Tables 2, 3, 4, 5). The proposed CAT model has approximately 1B parameters. This model is compared against baselines (Dense, Mamba2, GDN) that have only 300M parameters. This 3x-4x parameter disparity invalidates any claims of superior efficiency or performance. The CAT model is not a more efficient architecture; it is a much larger model.
2. The paper's own parameter-matched comparison, in Appendix A.3 (Table 7), dir
- The problem is well-defined and important.
- The idea of training with multiple chunk sizes to enable an adaptive model is useful and interesting, given that most efficient architectures choose the memory and compute budget a priori.
- The results in Table 3 and Figure 3 are very interesting; for a similar latency to Mamba-2/GatedDeltaNet, CAT-8 offers much better quality. It’s also interesting that a single model is used for all the experiments, highlighting that the compute-quality can be tune
The parameter scales differ substantially (CAT ≈ 1B vs. 300M baseline), making it difficult to isolate the contribution of the architecture. Apples-to-apples scaling ablations would strengthen the case. Especially at such small parameter scales, it’s difficult to understand these trends. I will certainly consider increasing my score if I can better understand the impact of the parameter scale differences.
Originality
* Simple, general recipe: *compress past, attend to compressed past + current chunk*; trains on multiple chunk sizes to enable test-time control without retraining.
* Clear parallel training/generation story; no handcrafted recurrent updates.
Quality
* Provides implementation details (attention mask, KV reuse) and complexity accounting; decoder attention scales as $O(N^2/C)$.
* Reports broad benchmarks (LM, LongBench, EVAPORATE recall, RULER NIAH), with CAT $\geq$ baselines under
1. Mismatched capacity in core tables: Baselines are ~300M params; CAT uses a wider 12-layer decoder + compressor (~1B). This clouds “architecture vs scale” effects. Please add size-matched CAT (~300M) and/or scaled baselines (~1B) in the *main* tables.
2. Training cost accounting is thin: The paper notes ~2× longer training but gives no FLOPs/wall-clock/memory breakdown. Add a table with total training FLOPs, hours on fixed hardware, and peak memory for CAT vs baselines.
3. Empirical diff
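The $O(N^2/C)$ decoder-attention scaling mentioned in the review above can be illustrated with a small counting sketch. It is not the authors' implementation: it assumes, purely for illustration, that each completed chunk of `chunk` tokens is summarized by a single compressed representation and that a token attends to those summaries plus the causal prefix of its own chunk. The function names are hypothetical.

```python
def dense_causal_pairs(n: int) -> int:
    """Query-key pairs scored by ordinary causal attention over n tokens."""
    return n * (n + 1) // 2


def compress_attend_pairs(n: int, chunk: int) -> int:
    """Query-key pairs when each token sees one summary per completed chunk
    plus the causal prefix of its own chunk (an illustrative assumption,
    not a detail confirmed by the paper)."""
    total = 0
    for t in range(n):
        past_chunks = t // chunk      # one summary per completed past chunk
        within_chunk = t % chunk + 1  # causal attention inside the current chunk
        total += past_chunks + within_chunk
    return total


if __name__ == "__main__":
    n = 4096
    dense = dense_causal_pairs(n)
    for chunk in (4, 8, 16, 32):
        pairs = compress_attend_pairs(n, chunk)
        print(f"chunk={chunk:>2}  pairs={pairs:>9}  dense/pairs={dense / pairs:.1f}x")
```

Under these assumptions, and when `n` is a multiple of `chunk`, the count works out to $N^2/(2C) + NC/2$, so the first term dominates for long sequences and larger chunks reduce attention work roughly in proportion to $C$. With `chunk=1` the count reduces to the dense causal count.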
Taxonomy
Topics: Natural Language Processing Techniques · Parallel Computing and Optimization Techniques · Topic Modeling
