Avey-B
Devang Acharya, Mohammad Hammoud

TL;DR
This paper adapts the Avey autoregressive, attention-free model for encoder-only tasks, introducing architectural innovations that improve efficiency and performance on NLP benchmarks compared to traditional Transformer encoders.
Contribution
The paper reformulates Avey for encoder-only use, incorporating decoupled static and dynamic parameterizations, stability-oriented normalization, and neural compression, and reports stronger benchmark performance and more efficient long-context scaling than Transformer-based encoders.
Findings
Outperforms four standard Transformer encoders on token classification and retrieval tasks.
Scales more efficiently to long contexts.
Maintains high-quality bidirectional contextualization without attention mechanisms.
Abstract
Compact pretrained bidirectional encoders remain the backbone of industrial NLP under tight compute and memory budgets. Their effectiveness stems from self-attention's ability to deliver high-quality bidirectional contextualization with sequence-level parallelism, as popularized by BERT-style architectures. Recently, Avey was introduced as an autoregressive, attention-free alternative that naturally admits an encoder-only adaptation. In this paper, we reformulate Avey for the encoder-only paradigm and propose several innovations to its architecture, including decoupled static and dynamic parameterizations, stability-oriented normalization, and neural compression. Results show that this reformulated architecture compares favorably to four widely used Transformer-based encoders, consistently outperforming them on standard token-classification and information-retrieval benchmarks while…
Peer Reviews
Decision·ICLR 2026 Poster
The architectural refinements are thoughtfully motivated. I especially appreciate the clarity in how the authors separate static and dynamic layers—this reflects careful reasoning about the pitfalls of coupled parameterization. The discussion around monotonicity provides theoretical substance rather than heuristic justification, and the normalization strategy is simple yet effective for improving training stability. The neural compression idea feels practical and grounded in real deployment conc
While the work is well executed, the novelty is somewhat incremental relative to the original Avey model. The decoupling and normalization ideas, though meaningful, read more as refinements than as a fundamentally new architecture. Efficiency comparisons would be stronger with a fused-kernel implementation to remove framework overhead. I also would have liked to see more analysis of how the compression affects representational quality or long-range dependency modeling, as well as sensitivity stu
* **Innovative architecture for efficient encoders**: The paper presents a commendable attempt to move beyond the dominant Transformer paradigm. The proposed Avey architecture represents a promising step toward more efficient, attention-free encoder designs.
* **Potential for long-sequence applications**: The results suggest that the core ideas underlying Avey hold significant potential, particularly for tasks involving long sequences where computational efficiency is a primary concern.
* **Architectural limitation**: The ranker’s reliance on MaxSim over non-contextualized embeddings appears to be a major limitation. This design makes the crucial context-selection step purely lexical, preventing it from leveraging the deeper semantic representations learned in later layers. In edge cases where certain splits are repeated multiple times in a document, the ranker would likely retrieve these redundant splits to form the context.
* **Significance assessment**: The main results in T
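The redundancy concern raised in the review can be illustrated with a minimal sketch. It assumes a ColBERT-style MaxSim score (for each token of the current split, take the maximum cosine similarity to any token of a candidate split, then sum) computed over raw token embeddings. The function names `maxsim` and `top_k_splits`, the split size, and the random embeddings are hypothetical and chosen for illustration only; this is not the authors' implementation.

```python
import numpy as np


def maxsim(query: np.ndarray, candidate: np.ndarray) -> float:
    """Sum, over query tokens, of the max cosine similarity to any candidate token."""
    q = query / np.linalg.norm(query, axis=1, keepdims=True)
    c = candidate / np.linalg.norm(candidate, axis=1, keepdims=True)
    sim = q @ c.T  # shape: (query tokens, candidate tokens)
    return float(sim.max(axis=1).sum())


def top_k_splits(query: np.ndarray, candidates: list, k: int):
    """Rank candidate splits by MaxSim against the query split; return top-k indices and all scores."""
    scores = np.array([maxsim(query, c) for c in candidates])
    order = np.argsort(-scores, kind="stable")[:k]
    return order.tolist(), scores


# Hypothetical setup: splits of 4 tokens with 8-dimensional, non-contextualized embeddings.
rng = np.random.default_rng(0)
query = rng.normal(size=(4, 8))
repeated = query.copy()          # a split whose tokens repeat the query split
other = rng.normal(size=(4, 8))  # an unrelated split

candidates = [repeated, other, repeated.copy()]
selected, scores = top_k_splits(query, candidates, k=2)
print(selected, scores)
```

In this toy setup both copies of the repeated split receive the maximum possible score and fill both top-k slots, so the unrelated split is never selected. Whether this matters in practice depends on how often real documents contain repeated splits, which the sketch does not measure.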
The insight of contextualizing splits efficiently through chunk-based retrieval is interesting (although, I believe, not novel). This paper, however, proposes an architecture designed entirely to contextualize external chunk information properly, and presents various design ablations. Beyond long-context processing, Avey-B showcases very strong results on short-context benchmarks, with an architecture quite different from traditional encoder Transformers, even under limited data regimes.
**Introduction/Conclusion clarity**: The introduction is hard to read and quickly dives into details. It often contrasts Avey-B, the model at hand, with Avey, an autoregressive variant, but I don't think the reader should be expected to know how Avey works in order to understand this paper easily. To illustrate, l60-67 are really hard to understand without skipping ahead, understanding the mechanisms, and going back. More generally, I do not believe framing the entire paper as "an extension to A
Taxonomy
Topics: Advanced Neural Network Applications · Generative Adversarial Networks and Image Synthesis · Adversarial Robustness in Machine Learning
