MANAR: Memory-augmented Attention with Navigational Abstract Conceptual Representation
Zuher Jahshan, Ben Ben Ishay, Leonid Yavits

TL;DR
MANAR introduces a memory-augmented attention mechanism inspired by cognitive theories, enabling efficient, scalable, and expressive contextualization in language, vision, and speech tasks by mimicking global workspace functions.
Contribution
It proposes a novel attention architecture inspired by Global Workspace Theory (GWT), built around a trainable memory and an Abstract Conceptual Representation (ACR), that achieves linear-time complexity and enables knowledge transfer from pretrained models.
Findings
Matches or exceeds baseline performance in language, vision, and speech tasks.
Achieves linear-time scaling in sequence length, avoiding the quadratic cost of standard attention.
Enables non-convex contextualization, allowing creative representation synthesis.
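The scaling and convexity findings can be made concrete with a small sketch. It is illustrative only: the memory size `m` and the "one unit per query-key score" cost model are assumptions introduced here, not figures from the paper. The convexity check uses only the standard fact that softmax weights are non-negative and sum to 1, so a standard attention output always lies within the range spanned by its values.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def score_count_full(n):
    """Query-key scores computed by all-to-all attention over n tokens."""
    return n * n

def score_count_memory(n, m):
    """Scores if each of n tokens attends only to m memory slots (assumed design)."""
    return n * m

# Standard attention output for one query over three scalar values:
# a convex combination, so it cannot leave the interval [min(values), max(values)].
weights = softmax([2.0, -1.0, 0.5])
values = [1.0, 3.0, 2.0]
output = sum(w * v for w, v in zip(weights, values))
```

With a fixed `m`, `score_count_memory` grows linearly in `n` while `score_count_full` grows quadratically. How MANAR itself escapes the convex hull is not described in this summary, so the sketch only shows the constraint that standard attention obeys.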
Abstract
The MANAR (Memory-augmented Attention with Navigational Abstract Conceptual Representation) contextualization layer generalizes standard multi-head attention (MHA) by instantiating the principles of Global Workspace Theory (GWT). While MHA enables unconstrained all-to-all communication, it lacks the functional bottleneck and global integration mechanisms hypothesized in cognitive models of consciousness. MANAR addresses this by implementing a central workspace through a trainable memory of abstract concepts and an Abstract Conceptual Representation (ACR). The architecture follows a two-stage logic that maps directly to GWT mechanics: (i) an integration phase, where retrieved memory concepts converge to form a collective "mental image" (the ACR) based on input stimuli; and (ii) a broadcasting phase, where this global state navigates and informs the contextualization of individual local…
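The two-stage logic described in the abstract can be sketched roughly as follows. This is not the authors' formulation: the abstract is truncated here, so the mean-pooled query in `integrate`, the self-versus-ACR mixing rule in `broadcast`, and all function names are hypothetical choices made for illustration. A trained layer would also use learned projections, which are omitted.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def weighted_sum(weights, vectors):
    """Weighted sum of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(w * v[i] for w, v in zip(weights, vectors)) for i in range(dim)]

def integrate(tokens, memory):
    """Stage (i): pool the input into one query, score memory slots, form the ACR."""
    dim, n = len(tokens[0]), len(tokens)
    pooled = [sum(t[i] for t in tokens) / n for i in range(dim)]  # assumption: mean pooling
    scale = math.sqrt(dim)
    weights = softmax([dot(pooled, slot) / scale for slot in memory])
    return weighted_sum(weights, memory)

def broadcast(tokens, acr):
    """Stage (ii): each token mixes itself with the shared ACR (assumed update rule)."""
    scale = math.sqrt(len(acr))
    out = []
    for tok in tokens:
        w_self, w_acr = softmax([dot(tok, tok) / scale, dot(tok, acr) / scale])
        out.append([w_self * t + w_acr * a for t, a in zip(tok, acr)])
    return out

def workspace_layer(tokens, memory):
    """Run both stages; cost is linear in the number of tokens."""
    acr = integrate(tokens, memory)
    return acr, broadcast(tokens, acr)

tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # toy input, 3 tokens of width 2
memory = [[0.5, 0.5], [-1.0, 2.0]]              # toy memory, 2 concept slots
acr, contextualized = workspace_layer(tokens, memory)
```

No token ever attends to another token directly: all cross-token information passes through the single ACR vector, which is the bottleneck the abstract contrasts with all-to-all MHA. Because this sketch mixes with softmax weights throughout, it stays convex and does not reproduce the non-convex contextualization the paper reports.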
Peer Reviews
Decision: Submitted to ICLR 2026
N/A, see ethics comment & 'Weaknesses' section
As pointed out at the beginning of the reviewing phase, the margins of the paper unfortunately appear to have been significantly altered, which allows more space than the original template. I therefore have to recommend desk rejection / rejection due to misuse of the format.
- The idea is clearly present: a unification of retrieved global context (ACR) with local attention to avoid all-pairs attention.
- Efficiency: Substantial wall-clock and HBM savings in microbenchmarks and end-to-end DeiT-S at large resolutions, with improvements growing with sequence length.
- MANAR enables quick adoption and a large reduction in trainable parameters/steps while retaining accuracy on vision and speech.
- **Modest accuracy gains:** On ImageNet-1K, improvements over DeiT-B are small (82.3% vs. 81.8%). For ASR, the paper claims SOTA, but test-clean 2.9 trails data2vec (2.8) and test-other is tied at 6.8.
- **Related work gaps:** While Linformer/Performer and long-sequence families (Mamba/RetNet, KV-cache management) are cited, several key lines are missing or under-discussed: sparse attention baselines, Swin/local-window ViTs, Transformer-XL/Compressive Transformer, standard retrieval-augmented m
Taxonomy
Topics: Action Observation and Synchronization · Neurobiology of Language and Bilingualism · Embodied and Extended Cognition
