Ultra-Sparse Memory Network
Zihao Huang, Qiyang Min, Hongzhi Huang, Defa Zhu, Yutao Zeng, Ran Guo, Xun Zhou

TL;DR
This paper introduces UltraMem, an ultra-sparse memory layer for Transformers that significantly reduces inference latency while improving performance, enabling models to scale to billions of memory slots or experts.
Contribution
The paper presents UltraMem, a novel ultra-sparse memory architecture that outperforms Mixture of Experts in speed and scalability while maintaining high performance.
Findings
UltraMem achieves state-of-the-art inference speed.
UltraMem maintains high model performance with large-scale memory.
Scaling laws favor UltraMem over MoE.
Abstract
It is widely acknowledged that the performance of Transformer models is logarithmically related to their number of parameters and computational complexity. While approaches like Mixture of Experts (MoE) decouple parameter count from computational complexity, they still face challenges in inference due to high memory access costs. This work introduces UltraMem, which incorporates a large-scale, ultra-sparse memory layer to address these limitations. Our approach significantly reduces inference latency while maintaining model performance. We also investigate the scaling laws of this new architecture, demonstrating that it not only exhibits favorable scaling properties but also outperforms MoE. In experiments, the largest UltraMem we train has 20 million memory slots. The results show that our method achieves state-of-the-art inference speed and model performance within a given computational budget,…
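The abstract does not spell out how a sparse memory layer reads from its slots. As a rough illustration only, the sketch below implements a product-key style top-k lookup, the PKM technique that the reviews below refer to, and not UltraMem itself. The function name `product_key_lookup`, the `row_keys`/`col_keys` tables, and all sizes are assumptions made for this example; refinements mentioned in the reviews, such as the Tucker decomposition, are left out.

```python
import numpy as np

def product_key_lookup(query, row_keys, col_keys, values, k):
    """Sparse top-k read from an n x n grid of memory slots (product-key style).

    Hypothetical helper for illustration; not the paper's implementation.
    """
    n = row_keys.shape[0]
    half = query.shape[0] // 2
    q_row, q_col = query[:half], query[half:]

    # Score each half of the query against its own small key table: O(n) each,
    # instead of scoring all n * n slots directly.
    row_scores = row_keys @ q_row                      # shape (n,)
    col_scores = col_keys @ q_col                      # shape (n,)

    # Keep the k best rows and the k best columns.
    top_rows = np.argsort(row_scores)[-k:]
    top_cols = np.argsort(col_scores)[-k:]

    # Slot (i, j) scores row_scores[i] + col_scores[j]; the global top-k slots
    # are guaranteed to lie inside this k x k candidate grid.
    cand = row_scores[top_rows][:, None] + col_scores[top_cols][None, :]
    flat = np.argsort(cand.ravel())[-k:]
    ri, ci = np.unravel_index(flat, cand.shape)
    slot_ids = top_rows[ri] * n + top_cols[ci]
    scores = cand[ri, ci]

    # Softmax over the selected slots only, then a weighted sum of their values.
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    output = weights @ values[slot_ids]
    return output, slot_ids, weights

# Toy sizes (assumed): 64 * 64 = 4096 slots, but only k = 4 value rows are read.
rng = np.random.default_rng(0)
n, d_key, d_val, k = 64, 16, 8, 4
row_keys = rng.standard_normal((n, d_key // 2))
col_keys = rng.standard_normal((n, d_key // 2))
values = rng.standard_normal((n * n, d_val))
query = rng.standard_normal(d_key)

output, slot_ids, weights = product_key_lookup(query, row_keys, col_keys, values, k)
print(slot_ids.shape, output.shape)
```

In this toy setup the layer holds 4096 value vectors but touches only 4 of them per query, which is the sense in which memory access stays small while the parameter count grows.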
Peer Reviews
Decision: ICLR 2025 Poster
* Memory access is perhaps the main bottleneck in inference for contemporary hardware, and so this is a well-motivated and impactful direction to be exploring.
* The results appear to show impressive improvements over Mixtures of Experts (MoEs) in inference time and memory access while maintaining validation loss/perplexity.
* The authors' methodology is detailed and thoughtful, for example deriving the correct initialization for their method (which would appear to also apply somewhat to PKM).
* Overall the paper reads poorly, more as a collection of experimental details and results rather than a cohesive story. I know that's frustrating and perhaps vague feedback to get as an author but I think it's really important to point out as it makes the paper quite hard to read as it is, and reduces the impact of the authors' work. To be clear, I do appreciate the experimental and methodological details themselves. However, I believe the authors could do the work much more justice by taking a
1. This paper introduces the UltraMem architecture, which demonstrates innovation by significantly reducing inference latency while maintaining computational efficiency.
2. The paper presents extensive experiments comparing the performance of UltraMem with traditional models (such as MoE and dense models), verifying UltraMem’s advantages in inference speed, memory access costs, and scalability.
3. The experiments show that UltraMem’s memory access volume grows much more slowly with batch size
1. The paper lacks references and descriptions of the architecture diagrams within the main text. For example, each step in Figure 4 is not referenced in the text. Additionally, certain terms in the architecture diagrams, such as “fetch values,” are not explained in the text, making the paper difficult to follow.
2. The paper claims that the proposed UltraMem method has stronger scalability. However, UltraMem was only tested on models with 151M, 680M, and 1.6B parameters, without experiments on
This paper addresses an interesting and impactful problem with practical applications in large language model (LLM) research.
- The authors propose an alternative, memory-efficient approach to achieving the performance of Mixture of Experts (MoE) models.
- They introduce a sparse memory access mechanism and a 2D Product Key Memory structure, which restricts memory access to only the most relevant slots, enhancing efficiency.
- **Scalability**: The use of Tucker Decomposition improves the scalab
Although the idea presented in this paper is novel, the experimental performance assessment shows some limitations:
1. **Lack of a Proper Baseline**: The authors do not use a well-established MoE baseline to evaluate the performance of their method. Since UltraMem is proposed as a memory-efficient alternative to MoE, it would be beneficial to compare it against real-world MoE implementations from existing open-source models, providing a more comprehensive analysis.
2. **Limited Experiments wit
Taxonomy
Topics: Advanced Memory and Neural Computing · Neural Networks and Reservoir Computing · Machine Learning and ELM
Methods: Attention Is All You Need · Dense Connections · Label Smoothing · Adam · Residual Connection · Byte Pair Encoding · Linear Layer · Softmax · Position-Wise Feed-Forward Layer · SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings
