Attention Is All You Need For Mixture-of-Depths Routing
Advait Gadhikar, Souptik Kumar Majumdar, Niclas Popp, Piyapat Saranrittichai, Martin Rapp, Lukas Schott

TL;DR
This paper introduces A-MoD, an attention-based routing mechanism for Mixture-of-Depths (MoD) models that improves training efficiency and accuracy without adding parameters, making large models easier to deploy.
Contribution
A-MoD leverages existing attention maps for routing, eliminating extra trainable layers and simplifying training of MoD models.
Findings
Up to 2% higher accuracy on ImageNet
2x faster transfer learning
Simplifies training without extra parameters
Abstract
Advancements in deep learning are driven by training models with increasingly larger numbers of parameters, which in turn heightens the computational demands. To address this issue, Mixture-of-Depths (MoD) models have been proposed to dynamically assign computations only to the most relevant parts of the inputs, thereby enabling the deployment of large-parameter models with high efficiency during inference and training. These MoD models utilize a routing mechanism to determine which tokens should be processed by a layer, or skipped. However, conventional MoD models employ additional network layers specifically for the routing which are difficult to train, and add complexity and deployment overhead to the model. In this paper, we introduce a novel attention-based routing mechanism A-MoD that leverages the existing attention map of the preceding layer for routing decisions within the…
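The routing idea in the abstract can be illustrated with a minimal sketch. It assumes that a token's importance is the attention it receives in the preceding layer, averaged over heads and queries, and that a fixed fraction of tokens (a hypothetical `capacity` parameter) is processed while the rest skip the layer. The function names and the toy attention map are illustrative assumptions, not the authors' implementation, and the exact scoring formula in the paper is not reproduced in this summary.

```python
from typing import List

# Attention map laid out as [head][query][key]; each query row sums to 1.
AttentionMap = List[List[List[float]]]


def attention_routing_scores(attn: AttentionMap) -> List[float]:
    """Score each token by the attention it receives as a key,
    averaged over all heads and all queries (assumed scoring rule)."""
    num_heads = len(attn)
    num_tokens = len(attn[0])
    scores = [0.0] * num_tokens
    for head in attn:
        for query_row in head:
            for key_idx, weight in enumerate(query_row):
                scores[key_idx] += weight
    norm = num_heads * num_tokens
    return [s / norm for s in scores]


def select_tokens(scores: List[float], capacity: float) -> List[int]:
    """Return the indices of the top-k tokens, k = floor(capacity * N), at least 1.
    Selected tokens would be processed by the layer; the others would skip it."""
    k = max(1, int(capacity * len(scores)))
    ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:k])


# Toy example: one head, four tokens (values are made up for illustration).
attn = [[
    [0.1, 0.6, 0.1, 0.2],
    [0.1, 0.5, 0.1, 0.3],
    [0.2, 0.5, 0.1, 0.2],
    [0.1, 0.6, 0.0, 0.3],
]]
scores = attention_routing_scores(attn)
selected = select_tokens(scores, capacity=0.5)
print(selected)  # [1, 3]
```

Because the scores are read directly from an attention map the model already computes, this kind of routing adds no trainable parameters, which is the contrast with the learned router layers of conventional MoD described above.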
Peer Reviews
Decision·Submitted to ICLR 2025
This paper explores evaluating token importance from the existing attention maps, thereby reducing the overhead of extra routing layers.
1. There is no comparison with the current SOTA models since DeiT/ViT are relatively old. 2. The idea is quite similar to the following papers. It would be great to include a comparison (such as accuracy and latency) with these papers: "DynamicViT: Efficient Vision Transformers with Dynamic Token Sparsification" and "Token Merging: Your ViT But Faster".
- Using the transformer attention in the MoD routing makes sense.
- Extensive results do prove that the proposed A-MoD outperforms MoD.
- The entire study is too narrow and limited. This paper is specifically targeted at MoD. However, MoD is just an arXiv paper. Does MoD represent the SoTA in terms of the Pareto frontier? All the experiments are mainly compared with MoD? What about other SoTA methods? BTW, MoD is not impressive in Table 1, where it is even inferior to a simple baseline, isoFLOP.
- Why is higher average attention in (4) corresponding to higher importance? What is the semantic meaning? If so, shouldn't the backgr
1. The authors tackle an important issue in improving the efficiency of ViTs by proposing a method that dynamically selects the most relevant tokens for computation. This approach aligns with similar techniques in the field, such as A-ViT, which also prioritize token selection for efficiency gains. 2. Additionally, the experimental results show promising performance.
The main idea behind the proposed method makes sense to me, but I have several concerns, particularly regarding the experiments and their settings. 1) First, the authors mention in Line 234 that they continue training from a previous checkpoint for an additional 100 epochs. I’m curious whether this approach is justified. Why not simply start training from scratch? 2) I also find the transfer learning setup a bit confusing, especially since the authors do not use fixed pretrained weights. Why
Taxonomy
Topics: Manufacturing Process and Optimization · VLSI and FPGA Design Techniques · Optimization and Packing Problems
Methods: Softmax · Attention Is All You Need
