Memory Efficient Transformer Adapter for Dense Predictions
Dong Zhang, Rui Yan, Pingcheng Dong, Kwang-Ting Cheng

TL;DR
META is a memory-efficient Vision Transformer adapter that reduces memory access operations, improves inference speed, and enhances dense prediction tasks by sharing normalization, employing cross-shaped self-attention, and adding lightweight convolutional features.
Contribution
The paper introduces META, a novel ViT adapter that significantly improves memory efficiency and inference speed while boosting accuracy for dense prediction tasks.
Findings
Achieves state-of-the-art accuracy-efficiency trade-off.
Reduces memory consumption and inference time.
Enhances dense prediction performance across multiple datasets.
Abstract
While current Vision Transformer (ViT) adapter methods have shown promising accuracy, their inference speed is implicitly hindered by inefficient memory access operations, e.g., standard normalization and frequent reshaping. In this work, we propose META, a simple and fast ViT adapter that can improve the model's memory efficiency and decrease memory time consumption by reducing the inefficient memory access operations. Our method features a memory-efficient adapter block that enables the common sharing of layer normalization between the self-attention and feed-forward network layers, thereby reducing the model's reliance on normalization operations. Within the proposed block, the cross-shaped self-attention is employed to reduce the model's frequent reshaping operations. Moreover, we augment the adapter block with a lightweight convolutional branch that can enhance local inductive…
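The normalization sharing described in the abstract can be illustrated with a toy sketch. It contrasts a conventional pre-norm block, which normalizes once before attention and again before the feed-forward layer, with a block that normalizes once and feeds the same result to every branch. Everything here is an assumption made for illustration: the helper names (`CountingNorm`, `shared_norm_block`), the stand-in branches, and the choice to sum the branch outputs are hypothetical and are not taken from the paper.

```python
import math


def layer_norm(token, eps=1e-5):
    """Normalize one token (a list of floats) to zero mean and unit variance."""
    mean = sum(token) / len(token)
    var = sum((v - mean) ** 2 for v in token) / len(token)
    return [(v - mean) / math.sqrt(var + eps) for v in token]


class CountingNorm:
    """Applies layer_norm to every token and counts how often it is invoked."""

    def __init__(self):
        self.calls = 0

    def __call__(self, tokens):
        self.calls += 1
        return [layer_norm(t) for t in tokens]


def add(a, b):
    """Element-wise sum of two token lists (residual connection)."""
    return [[x + y for x, y in zip(ta, tb)] for ta, tb in zip(a, b)]


def standard_block(tokens, attn, ffn, norm):
    """Conventional pre-norm block: one normalization per sub-layer."""
    tokens = add(tokens, attn(norm(tokens)))
    tokens = add(tokens, ffn(norm(tokens)))
    return tokens


def shared_norm_block(tokens, attn, ffn, conv, norm):
    """Assumed layout: normalize once, reuse the result in all branches."""
    normed = norm(tokens)  # the only normalization pass in this block
    out = tokens
    for branch in (attn, ffn, conv):
        out = add(out, branch(normed))  # summing branches is an assumption
    return out


# Hypothetical stand-in branches; real branches would have learned weights.
def toy_attn(tokens):
    """Replace each token with the mean over all tokens."""
    mean = [sum(col) / len(tokens) for col in zip(*tokens)]
    return [list(mean) for _ in tokens]


def toy_ffn(tokens):
    """Point-wise ReLU."""
    return [[max(0.0, v) for v in t] for t in tokens]


def toy_conv(tokens):
    """Average each token with its immediate neighbours in the sequence."""
    out = []
    for i in range(len(tokens)):
        window = tokens[max(0, i - 1): i + 2]
        out.append([sum(col) / len(window) for col in zip(*window)])
    return out


tokens = [[1.0, 2.0, 3.0, 4.0], [0.5, -1.0, 2.0, 0.0], [3.0, 3.5, -2.0, 1.0]]
std_norm, shared_norm = CountingNorm(), CountingNorm()
std_out = standard_block(tokens, toy_attn, toy_ffn, std_norm)
shared_out = shared_norm_block(tokens, toy_attn, toy_ffn, toy_conv, shared_norm)
print(std_norm.calls, shared_norm.calls)
```

The counter only shows that the shared layout halves the number of normalization passes per block in this toy setting; it says nothing about the measured speedups reported for META.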
Peer Reviews
Decision·ICLR 2025 Poster
1. The method proposed in this work is simple but effective, achieving higher performance and efficiency in various classic detection and segmentation frameworks.
2. The paper provides clear and understandable descriptions of the details of each module in the MEA block, with the design purposes of each module being clear and effective.
1. There is still space on the main text pages, but the implementation parameters of the model are not clarified, such as the number of cascades. Different designs of each size are also not specified.
This paper presents a simple and fast ViT adapter named META, which addresses the critical yet underexplored issue of memory inefficiency. The quality of this paper is supported by theoretical foundations and empirical validations across various tasks and datasets, demonstrating that META outperforms state-of-the-art models in terms of accuracy and memory usage. The paper is structured clearly, with detailed architectural descriptions and clear explanations of the proposed motivation.
In the Attn branch discussed in this paper, the adoption of the cross-shaped self-attention (CSA) mechanism is a pivotal factor in effectively reducing the frequent reshaping operations of the model. However, the current analysis lacks an in-depth comparison and discussion between CSA and other efficient attention mechanisms, failing to fully elaborate on why the selection of CSA achieves the current experimental results. The ablation analysis in this paper is currently limited to the results
1. META introduces a cross-shaped self-attention mechanism and a cascaded process, both of which are grounded in the principle of dividing the entire feature into multiple smaller features to reduce memory costs.
2. META incorporates local inductive biases by introducing convolutions into the FFN and an additional lightweight convolutional branch. This enables META to achieve better performance in extensive experimental evaluations.
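The idea of dividing the feature map into smaller pieces can be made concrete with a small counting sketch. It assumes the CSWin-style definition of cross-shaped attention, in which one group of heads attends within horizontal stripes and the other within vertical stripes; the function names, the stripe width, and the 32x32 grid are hypothetical, and the sketch counts query-key pairs only, not the reshaping operations the paper targets.

```python
def full_attention_pairs(h, w):
    """Query-key pairs when every token attends to every token."""
    n = h * w
    return n * n


def cross_shaped_pairs(h, w, stripe):
    """Query-key pairs for the horizontal and vertical head groups.

    Each token attends only to the tokens inside its own stripe.
    """
    assert h % stripe == 0 and w % stripe == 0, "stripe must divide h and w"
    n = h * w
    horizontal = n * (stripe * w)  # stripe spans `stripe` rows, full width
    vertical = n * (h * stripe)    # stripe spans full height, `stripe` columns
    return horizontal, vertical


def stripe_members(row, col, h, w, stripe, direction):
    """Grid positions that the token at (row, col) attends to."""
    if direction == "horizontal":
        r0 = (row // stripe) * stripe
        return [(r, c) for r in range(r0, r0 + stripe) for c in range(w)]
    c0 = (col // stripe) * stripe
    return [(r, c) for r in range(h) for c in range(c0, c0 + stripe)]


h, w, stripe = 32, 32, 2
full = full_attention_pairs(h, w)
horizontal, vertical = cross_shaped_pairs(h, w, stripe)
print(full, horizontal, vertical)  # 1048576 65536 65536
print(len(stripe_members(5, 7, h, w, stripe, "horizontal")))  # 64
```

With these assumed sizes, each head group scores 16 times fewer query-key pairs than full attention, which is the kind of reduction the stripe partition is meant to provide.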
1. Insufficient Motivation (1): META claims that the inference speed of previous adapters is hindered by inefficient memory access operations such as normalization and frequent reshaping, but it lacks experimental analysis to support this claim. It is recommended to provide a detailed breakdown of inference time to show the proportion of inefficient memory access operations in META and previous methods.
2. Insufficient Motivation (2): META aims to decrease memory access costs by reducing frequen
Taxonomy
Topics: Neural Networks and Applications
Methods: Attention Is All You Need · Label Smoothing · Linear Layer · Byte Pair Encoding · Dense Connections · Residual Connection · Multi-Head Attention · Position-Wise Feed-Forward Layer · Adam · Softmax
