Multi-layer Learnable Attention Mask for Multimodal Tasks
Wayner Barrios, SouYoung Jin

TL;DR
This paper introduces a Multi-layer Learnable Attention Mask (LAM) for Transformer models, improving performance and efficiency in multimodal tasks by regulating attention and emphasizing critical tokens across diverse data types.
Contribution
The paper proposes a novel multi-layer learnable attention mask that enhances Transformer models' ability to handle multimodal data by globally regulating attention and reducing redundant computations.
Findings
LAM improves model performance on multimodal datasets
LAM reduces computational complexity in Transformer models
Multi-layer LAM captures diverse information at different network layers
Abstract
While the Self-Attention mechanism in the Transformer model has proven effective in many domains, we observe that it is less effective in more diverse settings (e.g., multimodality) due to the varying granularity of each token and the high computational demands of lengthy sequences. To address these challenges, we introduce the Learnable Attention Mask (LAM), strategically designed to globally regulate attention maps and prioritize critical tokens within the sequence. Leveraging the Self-Attention module in a BERT-like Transformer network, our approach adeptly captures associations between tokens. The extension of the LAM to a multi-layer version accommodates the varied information aspects embedded at each layer of the Transformer network. Comprehensive experimental validation on various datasets, such as MADv2, QVHighlights, ImageNet 1K, and MSRVTT, demonstrates the efficacy of the…
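The abstract does not say how the mask is produced or how it is applied to the attention map, so the following is only a minimal sketch of one plausible reading, not the authors' implementation. It assumes a separate learnable matrix of mask logits per layer, squashed to (0, 1) with a sigmoid and multiplied element-wise into the scaled dot-product scores before the softmax. The names `masked_attention`, `multi_layer`, and `mask_logits` are hypothetical, the query/key/value projections are omitted for brevity, and the sketch does not attempt to reproduce the efficiency gains the paper reports.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def softmax(row):
    # Subtract the row maximum for numerical stability.
    peak = max(row)
    exps = [math.exp(s - peak) for s in row]
    total = sum(exps)
    return [e / total for e in exps]


def masked_attention(q, k, v, mask_logits):
    """Single-head attention with a multiplicative gate on the scores.

    q, k, v: lists of T vectors. mask_logits: T x T list of floats standing
    in for learnable parameters (ASSUMPTION: the paper may generate or apply
    its mask differently).
    """
    dim = len(q[0])
    outputs, weights = [], []
    for i, q_i in enumerate(q):
        # Scaled dot-product scores of token i against every key.
        scores = [sum(a * b for a, b in zip(q_i, k_j)) / math.sqrt(dim) for k_j in k]
        # A gate near 1 keeps the score; a gate near 0 flattens it toward 0,
        # removing the content-based preference for that token pair.
        gated = [s * sigmoid(mask_logits[i][j]) for j, s in enumerate(scores)]
        w = softmax(gated)
        weights.append(w)
        outputs.append(
            [sum(w[j] * v[j][c] for j in range(len(v))) for c in range(len(v[0]))]
        )
    return outputs, weights


def multi_layer(x, layer_mask_logits):
    """Stack self-attention layers, each with its own mask (hypothetical)."""
    for mask_logits in layer_mask_logits:
        attended, _ = masked_attention(x, x, x, mask_logits)
        # Residual connection; layer norm and feed-forward blocks are omitted.
        x = [[a + b for a, b in zip(x_i, a_i)] for x_i, a_i in zip(x, attended)]
    return x


tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
open_mask = [[50.0] * 3 for _ in range(3)]     # gates close to 1
closed_mask = [[-50.0] * 3 for _ in range(3)]  # gates close to 0
out_closed, w_closed = masked_attention(tokens, tokens, tokens, closed_mask)
result = multi_layer(tokens, [open_mask, closed_mask])
```

With every gate near 0 the scores collapse to a constant, so each row of `w_closed` is approximately uniform; with gates near 1 the layer behaves like ordinary self-attention. Giving each layer its own logits is what lets different layers emphasize different token pairs in this sketch.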
Taxonomy
Topics: Robotics and Automated Systems
Methods: Attention Is All You Need · Softmax · Layer Normalization · Linear Layer · Byte Pair Encoding · Label Smoothing · Adam · Residual Connection · Position-Wise Feed-Forward Layer · Multi-Head Attention
