Multi-layer Learnable Attention Mask for Multimodal Tasks

Wayner Barrios; SouYoung Jin

arXiv:2406.02761 · cs.CV · June 6, 2024

TL;DR

This paper introduces a Multi-layer Learnable Attention Mask (LAM) for Transformer models, improving performance and efficiency in multimodal tasks by regulating attention and emphasizing critical tokens across diverse data types.

Contribution

The paper proposes a novel multi-layer learnable attention mask that enhances Transformer models' ability to handle multimodal data by globally regulating attention and reducing redundant computations.

Findings

01. LAM improves model performance on multimodal datasets
02. LAM reduces computational complexity in Transformer models
03. Multi-layer LAM captures diverse information at different network layers

Abstract

While the Self-Attention mechanism in the Transformer model has proven to be effective in many domains, we observe that it is less effective in more diverse settings (e.g. multimodality) due to the varying granularity of each token and the high computational demands of lengthy sequences. To address the challenges, we introduce the Learnable Attention Mask (LAM), strategically designed to globally regulate attention maps and prioritize critical tokens within the sequence. Leveraging the Self-Attention module in a BERT-like transformer network, our approach adeptly captures associations between tokens. The extension of the LAM to a multi-layer version accommodates the varied information aspects embedded at each layer of the Transformer network. Comprehensive experimental validation on various datasets, such as MADv2, QVHighlights, ImageNet 1K, and MSRVTT, demonstrates the efficacy of the…
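The abstract describes a mask that is learned, applied to the attention maps, and extended to one mask per layer. The sketch below is one possible reading of that idea, not the authors' implementation. The parameterization is an assumption made for illustration: each layer holds a hypothetical `mask_logits` matrix of shape n x n, and `log(sigmoid(mask_logits))` is added to the scaled dot-product scores before the softmax. The function names, the single-head layout, and the toy weights are likewise hypothetical. Plain Python is used to keep the example dependency-free.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def log_sigmoid(m):
    """Numerically stable log(sigmoid(m))."""
    if m >= 0:
        return -math.log1p(math.exp(-m))
    return m - math.log1p(math.exp(m))

def softmax_rows(scores):
    """Row-wise softmax with max-subtraction for stability."""
    out = []
    for row in scores:
        top = max(row)
        exps = [math.exp(v - top) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

def masked_self_attention(x, w_q, w_k, w_v, mask_logits):
    """Single-head self-attention gated by a learnable mask (assumed form).

    x is n x d; mask_logits is n x n and would be trained with the model.
    A very negative logit pushes the matching attention weight toward zero.
    """
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    scale = math.sqrt(len(q[0]))
    k_t = [list(col) for col in zip(*k)]
    scores = matmul(q, k_t)
    gated = [
        [s / scale + log_sigmoid(m) for s, m in zip(s_row, m_row)]
        for s_row, m_row in zip(scores, mask_logits)
    ]
    weights = softmax_rows(gated)
    return matmul(weights, v), weights

def multi_layer_attention(x, layer_weights, layer_mask_logits):
    """Stack layers, each with its own mask (the multi-layer variant)."""
    for (w_q, w_k, w_v), mask_logits in zip(layer_weights, layer_mask_logits):
        x, _ = masked_self_attention(x, w_q, w_k, w_v, mask_logits)
    return x

# Toy usage: three tokens of width two, identity projections.
identity = [[1.0, 0.0], [0.0, 1.0]]
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
open_mask = [[0.0] * 3 for _ in range(3)]          # uniform shift, no effect
drop_last = [[0.0, 0.0, -50.0] for _ in range(3)]  # suppress token 2 everywhere

_, attn = masked_self_attention(tokens, identity, identity, identity, drop_last)
stacked = multi_layer_attention(
    tokens, [(identity, identity, identity)] * 2, [open_mask, drop_last]
)
```

With all logits at zero the gate shifts every score by the same constant, so the softmax output matches unmasked attention. The per-layer list in `multi_layer_attention` is what lets each layer learn a different pattern of emphasis.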

Taxonomy

Topics: Robotics and Automated Systems

Methods: Attention Is All You Need · Softmax · Layer Normalization · Linear Layer · Byte Pair Encoding · Label Smoothing · Adam · Residual Connection · Position-Wise Feed-Forward Layer · Multi-Head Attention