Parameter Efficient Multimodal Transformers for Video Representation Learning

Sangho Lee; Youngjae Yu; Gunhee Kim; Thomas Breuel; Jan Kautz; Yale Song

arXiv:2012.04124 · cs.CV · September 23, 2021 · 36 cites

TL;DR

This paper introduces a parameter-efficient multimodal Transformer architecture for video representation learning, significantly reducing memory requirements and enabling end-to-end training from scratch on large-scale video datasets.

Contribution

It proposes a novel parameter sharing scheme based on low-rank approximation and a modality decomposition, allowing up to 97% parameter reduction and end-to-end training.
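The sharing idea can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: it assumes each weight matrix is factored as W = U diag(s) V^T, with U shared by every module and s, V private to each one. Which factor is shared, the toy sizes, and the module names are all assumptions made for the example.

```python
import random

def rand_matrix(rows, cols, rng):
    """Return a rows x cols matrix of Gaussian entries as nested lists."""
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]

def compose_weight(U, s, V):
    """Build W = U @ diag(s) @ V^T from a shared U and a private (s, V)."""
    rank = len(s)
    return [[sum(U[i][k] * s[k] * V[j][k] for k in range(rank))
             for j in range(len(V))]
            for i in range(len(U))]

def dense_params(d_in, d_out, n_modules):
    """Parameters when every module owns a full d_in x d_out matrix."""
    return n_modules * d_in * d_out

def factored_params(d_in, d_out, rank, n_modules):
    """Parameters with one shared U plus a private (s, V) per module."""
    shared = d_in * rank             # U, stored once
    private = rank + d_out * rank    # diagonal s and V, stored per module
    return shared + n_modules * private

rng = random.Random(0)
d_in, d_out, rank = 8, 6, 2          # hypothetical toy sizes
U = rand_matrix(d_in, rank, rng)     # shared across all modules

modules = {}
for name in ("audio", "visual"):     # hypothetical module names
    s = [rng.gauss(0.0, 1.0) for _ in range(rank)]
    V = rand_matrix(d_out, rank, rng)
    modules[name] = compose_weight(U, s, V)

# Fraction of parameters saved for three 768 x 768 matrices at rank 64.
saving = 1 - factored_params(768, 768, 64, 3) / dense_params(768, 768, 3)
```

With these made-up sizes `saving` comes out a little under 0.89. That number only reflects the chosen rank and module count and should not be read as the 97% figure reported for the paper, which also depends on sharing across layers.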

Findings

01 Achieves up to 97% reduction in Transformer parameters.

02 Enables training from scratch on large-scale video data.

03 Improves performance on audio-visual classification tasks.

Abstract

The recent success of Transformers in the language domain has motivated adapting them to a multimodal setting, where a new visual model is trained in tandem with an already pretrained language model. However, due to the excessive memory requirements of Transformers, existing work typically fixes the language model and trains only the vision module, which limits its ability to learn cross-modal information in an end-to-end manner. In this work, we focus on reducing the parameters of multimodal Transformers in the context of audio-visual video representation learning. We alleviate the high memory requirement by sharing the parameters of Transformers across layers and modalities; we decompose the Transformer into modality-specific and modality-shared parts so that the model learns the dynamics of each modality both individually and together, and propose a novel parameter sharing scheme…
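Sharing parameters across layers means the same weights are applied at every depth, so the parameter count stops growing with the number of layers. The sketch below shows only that mechanism. The layer here is a toy residual linear map with a ReLU, not a Transformer layer, and all names and values are hypothetical.

```python
def linear(W, x):
    """Multiply matrix W (nested lists) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def shared_stack(W, x, depth):
    """Apply the same toy layer `depth` times, reusing W at every layer."""
    for _ in range(depth):
        h = linear(W, x)
        # Residual connection around a ReLU-activated linear map.
        x = [xi + max(0.0, hi) for xi, hi in zip(x, h)]
    return x

def param_count(W):
    """Number of scalar entries in W."""
    return sum(len(row) for row in W)

W = [[1.0, 0.0], [0.0, 1.0]]         # hypothetical 2 x 2 weight
depth = 2
out = shared_stack(W, [1.0, -1.0], depth)

shared_total = param_count(W)             # one copy, whatever the depth
unshared_total = depth * param_count(W)   # one copy per layer
```

Doubling `depth` doubles `unshared_total` but leaves `shared_total` unchanged, which is the source of the memory saving the abstract describes for layer-wise sharing.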

Videos

Parameter Efficient Multimodal Transformers for Video Representation Learning · SlidesLive

Taxonomy

Topics: Music and Audio Processing · Multimodal Machine Learning Applications · Speech and Audio Processing

Methods: Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Multi-Head Attention · Residual Connection · Adam · Dense Connections · Dropout · Attention Is All You Need · Softmax