Parameter Efficient Multimodal Transformers for Video Representation Learning
Sangho Lee, Youngjae Yu, Gunhee Kim, Thomas Breuel, Jan Kautz, Yale Song

TL;DR
This paper introduces a parameter-efficient multimodal Transformer architecture for video representation learning, significantly reducing memory requirements and enabling end-to-end training from scratch on large-scale video datasets.
Contribution
It proposes a novel parameter sharing scheme based on low-rank approximation and a modality decomposition, allowing up to 97% parameter reduction and end-to-end training.
Findings
Achieves up to 97% reduction in Transformer parameters.
Enables training from scratch on large-scale video data.
Improves performance on audio-visual classification tasks.
Abstract
The recent success of Transformers in the language domain has motivated adapting them to a multimodal setting, where a new visual model is trained in tandem with an already pretrained language model. However, due to the excessive memory requirements of Transformers, existing work typically fixes the language model and trains only the vision module, which limits its ability to learn cross-modal information in an end-to-end manner. In this work, we focus on reducing the parameters of multimodal Transformers in the context of audio-visual video representation learning. We alleviate the high memory requirement by sharing the parameters of Transformers across layers and modalities; we decompose the Transformer into modality-specific and modality-shared parts so that the model learns the dynamics of each modality both individually and together, and propose a novel parameter sharing scheme…
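To make the idea of sharing parameters across layers through a low-rank factorization more concrete, here is a minimal NumPy sketch. It is not the authors' implementation: the two-factor form W = U V, the choice to share U and keep V private per layer, and all names (`make_shared_lowrank_layers`, `layer_weight`, `count_params`, `d_model`, `rank`) are assumptions made purely for illustration. It only shows how one shared factor plus small per-layer factors can stand in for a stack of full weight matrices, and how the parameter count changes as a result.

```python
import numpy as np


def make_shared_lowrank_layers(num_layers, d_model, rank, seed=0):
    """Create one shared factor U and one private factor V per layer.

    Hypothetical illustration: each layer's weight is U @ V[layer], so only
    the small V matrices differ between layers.
    """
    rng = np.random.default_rng(seed)
    shared_u = rng.standard_normal((d_model, rank)) / np.sqrt(d_model)
    private_v = [
        rng.standard_normal((rank, d_model)) / np.sqrt(rank)
        for _ in range(num_layers)
    ]
    return shared_u, private_v


def layer_weight(shared_u, private_v, layer):
    """Reconstruct the (d_model x d_model) weight used by a given layer."""
    return shared_u @ private_v[layer]


def count_params(num_layers, d_model, rank):
    """Compare parameter counts: full per-layer matrices vs. shared low-rank."""
    full = num_layers * d_model * d_model
    shared = d_model * rank + num_layers * rank * d_model
    return full, shared


if __name__ == "__main__":
    # Illustrative sizes only; they are not taken from the paper.
    full, shared = count_params(num_layers=12, d_model=768, rank=64)
    print(f"full: {full}, shared low-rank: {shared}, "
          f"reduction: {1 - shared / full:.1%}")
```

The achievable reduction in this sketch depends entirely on the assumed rank and number of layers, so it should not be read as reproducing the 97% figure reported for the paper's method.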
Taxonomy
Topics: Music and Audio Processing · Multimodal Machine Learning Applications · Speech and Audio Processing
Methods: Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Multi-Head Attention · Residual Connection · Adam · Dense Connections · Dropout · Attention Is All You Need · Softmax
