Data-efficient Alignment of Multimodal Sequences by Aligning Gradient Updates and Internal Feature Distributions
Jianan Wang, Boyang Li, Xiangyu Fan, Jing Lin, Yanwei Fu

TL;DR
This paper introduces techniques to improve data efficiency in multimodal sequence alignment by balancing gradient updates and feature distributions, achieving state-of-the-art results without pretraining.
Contribution
It proposes layer-wise adaptive rate scaling and sequence-wise batch normalization to enhance training stability and performance in multimodal alignment networks.
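As a rough illustration of how layer-wise adaptive rate scaling balances update magnitudes, the sketch below applies the commonly cited LARS trust ratio (local rate proportional to the weight norm divided by the gradient norm) to plain SGD. It is a minimal sketch, not the paper's implementation: the function names, the `trust_coef` and `base_lr` values, the omission of momentum, and the toy "video" and "text" layers are all assumptions made for illustration.

```python
import math


def l2_norm(values):
    """Euclidean norm of a flat list of floats."""
    return math.sqrt(sum(v * v for v in values))


def lars_step(weights, grads, base_lr=0.1, trust_coef=0.001,
              weight_decay=0.0, eps=1e-12):
    """One momentum-free SGD step with a per-layer trust ratio.

    weights, grads: dict mapping a layer name to a flat list of floats.
    All hyperparameter defaults are illustrative assumptions.
    """
    new_weights = {}
    for name, w in weights.items():
        g = grads[name]
        w_norm = l2_norm(w)
        g_norm = l2_norm(g)
        if w_norm > 0.0 and g_norm > 0.0:
            # Trust ratio: rescales the step so that the update size
            # tracks the weight norm rather than the raw gradient norm.
            local_lr = trust_coef * w_norm / (g_norm + weight_decay * w_norm + eps)
        else:
            # Fall back to the unscaled step for degenerate layers.
            local_lr = 1.0
        step = base_lr * local_lr
        new_weights[name] = [
            wi - step * (gi + weight_decay * wi) for wi, gi in zip(w, g)
        ]
    return new_weights


# Two hypothetical layers with identical weights but gradients that
# differ in magnitude by a factor of 1000.
weights = {"video": [3.0, 4.0], "text": [3.0, 4.0]}
grads = {"video": [0.6, 0.8], "text": [600.0, 800.0]}
updated = lars_step(weights, grads)

update_norms = {
    name: l2_norm([a - b for a, b in zip(weights[name], updated[name])])
    for name in weights
}
print(update_norms)
```

With weight decay set to zero, both layers move by nearly the same distance despite the 1000-fold gap in gradient magnitude, which is the balancing effect the summary describes.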
Findings
LARS and SBN improve both optimization and regularization.
Achieves state-of-the-art results on the YouTube Movie Summary dataset.
Reduces reliance on pretraining.
Abstract
The task of video and text sequence alignment is a prerequisite step toward joint understanding of movie videos and screenplays. However, supervised methods face the obstacle of limited realistic training data. With this paper, we attempt to enhance data efficiency of the end-to-end alignment network NeuMATCH [15]. Recent research [56] suggests that network components dealing with different modalities may overfit and generalize at different speeds, creating difficulties for training. We propose to employ (1) layer-wise adaptive rate scaling (LARS) to align the magnitudes of gradient updates in different layers and balance the pace of learning and (2) sequence-wise batch normalization (SBN) to align the internal feature distributions from different modalities. Finally, we leverage random projection to reduce the dimensionality of input features. On the YouTube Movie Summary dataset, the…
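The random projection step mentioned in the abstract can be sketched as multiplication by a fixed, untrained Gaussian matrix. This is a generic illustration under stated assumptions, not the authors' code: the abstract does not say which projection variant or dimensions are used, so `make_projection`, `project`, the Gaussian entries scaled by 1/sqrt(out_dim), and the sizes 512 and 64 are all hypothetical choices.

```python
import math
import random


def make_projection(in_dim, out_dim, seed=0):
    """Build a fixed out_dim x in_dim Gaussian matrix (never trained).

    Entries are N(0, 1) scaled by 1/sqrt(out_dim), so squared vector
    norms are preserved in expectation.
    """
    rng = random.Random(seed)
    scale = 1.0 / math.sqrt(out_dim)
    return [
        [rng.gauss(0.0, 1.0) * scale for _ in range(in_dim)]
        for _ in range(out_dim)
    ]


def project(matrix, vector):
    """Map an in_dim feature vector to out_dim dimensions."""
    return [sum(r * x for r, x in zip(row, vector)) for row in matrix]


# Hypothetical sizes: a 512-dimensional input feature reduced to 64.
in_dim, out_dim = 512, 64
projection = make_projection(in_dim, out_dim, seed=0)

feature_rng = random.Random(1)
feature = [feature_rng.uniform(-1.0, 1.0) for _ in range(in_dim)]
reduced = project(projection, feature)
print(len(feature), "->", len(reduced))
```

Because the matrix is fixed rather than learned, the projection adds no trainable parameters, which fits the data-efficiency goal stated in the abstract.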
Taxonomy
Topics: Video Analysis and Summarization · Multimodal Machine Learning Applications · Human Pose and Action Recognition
Methods: Batch Normalization
