Data-efficient Alignment of Multimodal Sequences by Aligning Gradient Updates and Internal Feature Distributions

Jianan Wang; Boyang Li; Xiangyu Fan; Jing Lin; Yanwei Fu

arXiv:2011.07517·cs.CV·November 17, 2020

TL;DR

This paper introduces techniques to improve data efficiency in multimodal sequence alignment by balancing gradient updates and feature distributions, achieving state-of-the-art results without pretraining.

Contribution

The paper proposes layer-wise adaptive rate scaling (LARS) and sequence-wise batch normalization (SBN) to enhance training stability and performance in multimodal alignment networks.
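As a rough illustration of the layer-wise adaptive rate scaling idea, the sketch below computes a per-layer "local" learning rate from the ratio of the weight norm to the gradient norm, so that layers with very different gradient magnitudes receive updates of comparable size. It is a minimal plain-SGD sketch in pure Python, not the authors' implementation: the function names (`lars_local_lr`, `lars_step`), the flat-list representation of each layer, and the default values of `trust_coef` and `weight_decay` are all assumptions made for illustration.

```python
import math


def l2_norm(values):
    """Euclidean norm of a flat list of floats."""
    return math.sqrt(sum(v * v for v in values))


def lars_local_lr(weights, grads, trust_coef=0.001, weight_decay=0.0, eps=1e-12):
    """Per-layer learning-rate multiplier in the style of LARS.

    local_lr = trust_coef * ||w|| / (||g|| + weight_decay * ||w||)
    Falls back to 1.0 when either norm is zero (an assumption of this sketch).
    """
    w_norm = l2_norm(weights)
    g_norm = l2_norm(grads)
    if w_norm == 0.0 or g_norm == 0.0:
        return 1.0
    return trust_coef * w_norm / (g_norm + weight_decay * w_norm + eps)


def lars_step(layers, global_lr=0.1, trust_coef=0.001, weight_decay=0.0):
    """Apply one SGD step with layer-wise scaling (no momentum).

    `layers` is a list of (weights, grads) pairs, each a flat list of floats.
    Returns a new list of updated weight lists; inputs are not modified.
    """
    updated = []
    for weights, grads in layers:
        local_lr = lars_local_lr(weights, grads, trust_coef, weight_decay)
        step = global_lr * local_lr
        updated.append(
            [w - step * (g + weight_decay * w) for w, g in zip(weights, grads)]
        )
    return updated


if __name__ == "__main__":
    # Two hypothetical layers with equal weights but gradients 1000x apart.
    layers = [([3.0, 4.0], [0.3, 0.4]), ([3.0, 4.0], [300.0, 400.0])]
    for new_w in lars_step(layers, global_lr=1.0, trust_coef=0.1):
        print(new_w)
```

With zero weight decay, the update norm for each layer is approximately `global_lr * trust_coef * ||w||` regardless of the gradient norm, which is the sense in which the update magnitudes are balanced across layers.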

Findings

1. The proposed techniques improve optimization and regularization.
2. The approach achieves state-of-the-art results on the YouTube Movie Summary dataset.
3. The approach reduces reliance on pretraining.

Abstract

The task of video and text sequence alignment is a prerequisite step toward joint understanding of movie videos and screenplays. However, supervised methods face the obstacle of limited realistic training data. With this paper, we attempt to enhance data efficiency of the end-to-end alignment network NeuMATCH [15]. Recent research [56] suggests that network components dealing with different modalities may overfit and generalize at different speeds, creating difficulties for training. We propose to employ (1) layer-wise adaptive rate scaling (LARS) to align the magnitudes of gradient updates in different layers and balance the pace of learning and (2) sequence-wise batch normalization (SBN) to align the internal feature distributions from different modalities. Finally, we leverage random projection to reduce the dimensionality of input features. On the YouTube Movie Summary dataset, the…
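The abstract mentions random projection for reducing the dimensionality of input features but this page does not say which variant is used. The sketch below shows one common variant, a Gaussian random projection, in pure Python. The function names (`make_projection`, `project`), the dimensions, and the seed are hypothetical choices for illustration and are not taken from the paper or its repository.

```python
import math
import random


def make_projection(in_dim, out_dim, seed=0):
    """Build a fixed (untrained) out_dim x in_dim Gaussian projection matrix.

    Entries are drawn from N(0, 1) and scaled by 1 / sqrt(out_dim), so that
    squared vector norms are preserved in expectation.
    """
    rng = random.Random(seed)
    scale = 1.0 / math.sqrt(out_dim)
    return [
        [rng.gauss(0.0, 1.0) * scale for _ in range(in_dim)]
        for _ in range(out_dim)
    ]


def project(matrix, vector):
    """Map a feature vector of length in_dim to length out_dim."""
    return [sum(r * x for r, x in zip(row, vector)) for row in matrix]


if __name__ == "__main__":
    # Hypothetical sizes: reduce a 512-dimensional feature to 64 dimensions.
    R = make_projection(in_dim=512, out_dim=64, seed=0)
    feature = [random.Random(1).uniform(-1.0, 1.0) for _ in range(512)]
    reduced = project(R, feature)
    print(len(reduced))
```

Because the matrix is sampled once and kept fixed, the projection adds no trainable parameters, which is one reason such a step can be attractive when training data is limited.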

Code & Models

Repositories

RubbyJ/Data-efficient-Alignment (PyTorch, official)

Taxonomy

Topics: Video Analysis and Summarization · Multimodal Machine Learning Applications · Human Pose and Action Recognition

Methods: Batch Normalization