Deep Multimodal Feature Encoding for Video Ordering
Vivek Sharma, Makarand Tapaswi, Rainer Stiefelhagen

TL;DR
This paper introduces a multimodal feature encoding for videos that combines frames, audio, and text to improve performance on video understanding tasks such as temporal ordering and action recognition.
Contribution
It proposes a joint multimodal feature learning method trained via a proxy task of ordering videos in a timeline, and introduces a new dataset of approximately 30K scenes for evaluation.
Findings
Multimodal representations outperform unimodal ones in video ordering.
The modalities provide complementary information, and combining them improves action recognition.
The approach improves performance on challenging video understanding tasks.
Abstract
True understanding of videos comes from a joint analysis of all their modalities: the video frames, the audio track, and any accompanying text such as closed captions. We present a way to learn a compact multimodal feature representation that encodes all these modalities. Our model parameters are learned through a proxy task of inferring the temporal ordering of a set of unordered videos in a timeline. To this end, we create a new multimodal dataset for temporal ordering that consists of approximately 30K scenes (2-6 clips per scene) based on the "Large Scale Movie Description Challenge". We analyze and evaluate the individual and joint modalities on two challenging tasks: (i) inferring the temporal ordering of a set of videos; and (ii) action recognition. We demonstrate empirically that multimodal representations are indeed complementary, and can play a key role in improving the…
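The ordering proxy task described in the abstract can be illustrated with a minimal sketch. This is not the authors' model or evaluation code: the helper names (`fuse`, `make_ordering_example`, `pairwise_order_accuracy`), the concatenation-based fusion, and the pairwise scoring metric are all assumptions chosen for illustration. The sketch only shows how a scene's clips can be shuffled to create an ordering target and how a predicted order might be scored.

```python
import random
from itertools import combinations


def fuse(clip):
    """Hypothetical stand-in for a multimodal encoding: concatenates
    per-modality feature vectors. The paper learns its encoding; this does not."""
    return clip["frames"] + clip["audio"] + clip["text"]


def make_ordering_example(clips, rng):
    """Shuffle the clips of one scene.

    Returns (shuffled, target), where target[i] is the original timeline
    position of the i-th shuffled clip.
    """
    order = list(range(len(clips)))
    rng.shuffle(order)
    shuffled = [clips[i] for i in order]
    return shuffled, order


def pairwise_order_accuracy(predicted, target):
    """Fraction of clip pairs whose relative order agrees between the
    predicted and true timeline positions (an assumed metric).
    Requires at least two clips."""
    pairs = list(combinations(range(len(target)), 2))
    agree = sum(
        (predicted[i] < predicted[j]) == (target[i] < target[j])
        for i, j in pairs
    )
    return agree / len(pairs)


if __name__ == "__main__":
    rng = random.Random(0)
    # A toy scene with 4 clips (the dataset has 2-6 clips per scene).
    scene = [
        {"frames": [float(k)], "audio": [0.5 * k], "text": [0.1 * k]}
        for k in range(4)
    ]
    shuffled, target = make_ordering_example(scene, rng)
    features = [fuse(c) for c in shuffled]
    # Toy "prediction": rank shuffled clips by their first fused feature.
    ranked = sorted(range(len(features)), key=lambda i: features[i][0])
    predicted = [ranked.index(i) for i in range(len(features))]
    print(pairwise_order_accuracy(predicted, target))
```

In this toy scene the first feature encodes the clip index directly, so ranking by it recovers the timeline; a real model would have to infer the order from learned features.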
Taxonomy
Topics: Human Pose and Action Recognition · Multimodal Machine Learning Applications · Video Analysis and Summarization
