Boosting Video Representation Learning with Multi-Faceted Integration

Zhaofan Qiu; Ting Yao; Chong-Wah Ngo; Xiao-Ping Zhang; Dong Wu; Tao Mei

arXiv:2201.04023·cs.CV·January 12, 2022

Open Access

TL;DR

This paper introduces MUFI, a novel framework for learning comprehensive video representations by integrating multifaceted labels from multiple datasets, significantly improving performance on various video understanding tasks.

Contribution

The paper proposes MUFI, a new multi-faceted integration framework that leverages diverse dataset labels to learn richer and more complete video representations.

Findings

01

MUFI improves action recognition accuracy on the UCF101 and HMDB51 datasets.

02

MUFI enhances video captioning performance on the MSVD dataset.

03

Learning from multiple facets yields more versatile video representations.

Abstract

Video content is multifaceted, consisting of objects, scenes, interactions, and actions. Existing datasets mostly label only one of these facets for model training, so the resulting video representation is biased toward a single facet that depends on the training dataset. No study has yet examined how to learn a video representation from multifaceted labels, or whether multifaceted information helps video representation learning. In this paper, we propose a new learning framework, MUlti-Faceted Integration (MUFI), to aggregate facets from different datasets for learning a representation that could reflect the full spectrum of video content. Technically, MUFI formulates the problem as visual-semantic embedding learning, which explicitly maps the video representation into a rich semantic embedding space, and jointly optimizes the video representation from two perspectives. One is to…
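To make the visual-semantic embedding idea in the abstract more concrete, here is a minimal sketch of scoring one video against the label sets of several datasets in a shared semantic space. It is not the paper's implementation: the linear projection, the cosine-similarity scoring, the temperature value, the function names, and all toy vectors and label names are assumptions chosen only for illustration.

```python
import math


def project(feature, weights):
    """Linearly map a video feature into the semantic space (assumed mapping)."""
    return [sum(w * x for w, x in zip(row, feature)) for row in weights]


def cosine(a, b):
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def label_probabilities(video_embedding, label_embeddings, temperature=0.1):
    """Softmax over cosine similarities to the labels of one dataset (one facet)."""
    names = list(label_embeddings)
    scores = [cosine(video_embedding, label_embeddings[n]) / temperature for n in names]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return {n: e / total for n, e in zip(names, exps)}


# Hypothetical label embeddings for two facets, each from a different dataset.
facets = {
    "action": {"playing guitar": [1.0, 0.0, 0.0], "swimming": [0.0, 1.0, 0.0]},
    "scene": {"stage": [0.9, 0.1, 0.2], "pool": [0.1, 0.9, 0.1]},
}

# Hypothetical 3x4 projection matrix and 4-dimensional video feature.
weights = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
]
video_feature = [0.8, 0.1, 0.1, 0.5]

embedding = project(video_feature, weights)
predictions = {
    facet: label_probabilities(embedding, labels) for facet, labels in facets.items()
}
best = {facet: max(probs, key=probs.get) for facet, probs in predictions.items()}
print(best)
```

Because every facet's labels live in the same space, a single projected embedding can be compared against labels from any dataset. In a trained system the projection and the video backbone would be learned rather than fixed by hand as they are here.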

Taxonomy

Topics: Human Pose and Action Recognition · Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning

Methods: 3 Dimensional Convolutional Neural Network