Boosting Video Representation Learning with Multi-Faceted Integration
Zhaofan Qiu, Ting Yao, Chong-Wah Ngo, Xiao-Ping Zhang, Dong Wu, Tao Mei

TL;DR
This paper introduces MUFI, a novel framework for learning comprehensive video representations by integrating multifaceted labels from multiple datasets, significantly improving performance on various video understanding tasks.
Contribution
The paper proposes MUFI, a new multi-faceted integration framework that leverages diverse dataset labels to learn richer and more complete video representations.
Findings
MUFI improves action recognition accuracy on the UCF101 and HMDB51 datasets.
MUFI enhances video captioning performance on the MSVD dataset.
Learning from multiple facets yields more versatile video representations.
Abstract
Video content is multifaceted, consisting of objects, scenes, interactions, or actions. Existing datasets mostly label only one of these facets for model training, resulting in a video representation that is biased toward a single facet depending on the training dataset. There is no study yet on how to learn a video representation from multifaceted labels, or on whether multifaceted information is helpful for video representation learning. In this paper, we propose a new learning framework, MUlti-Faceted Integration (MUFI), to aggregate facets from different datasets for learning a representation that could reflect the full spectrum of video content. Technically, MUFI formulates the problem as visual-semantic embedding learning, which explicitly maps the video representation into a rich semantic embedding space, and jointly optimizes the video representation from two perspectives. One is to…
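To make the idea of visual-semantic embedding in the abstract concrete, here is a minimal sketch, not the authors' implementation. It assumes a linear projection from video features into the semantic space, cosine similarity against label embeddings, and a softmax cross-entropy loss; none of these choices is stated in the abstract. All names (`project`, `embedding_loss`, `temperature`) and the toy labels and vectors are hypothetical.

```python
import math

def project(video_feature, weights):
    """Map a video feature (length D) into the semantic space (length E).
    `weights` is an E x D matrix standing in for a learned projection (assumed)."""
    return [sum(w * x for w, x in zip(row, video_feature)) for row in weights]

def cosine(a, b):
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def embedding_loss(video_feature, weights, label_embeddings, target, temperature=0.1):
    """Cross-entropy over scaled cosine similarities between the projected
    video and every label embedding (an assumed objective, for illustration)."""
    z = project(video_feature, weights)
    sims = [cosine(z, e) / temperature for e in label_embeddings]
    probs = softmax(sims)
    return -math.log(probs[target]), probs

# Toy labels for different facets (action, scene, object), placed in one shared
# 2-D semantic space. Names and vectors are made up for illustration.
label_names = ["playing guitar", "kitchen", "dog"]
label_embeddings = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
weights = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]  # 2 x 3 projection
video_feature = [0.9, 0.1, 0.5]

loss, probs = embedding_loss(video_feature, weights, label_embeddings, target=0)
predicted = label_names[probs.index(max(probs))]
print(predicted, loss)
```

Because every label lives in the same embedding space, labels from datasets annotated for different facets can be scored against one video representation, which is the property the abstract attributes to the embedding formulation.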
Taxonomy
Topics: Human Pose and Action Recognition · Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning
Methods: 3 Dimensional Convolutional Neural Network
