Weakly Supervised Video Representation Learning with Unaligned Text for Sequential Videos
Sixun Dong, Huazhang Hu, Dongze Lian, Weixin Luo, Yicheng Qian, Shenghua Gao

TL;DR
This paper introduces a weakly supervised method for sequential video understanding that leverages unaligned text and pseudo labels, employing contrastive losses and a transformer-based video encoder to improve text-video matching.
Contribution
It proposes a novel weakly supervised framework using pseudo frame-sentence correspondence and multiple granularity contrastive losses for sequential video understanding.
Findings
Outperforms baseline methods significantly in video sequence verification.
Effective in text-to-video matching tasks with unaligned text.
Validates the use of pseudo labels for weakly supervised learning.
Abstract
Sequential video understanding, an emerging video understanding task, has drawn considerable attention from researchers because of its goal-oriented nature. This paper studies weakly supervised sequential video understanding, where accurate timestamp-level text-video alignment is not provided. We solve this task by borrowing ideas from CLIP. Specifically, we use a transformer to aggregate frame-level features into a video representation, and a pre-trained text encoder to encode the text corresponding to each action and to the whole video, respectively. To model the correspondence between text and video, we propose a multiple granularity loss, in which a video-paragraph contrastive loss enforces matching between the whole video and the complete script, and a fine-grained frame-sentence contrastive loss enforces matching between each action and its description. As the frame-sentence…
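The two contrastive terms described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: it assumes a CLIP-style symmetric InfoNCE objective at both granularities, a temperature of 0.07, random stand-in embeddings, and an unweighted sum of the two terms. The names `symmetric_info_nce`, `video_emb`, `paragraph_emb`, `segment_emb`, and `sentence_emb` are hypothetical, and the sketch takes the action-to-sentence pairing as given rather than showing how pseudo correspondences are produced.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale each vector to (approximately) unit length."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def symmetric_info_nce(a, b, temperature=0.07):
    """CLIP-style loss: a[i] and b[i] are a matched pair, other rows are negatives.

    The temperature value is an assumption chosen for illustration.
    """
    a, b = l2_normalize(a), l2_normalize(b)
    logits = a @ b.T / temperature           # pairwise cosine similarities, scaled
    idx = np.arange(len(a))
    loss_a2b = -log_softmax(logits, axis=1)[idx, idx].mean()  # a retrieves b
    loss_b2a = -log_softmax(logits, axis=0)[idx, idx].mean()  # b retrieves a
    return 0.5 * (loss_a2b + loss_b2a)


# Random stand-ins for encoder outputs (hypothetical shapes).
rng = np.random.default_rng(0)
batch, n_actions, dim = 4, 3, 16
video_emb = rng.normal(size=(batch, dim))        # one vector per whole video
paragraph_emb = rng.normal(size=(batch, dim))    # one vector per complete script
segment_emb = rng.normal(size=(n_actions, dim))  # pooled frames per action (pairing assumed given)
sentence_emb = rng.normal(size=(n_actions, dim)) # one vector per action description

coarse = symmetric_info_nce(video_emb, paragraph_emb)  # video-paragraph term
fine = symmetric_info_nce(segment_emb, sentence_emb)   # frame-sentence term
total = coarse + fine  # unweighted sum is an assumption
```

Reusing a single loss function at both levels keeps the sketch short; only the pairing changes between the coarse and fine terms.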
Taxonomy
Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Video Analysis and Summarization
Methods: Contrastive Language-Image Pre-training
