Weakly Supervised Video Representation Learning with Unaligned Text for Sequential Videos

Sixun Dong; Huazhang Hu; Dongze Lian; Weixin Luo; Yicheng Qian; Shenghua Gao

arXiv:2303.12370·cs.CV·March 29, 2023·1 cites

TL;DR

This paper introduces a weakly supervised method for sequential video understanding that leverages unaligned text and pseudo labels, employing contrastive losses and a transformer-based video encoder to improve text-video matching.

Contribution

It proposes a novel weakly supervised framework using pseudo frame-sentence correspondence and multiple granularity contrastive losses for sequential video understanding.

Findings

01 Outperforms baseline methods significantly in video sequence verification.

02 Effective in text-to-video matching tasks with unaligned text.

03 Validates the use of pseudo labels for weakly supervised learning.

Abstract

Sequential video understanding, an emerging video understanding task, has drawn considerable attention from researchers because of its goal-oriented nature. This paper studies weakly supervised sequential video understanding, where accurate timestamp-level text-video alignment is not provided. We solve this task by borrowing ideas from CLIP. Specifically, we use a transformer to aggregate frame-level features into a video representation, and a pre-trained text encoder to encode the text corresponding to each action and to the whole video, respectively. To model the correspondence between text and video, we propose a multiple granularity loss, in which a video-paragraph contrastive loss enforces matching between the whole video and the complete script, and a fine-grained frame-sentence contrastive loss enforces matching between each action and its description. As the frame-sentence…
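The video-paragraph contrastive loss described above can be illustrated with a generic CLIP-style symmetric loss. The following is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the temperature value of 0.07, and the use of plain Python lists for embeddings are all illustrative choices, and the embeddings are assumed to be non-zero vectors already produced by the video and text encoders. Row i of each input is assumed to be a matched video-paragraph pair.

```python
import math


def l2_normalize(vec):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cosine_logits(video_embs, text_embs, temperature):
    """Cosine similarity between every video and every text, divided by temperature."""
    videos = [l2_normalize(v) for v in video_embs]
    texts = [l2_normalize(t) for t in text_embs]
    return [
        [sum(a * b for a, b in zip(v, t)) / temperature for t in texts]
        for v in videos
    ]


def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        row_max = max(row)  # subtract the max for numerical stability
        log_sum_exp = row_max + math.log(sum(math.exp(x - row_max) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def symmetric_contrastive_loss(video_embs, text_embs, temperature=0.07):
    """Average of the video-to-text and text-to-video contrastive losses.

    temperature=0.07 is an assumed illustrative value, not taken from the paper.
    """
    logits = cosine_logits(video_embs, text_embs, temperature)
    transposed = [list(col) for col in zip(*logits)]
    video_to_text = cross_entropy_diagonal(logits)
    text_to_video = cross_entropy_diagonal(transposed)
    return 0.5 * (video_to_text + text_to_video)


# Toy batch: two videos and their two scripts, each as a 2-D embedding.
videos = [[1.0, 0.0], [0.0, 1.0]]
aligned_texts = [[1.0, 0.0], [0.0, 1.0]]
shuffled_texts = [[0.0, 1.0], [1.0, 0.0]]

aligned_loss = symmetric_contrastive_loss(videos, aligned_texts)
shuffled_loss = symmetric_contrastive_loss(videos, shuffled_texts)
print(aligned_loss < shuffled_loss)
```

In this toy batch the correctly paired scripts yield a lower loss than the shuffled ones, which is the behavior a contrastive objective is meant to reward. The same function could in principle be applied at the frame-sentence granularity once some frame-to-sentence correspondence is available; how that correspondence is obtained is not covered by this sketch.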

Code & Models

Repositories

svip-lab/weaksvr (PyTorch, official)

Taxonomy

Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Video Analysis and Summarization

Methods: Contrastive Language-Image Pre-training