Bi-Calibration Networks for Weakly-Supervised Video Representation Learning
Fuchen Long, Ting Yao, Zhaofan Qiu, Xinmei Tian, Jiebo Luo, and Tao Mei

TL;DR
This paper introduces Bi-Calibration Networks (BCN), a weakly-supervised video representation learning method that trains on web videos paired with their search queries and titles and achieves superior performance on downstream tasks.
Contribution
The paper proposes a new mutual calibration approach between query and text, along with large-scale web video datasets, to improve weakly-supervised video representation learning.
Findings
BCN outperforms state-of-the-art methods on downstream tasks.
Large-scale datasets YOVO-3M and YOVO-10M enable effective training.
Pre-training on 10M web videos followed by fine-tuning yields 1.6-1.8% accuracy gains over state-of-the-art methods.
Abstract
The leverage of large volumes of web videos paired with the searched queries or surrounding texts (e.g., title) offers an economic and extensible alternative to supervised video representation learning. Nevertheless, modeling such weakly visual-textual connection is not trivial due to query polysemy (i.e., many possible meanings for a query) and text isomorphism (i.e., same syntactic structure of different text). In this paper, we introduce a new design of mutual calibration between query and text to boost weakly-supervised video representation learning. Specifically, we present Bi-Calibration Networks (BCN) that novelly couples two calibrations to learn the amendment from text to query and vice versa. Technically, BCN executes clustering on all the titles of the videos searched by an identical query and takes the centroid of each cluster as a text prototype. The query vocabulary is…
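The prototype step described in the abstract (cluster the titles of all videos retrieved by the same query, then take each cluster centroid as a text prototype) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: it assumes the titles have already been encoded as fixed-length vectors by some text encoder, it uses plain k-means as the clustering algorithm, and the names `kmeans`, `text_prototypes`, and the cluster count `k` are hypothetical.

```python
import numpy as np


def kmeans(points, k, n_iter=20, seed=0):
    """Plain Lloyd's k-means (an assumed choice); returns (centroids, assignments)."""
    points = np.asarray(points, dtype=float)
    rng = np.random.default_rng(seed)
    k = min(k, len(points))
    # Initialise the centroids from k distinct data points.
    centroids = points[rng.choice(len(points), size=k, replace=False)].copy()

    def assign_to_nearest(cents):
        # Squared Euclidean distance from every point to every centroid.
        dists = ((points[:, None, :] - cents[None, :, :]) ** 2).sum(axis=-1)
        return dists.argmin(axis=1)

    for _ in range(n_iter):
        assign = assign_to_nearest(centroids)
        for j in range(k):
            members = points[assign == j]
            if len(members) > 0:  # keep the old centroid if a cluster empties
                centroids[j] = members.mean(axis=0)
    return centroids, assign_to_nearest(centroids)


def text_prototypes(title_embeddings, queries, k=2):
    """Group title embeddings by query; return {query: array of centroids}."""
    title_embeddings = np.asarray(title_embeddings, dtype=float)
    queries = np.asarray(queries)
    prototypes = {}
    for q in np.unique(queries):
        centroids, _ = kmeans(title_embeddings[queries == q], k)
        prototypes[str(q)] = centroids  # each row is one text prototype
    return prototypes


# Toy data: 4-dimensional stand-ins for title embeddings (hypothetical values).
rng = np.random.default_rng(0)
emb = np.concatenate([
    rng.normal(loc=0.0, scale=0.1, size=(5, 4)),   # one sense of "playing piano"
    rng.normal(loc=5.0, scale=0.1, size=(5, 4)),   # another sense of the same query
    rng.normal(loc=-5.0, scale=0.1, size=(6, 4)),  # titles for "surfing"
])
queries = ["playing piano"] * 10 + ["surfing"] * 6
protos = text_prototypes(emb, queries, k=2)
print({q: p.shape for q, p in protos.items()})
```

In this toy setup the two prototypes for "playing piano" land near the two groups of titles, which mirrors how several prototypes per query could capture different meanings of a polysemous query.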
Taxonomy
Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Domain Adaptation and Few-Shot Learning
Methods: Temporal Difference Network
