Distilling Vision-Language Models on Millions of Videos

Yue Zhao; Long Zhao; Xingyi Zhou; Jialin Wu; Chun-Te Chu; Hui Miao; Florian Schroff; Hartwig Adam; Ting Liu; Boqing Gong; Philipp Krähenbühl; Liangzhe Yuan

arXiv:2401.06129·cs.CV·April 17, 2024·2 cites

TL;DR

This paper introduces a method to create high-quality video captions by fine-tuning vision-language models with synthesized data, enabling improved video understanding and retrieval without extensive human-annotated datasets.

Contribution

We propose a video-instruction-tuning approach that adapts a strong image-language model to video using synthesized instructional data, then uses the adapted model to auto-label millions of videos with captions that serve as supervision for video-language models.

Findings

01. Surpasses the best prior result on open-ended NExT-QA by 2.8%.

02. Improves zero-shot text-to-video retrieval on MSR-VTT by 6%.

03. Generates the largest video caption dataset to date.
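
Text-to-video retrieval benchmarks such as MSR-VTT are commonly scored with Recall@K, the fraction of text queries whose matching video appears among the top K results. This page does not say which metric the 6% figure refers to, so the following is only a minimal sketch of how such a score could be computed. The function name, the one-to-one pairing of text i with video i, and the toy similarity values are all assumptions made for illustration.

```python
from typing import Sequence


def recall_at_k(similarity: Sequence[Sequence[float]], k: int) -> float:
    """Fraction of text queries whose matching video ranks in the top k.

    Assumes similarity[i][j] scores text i against video j, and that
    video i is the single correct match for text i (an assumption of
    this sketch, not a detail taken from the paper).
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    if len(similarity) == 0:
        raise ValueError("similarity must contain at least one query")
    hits = 0
    for i, row in enumerate(similarity):
        # Zero-based rank of the correct video: how many videos score strictly higher.
        rank = sum(1 for score in row if score > row[i])
        if rank < k:
            hits += 1
    return hits / len(similarity)


# Toy scores for 3 text queries against 3 videos (made-up numbers).
toy_similarity = [
    [0.9, 0.1, 0.3],  # text 0: correct video ranked first
    [0.8, 0.5, 0.2],  # text 1: correct video ranked second
    [0.4, 0.7, 0.6],  # text 2: correct video ranked second
]
r1 = recall_at_k(toy_similarity, k=1)
r2 = recall_at_k(toy_similarity, k=2)
```

In this toy case only the first query retrieves its video at rank one, while all three succeed within the top two. Ties are resolved in favor of the correct video here, which is one of several possible conventions.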

Abstract

The recent advance in vision-language models is largely attributed to the abundance of image-text data. We aim to replicate this success for video-language models, but there simply is not enough human-curated video-text data available. We thus resort to fine-tuning a video-language model from a strong image-language baseline with synthesized instructional data. The resulting video model by video-instruction-tuning (VIIT) is then used to auto-label millions of videos to generate high-quality captions. We show the adapted video-language model performs well on a wide range of video-language benchmarks. For instance, it surpasses the best prior result on open-ended NExT-QA by 2.8%. Besides, our model generates detailed descriptions for previously unseen videos, which provide better textual supervision than existing methods. Experiments show that a video-language dual-encoder model…

Taxonomy

Topics: Multimodal Machine Learning Applications · Natural Language Processing Techniques · Topic Modeling