Distilling Vision-Language Models on Millions of Videos
Yue Zhao, Long Zhao, Xingyi Zhou, Jialin Wu, Chun-Te Chu, Hui Miao, Florian Schroff, Hartwig Adam, Ting Liu, Boqing Gong, Philipp Krähenbühl, Liangzhe Yuan

TL;DR
This paper introduces a method to create high-quality video captions by fine-tuning vision-language models with synthesized data, enabling improved video understanding and retrieval without extensive human-annotated datasets.
Contribution
We propose a novel video-instruction-tuning approach that leverages image-language models and auto-labeling to enhance video-language modeling and benchmark performance.
Findings
Outperforms the best prior result on open-ended NExT-QA by 2.8%.
Achieves a 6% improvement in zero-shot text-to-video retrieval on MSR-VTT.
Generates the largest video caption dataset to date.
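For readers unfamiliar with how text-to-video retrieval is usually scored, here is a minimal sketch. It assumes the metric is Recall@K computed over cosine similarities between dual-encoder text and video embeddings, and that query i is paired with video i; this summary does not state the metric. The function name `recall_at_k` and the toy embeddings are hypothetical illustrations, not the authors' evaluation code.

```python
import numpy as np


def recall_at_k(text_emb: np.ndarray, video_emb: np.ndarray, k: int = 1) -> float:
    """Fraction of text queries whose paired video ranks in the top k.

    Assumption: row i of text_emb is paired with row i of video_emb.
    """
    # L2-normalize so the dot product equals cosine similarity.
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    v = video_emb / np.linalg.norm(video_emb, axis=1, keepdims=True)
    sim = t @ v.T  # shape: (num_texts, num_videos)

    # Rank of the paired video = number of videos scoring strictly higher.
    paired_score = np.diag(sim)[:, None]
    ranks = (sim > paired_score).sum(axis=1)
    return float((ranks < k).mean())


# Toy example with made-up 3-dimensional embeddings.
videos = np.eye(3)
texts = np.array([
    [1.0, 0.0, 0.0],  # closest to video 0 (its pair)
    [0.0, 1.0, 0.0],  # closest to video 1 (its pair)
    [1.0, 0.0, 0.5],  # pair is video 2, but video 0 scores higher
])
r1 = recall_at_k(texts, videos, k=1)
r2 = recall_at_k(texts, videos, k=2)
```

In this toy case the third query ranks its paired video second, so it counts as a hit only once k is at least 2.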
Abstract
The recent advance in vision-language models is largely attributed to the abundance of image-text data. We aim to replicate this success for video-language models, but there simply is not enough human-curated video-text data available. We thus resort to fine-tuning a video-language model from a strong image-language baseline with synthesized instructional data. The resulting video model by video-instruction-tuning (VIIT) is then used to auto-label millions of videos to generate high-quality captions. We show the adapted video-language model performs well on a wide range of video-language benchmarks. For instance, it surpasses the best prior result on open-ended NExT-QA by 2.8%. Besides, our model generates detailed descriptions for previously unseen videos, which provide better textual supervision than existing methods. Experiments show that a video-language dual-encoder model…
Taxonomy
Topics: Multimodal Machine Learning Applications · Natural Language Processing Techniques · Topic Modeling
