Pretrained Image-Text Models are Secretly Video Captioners

Chunhui Zhang; Yiren Jian; Zhongyu Ouyang; Soroush Vosoughi

arXiv:2502.13363·cs.CV·February 20, 2025

TL;DR

This paper shows that a simple adaptation of pretrained image-text models, using minimal data and resources, can effectively perform video captioning, rivaling specialized systems on major benchmarks.

Contribution

It introduces a resource-efficient method to repurpose image captioning models for video captioning without complex modifications.

Findings

01. Achieved top-tier performance on major benchmarks, ranking 2nd on MSRVTT and MSVD and 3rd on VATEX.

02. Used only 6,000 video-text pairs for adaptation, far fewer than the 2.5 to 144 million pairs used by other methods.

03. Demonstrated that lightweight, image-based models can rival state-of-the-art video captioners.

Abstract

Developing video captioning models is computationally expensive. The dynamic nature of video also complicates the design of multimodal models that can effectively caption these sequences. However, we find that by using minimal computational resources and without complex modifications to address video dynamics, an image-based model can be repurposed to outperform several specialised video captioning systems. Our adapted model demonstrates top-tier performance on major benchmarks, ranking 2nd on MSRVTT and MSVD, and 3rd on VATEX. We transform a typical image captioning model, BLIP2, into a competitive video captioner by post-training it with only 6,000 video-text pairs and simply concatenating frames. This is significantly less data than other methods, which use 2.5 to 144 million pairs. From a resource optimization perspective, this video captioning study focuses on three fundamental factors:…
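The abstract does not spell out how frames are concatenated, so the following is only a minimal sketch of one plausible reading: sample a few frames evenly across the clip, then join their per-frame visual tokens into a single sequence that an image-text model could consume. The function names, the uniform sampling scheme, and the toy token layout are all assumptions made for illustration; this is not the authors' implementation.

```python
def sample_frame_indices(total_frames, num_samples):
    """Pick evenly spaced frame indices across a clip (assumed sampling scheme)."""
    if not 1 <= num_samples <= total_frames:
        raise ValueError("num_samples must be between 1 and total_frames")
    if num_samples == 1:
        return [0]
    step = (total_frames - 1) / (num_samples - 1)
    return [round(i * step) for i in range(num_samples)]


def concat_frame_tokens(frames):
    """Join per-frame token lists into one flat sequence, preserving frame order."""
    return [token for frame in frames for token in frame]


# Toy stand-in for visual features (hypothetical layout): 32 frames,
# 3 tokens per frame, each token tagged as [frame_index, token_index].
video = [[[f, t] for t in range(3)] for f in range(32)]

indices = sample_frame_indices(total_frames=32, num_samples=8)
sequence = concat_frame_tokens([video[i] for i in indices])

print(indices)
print(len(sequence))
```

In this sketch the sequence length grows linearly with the number of sampled frames, so the frame count directly trades temporal coverage against compute.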

Code & Models

Repositories

chunhuizng/mllm-video-captioner (PyTorch, official)

Videos

Pretrained Image-Text Models are Secretly Video Captioners · Underline

Taxonomy

Topics: Digital Media Forensic Detection