AuroraCap: Efficient, Performant Video Detailed Captioning and a New Benchmark

Wenhao Chai; Enxin Song; Yilun Du; Chenlin Meng; Vashisht Madhavan; Omer Bar-Tal; Jenq-Neng Hwang; Saining Xie; Christopher D. Manning

arXiv:2410.03051·cs.CV·April 10, 2025

TL;DR

AuroraCap introduces a simple yet effective large multimodal model for detailed video captioning, utilizing token merging to reduce computational overhead and establishing a new detailed captioning benchmark with an improved evaluation metric.

Contribution

The paper presents AuroraCap, a video captioning model that adds no extra parameters for temporal modeling and uses token merging to reduce the number of input visual tokens. It also introduces VDC, a benchmark for detailed video captioning, together with VDCscore, a new evaluation metric.

Findings

01

AuroraCap performs strongly on video and image captioning benchmarks, for example a CIDEr of 88.9 on Flickr30k, compared with 55.3 for GPT-4V and 82.2 for Gemini-1.5 Pro.

02

Token merging reduces input size with minimal performance loss.

03

VDCscore correlates better with human judgments.
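
Finding 02 refers to token merging, which shortens the sequence of visual tokens before it reaches the language model. The page does not describe the exact procedure, so the following is only a minimal sketch of one common variant (bipartite matching on cosine similarity, in the style of ToMe), not AuroraCap's implementation. The names `cosine`, `merge_tokens`, and the parameter `r` are hypothetical and chosen for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def merge_tokens(tokens, r):
    """Remove up to r tokens by averaging them into their most similar partner.

    Assumption: a simplified ToMe-style bipartite matching, for illustration only.
    tokens: list of feature vectors (lists of floats).
    r: number of tokens to merge away (capped at the size of set A).
    """
    # Split tokens alternately: set A holds even indices, set B holds odd indices.
    a_idx = list(range(0, len(tokens), 2))
    b_idx = list(range(1, len(tokens), 2))
    if not b_idx or r <= 0:
        return [list(t) for t in tokens]
    r = min(r, len(a_idx))

    # For each token in A, find its most similar token in B.
    best = []
    for i in a_idx:
        score, j = max((cosine(tokens[i], tokens[j]), j) for j in b_idx)
        best.append((score, i, j))

    # Keep only the r highest-scoring (A, B) pairs for merging.
    best.sort(key=lambda p: p[0], reverse=True)
    pairs = best[:r]
    merged_a = {i for _, i, _ in pairs}

    # Accumulate each B token together with the A tokens merged into it.
    sums = {j: list(tokens[j]) for j in b_idx}
    counts = {j: 1 for j in b_idx}
    for _, i, j in pairs:
        sums[j] = [s + x for s, x in zip(sums[j], tokens[i])]
        counts[j] += 1

    # Rebuild the sequence in original order, dropping merged A tokens.
    out = []
    for k, t in enumerate(tokens):
        if k in merged_a:
            continue
        if k in sums:
            out.append([s / counts[k] for s in sums[k]])
        else:
            out.append(list(t))
    return out
```

Calling `merge_tokens` on four tokens with `r=1` returns three tokens: the most similar pair is averaged into one vector and the remaining tokens pass through unchanged. In practice such a step would run on batched tensors inside the visual encoder; plain lists are used here only to keep the sketch self-contained.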

Abstract

Video detailed captioning is a key task which aims to generate comprehensive and coherent textual descriptions of video content, benefiting both video understanding and generation. In this paper, we propose AuroraCap, a video captioner based on a large multimodal model. We follow the simplest architecture design without additional parameters for temporal modeling. To address the overhead caused by lengthy video sequences, we implement the token merging strategy, reducing the number of input visual tokens. Surprisingly, we found that this strategy results in little performance loss. AuroraCap shows superior performance on various video and image captioning benchmarks, for example, obtaining a CIDEr of 88.9 on Flickr30k, beating GPT-4V (55.3) and Gemini-1.5 Pro (82.2). However, existing video caption benchmarks only include simple descriptions, consisting of a few dozen words, which…

Taxonomy

Topics: Advanced Vision and Imaging · Video Analysis and Summarization · Multimodal Machine Learning Applications