AuroraCap: Efficient, Performant Video Detailed Captioning and a New Benchmark
Wenhao Chai, Enxin Song, Yilun Du, Chenlin Meng, Vashisht Madhavan, Omer Bar-Tal, Jenq-Neng Hwang, Saining Xie, Christopher D. Manning

TL;DR
AuroraCap introduces a simple yet effective large multimodal model for detailed video captioning, utilizing token merging to reduce computational overhead and establishing a new detailed captioning benchmark with an improved evaluation metric.
Contribution
The paper presents AuroraCap, a parameter-efficient video captioning model with a novel token merging strategy, and introduces VDC, a comprehensive benchmark with an innovative evaluation metric.
Findings
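AuroraCap achieves state-of-the-art performance on video and image captioning benchmarks, for example a CIDEr of 88.9 on Flickr30k versus 55.3 for GPT-4V and 82.2 for Gemini-1.5 Pro.
Token merging reduces the number of input visual tokens with little performance loss.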
VDCscore correlates better with human judgments.
Abstract
Video detailed captioning is a key task which aims to generate comprehensive and coherent textual descriptions of video content, benefiting both video understanding and generation. In this paper, we propose AuroraCap, a video captioner based on a large multimodal model. We follow the simplest architecture design without additional parameters for temporal modeling. To address the overhead caused by lengthy video sequences, we implement the token merging strategy, reducing the number of input visual tokens. Surprisingly, we found that this strategy results in little performance loss. AuroraCap shows superior performance on various video and image captioning benchmarks, for example, obtaining a CIDEr of 88.9 on Flickr30k, beating GPT-4V (55.3) and Gemini-1.5 Pro (82.2). However, existing video caption benchmarks only include simple descriptions, consisting of a few dozen words, which…
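The abstract does not spell out how the token merging step works, so the following is only a minimal sketch of one common formulation: a simplified bipartite matching in which tokens are split into two sets, each token in the first set is paired with its most similar token in the second, and the most similar pairs are averaged together. The function names, the parameter `r`, the even/odd split, and the use of plain averaging are all assumptions made for illustration and are not taken from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def merge_tokens(tokens, r):
    """Merge up to r tokens via a simplified bipartite matching.

    tokens: list of feature vectors (lists of floats).
    r: number of tokens to remove (hypothetical parameter name).
    Returns a shorter list of feature vectors.
    """
    set_a = tokens[0::2]  # even-indexed tokens: candidates to be merged away
    set_b = tokens[1::2]  # odd-indexed tokens: merge destinations
    r = min(r, len(set_a))
    if r <= 0 or not set_b:
        return [list(t) for t in tokens]

    # For each token in A, find its most similar partner in B.
    best = []
    for i, a in enumerate(set_a):
        sims = [cosine(a, b) for b in set_b]
        j = max(range(len(set_b)), key=sims.__getitem__)
        best.append((sims[j], i, j))

    # Keep only the r most similar (A, B) pairs.
    best.sort(key=lambda e: e[0], reverse=True)
    merged_pairs = best[:r]
    merged_a = {i for _, i, _ in merged_pairs}

    # Accumulate sums and counts so each merged token is a plain average.
    sums = [list(b) for b in set_b]
    counts = [1] * len(set_b)
    for _, i, j in merged_pairs:
        sums[j] = [s + x for s, x in zip(sums[j], set_a[i])]
        counts[j] += 1

    kept_a = [list(a) for i, a in enumerate(set_a) if i not in merged_a]
    merged_b = [[s / c for s in vec] for vec, c in zip(sums, counts)]
    # Note: this sketch does not preserve the original token order.
    return kept_a + merged_b
```

As a usage example, calling `merge_tokens([[1.0, 0.0], [1.0, 0.1], [0.0, 1.0], [0.1, 1.0]], 1)` returns three vectors instead of four, because one near-duplicate pair has been averaged into a single token. A real visual encoder would apply a step like this to batched tensors inside the transformer layers; the pure-Python version here is only meant to make the idea concrete.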
Taxonomy
Topics: Advanced Vision and Imaging · Video Analysis and Summarization · Multimodal Machine Learning Applications
