Wolf: Dense Video Captioning with a World Summarization Framework

Boyi Li; Ligeng Zhu; Ran Tian; Shuhan Tan; Yuxiao Chen; Yao Lu; Yin Cui; Sushant Veer; Max Ehrlich; Jonah Philion; Xinshuo Weng; Fuzhao Xue; Linxi Fan; Yuke Zhu; Jan Kautz; Andrew Tao; Ming-Yu Liu; Sanja Fidler; Boris Ivanovic; Trevor Darrell; Jitendra Malik; Song Han; Marco Pavone

arXiv:2407.18908·cs.LG·March 21, 2025·1 cites

Open Access

TL;DR

Wolf is a framework for dense video captioning that combines multiple vision-language models to improve caption quality across diverse domains. The paper also introduces CapScore, an LLM-based evaluation metric, and reports that Wolf surpasses existing methods.

Contribution

The paper introduces Wolf, a mixture-of-experts framework that combines multiple vision-language models for video captioning, and CapScore, a new LLM-based metric for evaluating caption quality.

Findings

01 Wolf outperforms state-of-the-art methods and commercial solutions in caption quality.

02 CapScore correlates well with human judgment of caption quality.

03 The framework is validated on four human-annotated datasets spanning three domains: autonomous driving, general scenes, and robotics.

Abstract

We propose Wolf, a WOrLd summarization Framework for accurate video captioning. Wolf is an automated captioning framework that adopts a mixture-of-experts approach, leveraging complementary strengths of Vision Language Models (VLMs). By utilizing both image and video models, our framework captures different levels of information and summarizes them efficiently. Our approach can be applied to enhance video understanding, auto-labeling, and captioning. To evaluate caption quality, we introduce CapScore, an LLM-based metric to assess the similarity and quality of generated captions compared to the ground truth captions. We further build four human-annotated datasets in three domains: autonomous driving, general scenes, and robotics, to facilitate comprehensive comparisons. We show that Wolf achieves superior captioning performance compared to state-of-the-art approaches from the research…
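The mixture-of-experts idea in the abstract can be illustrated with a minimal sketch: several captioners (for example one image-level and one video-level model) each describe the same frames, and a summarizer merges their outputs into one caption. This is not the authors' implementation. Every name below (`Frame`, `Captioner`, `collect_expert_captions`, `summarize_video`, and the stub experts) is hypothetical, and the stubs stand in for real VLM and LLM calls, which the abstract does not specify.

```python
from typing import Callable, Dict, List, Sequence

# Hypothetical types: a real system would pass decoded images, not strings.
Frame = str
Captioner = Callable[[Sequence[Frame]], str]
Summarizer = Callable[[List[str]], str]


def collect_expert_captions(
    frames: Sequence[Frame], experts: Dict[str, Captioner]
) -> List[str]:
    """Run every expert on the same frames and tag each caption with its source."""
    return [f"[{name}] {caption_fn(frames)}" for name, caption_fn in experts.items()]


def summarize_video(
    frames: Sequence[Frame], experts: Dict[str, Captioner], summarizer: Summarizer
) -> str:
    """Merge the experts' candidate captions into a single summary caption."""
    if not experts:
        raise ValueError("at least one expert captioner is required")
    return summarizer(collect_expert_captions(frames, experts))


# Stub experts and summarizer (assumptions for illustration only).
def image_expert(frames: Sequence[Frame]) -> str:
    return f"key frame shows {frames[0]}"


def video_expert(frames: Sequence[Frame]) -> str:
    return f"{len(frames)} frames of continuous motion"


def join_summarizer(captions: List[str]) -> str:
    # A real summarizer would likely prompt an LLM; this one only concatenates.
    return " | ".join(captions)


frames = ["a car at an intersection", "the car turning left"]
caption = summarize_video(
    frames, {"image": image_expert, "video": video_expert}, join_summarizer
)
print(caption)
```

Passing the experts and the summarizer as plain callables keeps the sketch independent of any particular model API, so each stub can be swapped for a real model call without changing the pipeline.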

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Multimodal Machine Learning Applications