LongCaptioning: Unlocking the Power of Long Video Caption Generation in Large Multimodal Models
Hongchen Wei, Zhihong Tan, Yaosi Hu, Chang Wen Chen, Zhenzhong Chen

TL;DR
This paper addresses the challenge of generating detailed long captions for long videos using Large Multimodal Models, proposing a new data synthesis framework and benchmark to improve and evaluate long caption generation.
Contribution
We introduce LongCaption-Agent, a hierarchical data synthesis framework, and create the LongCaption-10K and LongCaption-Bench datasets to enhance and assess long video captioning capabilities.
Findings
Open-source LMMs struggle to consistently generate captions longer than 300 words for long videos.
Training with synthesized long-caption data enables models to produce captions of over 1,000 words.
Our approach achieves state-of-the-art results on long captioning benchmarks.
Abstract
Large Multimodal Models (LMMs) have demonstrated exceptional performance in video captioning tasks, particularly for short videos. However, as the length of the video increases, generating long, detailed captions becomes a significant challenge. In this paper, we investigate the limitations of LMMs in generating long captions for long videos. Our analysis reveals that open-source LMMs struggle to consistently produce outputs exceeding 300 words, leading to incomplete or overly concise descriptions of the visual content. This limitation hinders the ability of LMMs to provide comprehensive and detailed captions for long videos, so important visual information is ultimately missed. Through controlled experiments, we find that the scarcity of paired examples with long captions during training is the primary factor limiting the model's output length. However, manually annotating long-caption…
Taxonomy
Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Advanced Vision and Imaging
