Fine-Grained Captioning of Long Videos through Scene Graph Consolidation
Sanghyeok Chu, Seonguk Seo, Bohyung Han

TL;DR
This paper presents a novel graph consolidation framework for long video captioning that generates detailed, coherent descriptions without additional fine-tuning, outperforming existing methods in zero-shot settings.
Contribution
Introduces a graph consolidation approach that combines segment-level captions into a unified representation for long videos, enhancing temporal understanding without extra fine-tuning.
Findings
Outperforms existing LLM-based consolidation methods.
Achieves strong zero-shot performance.
Substantially lowers computational cost compared with LLM-based consolidation.
Abstract
Recent advances in vision-language models have led to impressive progress in caption generation for images and short video clips. However, these models remain constrained by their limited temporal receptive fields, making it difficult to produce coherent and comprehensive captions for long videos. While several methods have been proposed to aggregate information across video segments, they often rely on supervised fine-tuning or incur significant computational overhead. To address these challenges, we introduce a novel framework for long video captioning based on graph consolidation. Our approach first generates segment-level captions, corresponding to individual frames or short video intervals, using off-the-shelf visual captioning models. These captions are then parsed into individual scene graphs, which are subsequently consolidated into a unified graph representation that preserves…
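The consolidation step described in the abstract can be illustrated with a minimal Python sketch. It is a toy illustration under stated assumptions, not the authors' implementation: the `SceneGraph` class, the `merge` and `consolidate` functions, and the rule that objects with the same normalized name refer to the same entity are all hypothetical, and the caption-to-graph parsing step is assumed to have already run (the per-segment graphs are written by hand).

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, Set, Tuple


@dataclass
class SceneGraph:
    """Toy scene graph: object nodes with attribute sets, plus relation edges."""
    objects: Dict[str, Set[str]] = field(default_factory=dict)  # name -> attributes
    relations: Set[Tuple[str, str, str]] = field(default_factory=set)  # (subject, predicate, object)


def _norm(name: str) -> str:
    # Assumption: two mentions denote the same entity if their
    # lowercased, stripped names match. A real system would need
    # a more robust matching rule.
    return name.lower().strip()


def merge(a: SceneGraph, b: SceneGraph) -> SceneGraph:
    """Merge two graphs, unifying objects that share a normalized name."""
    out = SceneGraph()
    for g in (a, b):
        for name, attrs in g.objects.items():
            out.objects.setdefault(_norm(name), set()).update(attrs)
        for subj, pred, obj in g.relations:
            out.relations.add((_norm(subj), pred, _norm(obj)))
    return out


def consolidate(graphs: Iterable[SceneGraph]) -> SceneGraph:
    """Fold a sequence of segment-level graphs into one video-level graph."""
    result = SceneGraph()
    for g in graphs:
        result = merge(result, g)
    return result


# Hand-written graphs standing in for parsed segment-level captions.
segment_graphs = [
    SceneGraph({"man": {"young"}, "guitar": set()}, {("man", "holds", "guitar")}),
    SceneGraph({"Man": {"smiling"}, "stage": set()}, {("Man", "stands on", "stage")}),
]
video_graph = consolidate(segment_graphs)
```

In this sketch the two mentions of the man collapse into one node that carries attributes from both segments, which is the kind of cross-segment unification a consolidated graph is meant to provide before a final caption is generated from it.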
Taxonomy
Topics: Video Analysis and Summarization · Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques
