Fine-Grained Captioning of Long Videos through Scene Graph Consolidation

Sanghyeok Chu; Seonguk Seo; Bohyung Han

arXiv:2502.16427 · cs.CV · July 8, 2025

Open Access · 1 Video

TL;DR

This paper presents a novel graph consolidation framework for long video captioning that generates detailed, coherent descriptions without additional fine-tuning, outperforming existing methods in zero-shot settings.

Contribution

Introduces a graph consolidation approach that parses segment-level captions into scene graphs and merges them into a unified graph representation of the long video, improving temporal understanding without extra fine-tuning.

Findings

01 Outperforms existing LLM-based consolidation methods.

02 Achieves strong zero-shot performance.

03 Reduces computational costs significantly.

Abstract

Recent advances in vision-language models have led to impressive progress in caption generation for images and short video clips. However, these models remain constrained by their limited temporal receptive fields, making it difficult to produce coherent and comprehensive captions for long videos. While several methods have been proposed to aggregate information across video segments, they often rely on supervised fine-tuning or incur significant computational overhead. To address these challenges, we introduce a novel framework for long video captioning based on graph consolidation. Our approach first generates segment-level captions, corresponding to individual frames or short video intervals, using off-the-shelf visual captioning models. These captions are then parsed into individual scene graphs, which are subsequently consolidated into a unified graph representation that preserves…
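
The parse-then-consolidate step named in the abstract can be pictured with a minimal Python sketch. It is an illustration under stated assumptions, not the authors' implementation: the `SceneGraph` class and the `consolidate` function are hypothetical names, the two example graphs are invented, and nodes are merged by exact label match only, which is a deliberately naive stand-in for whatever matching the paper actually uses.

```python
from dataclasses import dataclass, field


@dataclass
class SceneGraph:
    """Toy scene graph: object labels, per-object attributes, relation triples."""
    objects: set = field(default_factory=set)
    attributes: dict = field(default_factory=dict)   # object label -> set of attributes
    relations: set = field(default_factory=set)      # (subject, predicate, object) triples


def consolidate(graphs):
    """Merge segment-level graphs into one video-level graph.

    Assumption: two nodes are the same object iff their labels are identical.
    This is a simplification for illustration, not the paper's procedure.
    """
    merged = SceneGraph()
    for g in graphs:
        merged.objects |= g.objects
        for obj, attrs in g.attributes.items():
            # Collect attributes seen for the same object across segments.
            merged.attributes.setdefault(obj, set()).update(attrs)
        merged.relations |= g.relations
    return merged


# Hypothetical graphs, as if parsed from two segment-level captions.
g1 = SceneGraph({"man", "guitar"}, {"man": {"young"}}, {("man", "plays", "guitar")})
g2 = SceneGraph({"man", "stage"}, {"man": {"smiling"}}, {("man", "stands on", "stage")})

video_graph = consolidate([g1, g2])
print(sorted(video_graph.objects))
print(sorted(video_graph.relations))
```

In this sketch the shared "man" node is what links the two segments: its attributes from both captions accumulate on one node, and both relations attach to it. A caption for the whole video would then be generated from the merged graph rather than from any single segment.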

Videos

Fine-Grained Captioning of Long Videos through Scene Graph Consolidation · SlidesLive

Taxonomy

Topics: Video Analysis and Summarization · Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques