Towards Long Video Understanding via Fine-detailed Video Story Generation
Zeng You, Zhiquan Wen, Yaofo Chen, Xin Li, Runhao Zeng, Yaowei Wang, Mingkui Tan

TL;DR
This paper introduces FDVS, a method that converts long videos into detailed, hierarchical textual descriptions. It models fine-grained content, reduces redundancy, and applies to multiple tasks without fine-tuning.
Contribution
The paper proposes a novel hierarchical textual representation approach with a bottom-up interpretation mechanism and redundancy reduction, enabling versatile long video understanding.
Findings
Effective across eight datasets and three tasks.
Improves long-context relationship modeling.
Reduces redundancy at visual and textual levels.
Abstract
Long video understanding has become a critical task in computer vision, driving advancements across numerous applications from surveillance to content retrieval. Existing video understanding methods suffer from two challenges when dealing with long video understanding: intricate long-context relationship modeling and interference from redundancy. To tackle these challenges, we introduce Fine-Detailed Video Story generation (FDVS), which interprets long videos into detailed textual representations. Specifically, to achieve fine-grained modeling of long-temporal content, we propose a Bottom-up Video Interpretation Mechanism that progressively interprets video content from clips to video. To avoid interference from redundant information in videos, we introduce a Semantic Redundancy Reduction mechanism that removes redundancy at both the visual and textual levels. Our method transforms long…
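The two ideas named in the abstract, bottom-up interpretation from clips to video and redundancy reduction, can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes clip-level captions are already available as strings, uses token-overlap (Jaccard) similarity as a stand-in for whatever semantic similarity the paper uses, and uses plain string joining as a stand-in for a language-model summarizer. The names `jaccard`, `drop_redundant`, `bottom_up_story`, `group_size`, and `threshold` are hypothetical and chosen for illustration only.

```python
from typing import Callable, List


def jaccard(a: str, b: str) -> float:
    """Token-overlap similarity; an assumed stand-in for a learned semantic similarity."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def drop_redundant(captions: List[str], threshold: float = 0.8) -> List[str]:
    """Keep a caption only if it differs enough from the most recently kept one."""
    kept: List[str] = []
    for cap in captions:
        if not kept or jaccard(kept[-1], cap) < threshold:
            kept.append(cap)
    return kept


def bottom_up_story(
    clip_captions: List[str],
    group_size: int = 2,
    summarize: Callable[[List[str]], str] = " ".join,  # assumed placeholder for an LLM summarizer
    threshold: float = 0.8,
) -> str:
    """Merge clip-level text into progressively coarser levels until one story remains."""
    if group_size < 2:
        raise ValueError("group_size must be at least 2 so each level shrinks")
    level = drop_redundant(clip_captions, threshold)
    while len(level) > 1:
        # Summarize each group of neighbouring descriptions into one coarser description.
        merged = [
            summarize(level[i:i + group_size])
            for i in range(0, len(level), group_size)
        ]
        level = drop_redundant(merged, threshold)
    return level[0] if level else ""


if __name__ == "__main__":
    captions = [
        "a man opens the door",
        "a man opens the door",  # near-duplicate clip, dropped before merging
        "he sits at a table",
        "he eats dinner",
    ]
    print(bottom_up_story(captions))
```

In this sketch the duplicate clip caption is filtered before any merging, so later levels only summarize distinct content. Swapping `summarize` for a call to a real captioning or language model is left to the reader.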
Taxonomy
Topics: Video Analysis and Summarization · Human Pose and Action Recognition · Multimodal Machine Learning Applications
