Towards Long Video Understanding via Fine-detailed Video Story Generation

Zeng You; Zhiquan Wen; Yaofo Chen; Xin Li; Runhao Zeng; Yaowei Wang; Mingkui Tan

arXiv:2412.06182·cs.CV·December 12, 2024

Open Access

TL;DR

This paper introduces FDVS, a method that converts long videos into detailed hierarchical textual descriptions, improving understanding by modeling fine-grained content and reducing redundancy, applicable across multiple tasks without fine-tuning.

Contribution

The paper proposes a novel hierarchical textual representation approach with a bottom-up interpretation mechanism and redundancy reduction, enabling versatile long video understanding.

Findings

01. Effective across eight datasets and three tasks.

02. Improves long-context relationship modeling.

03. Reduces redundancy at visual and textual levels.

Abstract

Long video understanding has become a critical task in computer vision, driving advancements across numerous applications from surveillance to content retrieval. Existing video understanding methods suffer from two challenges when dealing with long video understanding: intricate long-context relationship modeling and interference from redundancy. To tackle these challenges, we introduce Fine-Detailed Video Story generation (FDVS), which interprets long videos into detailed textual representations. Specifically, to achieve fine-grained modeling of long-temporal content, we propose a Bottom-up Video Interpretation Mechanism that progressively interprets video content from clips to video. To avoid interference from redundant information in videos, we introduce a Semantic Redundancy Reduction mechanism that removes redundancy at both the visual and textual levels. Our method transforms long…
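The two ideas named in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: it assumes clip captions are already available as strings, uses a bag-of-words cosine similarity as a stand-in for a real encoder, handles redundancy only at the textual level, and replaces the summarization step with plain concatenation. All function names and the `threshold` and `clips_per_segment` parameters are hypothetical.

```python
import math
from collections import Counter


def bow(text):
    """Toy bag-of-words vector; a stand-in (assumption) for a real encoder."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two Counter vectors."""
    dot = sum(a[k] * b[k] for k in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def drop_redundant(texts, threshold=0.8):
    """Keep a text only if it differs enough from the last kept one."""
    kept = []
    for t in texts:
        if not kept or cosine(bow(kept[-1]), bow(t)) < threshold:
            kept.append(t)
    return kept


def join_summarize(texts):
    """Placeholder summarizer (assumption): concatenation, not a language model."""
    return " ".join(texts)


def build_story(clip_captions, clips_per_segment=2, summarize=join_summarize):
    """Bottom-up pass: clip captions -> segment descriptions -> video-level story."""
    clips = drop_redundant(clip_captions)
    segments = [
        summarize(clips[i:i + clips_per_segment])
        for i in range(0, len(clips), clips_per_segment)
    ]
    return {"clips": clips, "segments": segments, "video": summarize(segments)}


captions = [
    "a man opens the fridge",
    "a man opens the fridge",
    "he pours milk into a glass",
    "he drinks the milk",
]
story = build_story(captions)
```

Here the repeated caption is dropped before grouping, so the hierarchy is built from three clip descriptions rather than four. In practice the similarity function and the summarizer would be swapped for learned models; the sketch only shows the order of operations.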

Taxonomy

Topics: Video Analysis and Summarization · Human Pose and Action Recognition · Multimodal Machine Learning Applications