Video ReCap: Recursive Captioning of Hour-Long Videos

Md Mohaiminul Islam; Ngan Ho; Xitong Yang; Tushar Nagarajan; Lorenzo Torresani; Gedas Bertasius

arXiv:2402.13250·cs.CV·May 20, 2024·1 cites

Open Access 2 Repos

TL;DR

Video ReCap introduces a recursive hierarchical model capable of generating multi-level captions for videos ranging from seconds to hours, utilizing curriculum learning and a new long-video dataset.

Contribution

The paper presents a novel recursive captioning architecture for long videos, along with a curriculum learning scheme and the Ego4D-HCap dataset for hierarchical video summarization.

Findings

01 Effective processing of hour-long videos with hierarchical captions.

02 Improved performance on complex video understanding tasks.

03 Availability of a new dataset for long-range video summarization.

Abstract

Most video captioning models are designed to process short video clips of a few seconds and output text describing low-level visual concepts (e.g., objects, scenes, atomic actions). However, most real-world videos last for minutes or hours and have a complex hierarchical structure spanning different temporal granularities. We propose Video ReCap, a recursive video captioning model that can process video inputs of dramatically different lengths (from 1 second to 2 hours) and output video captions at multiple hierarchy levels. The recursive video-language architecture exploits the synergy between different video hierarchies and can process hour-long videos efficiently. We utilize a curriculum learning training scheme to learn the hierarchical structure of videos, starting from clip-level captions describing atomic actions, then focusing on segment-level descriptions, and concluding with…
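
The recursive idea in the abstract can be illustrated with a small sketch: captions produced at one level are fed, together with the video, into the captioner for the next level up. This is only an illustration under stated assumptions, not the authors' implementation. The names `recursive_captions`, `toy_captioner`, and `clips_per_segment` are hypothetical, the captioner is a stub standing in for a real video-language model, and the top whole-video level is assumed because the abstract is truncated at that point.

```python
from typing import Callable, Dict, List, Sequence

# Hypothetical captioner interface (an assumption for this sketch):
# it receives the video units for a span plus the captions already
# produced one level below, and returns a single caption string.
Captioner = Callable[[Sequence[str], Sequence[str]], str]


def chunk(items: Sequence[str], size: int) -> List[Sequence[str]]:
    """Split a sequence into consecutive groups of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def recursive_captions(
    clips: Sequence[str],
    captioner: Captioner,
    clips_per_segment: int = 3,
) -> Dict[str, object]:
    """Caption a video bottom-up, reusing lower-level captions at each step."""
    # Level 1: each short clip is captioned from the video alone.
    clip_caps = [captioner([clip], []) for clip in clips]

    # Level 2: each segment sees its clips and their clip-level captions.
    segment_caps = [
        captioner(seg_clips, seg_caps)
        for seg_clips, seg_caps in zip(
            chunk(clips, clips_per_segment),
            chunk(clip_caps, clips_per_segment),
        )
    ]

    # Level 3 (assumed): the whole video sees the segment-level captions.
    video_cap = captioner(clips, segment_caps)

    return {"clip": clip_caps, "segment": segment_caps, "video": video_cap}


def toy_captioner(video: Sequence[str], lower: Sequence[str]) -> str:
    """Stub that only reports its input sizes; a real model would go here."""
    return f"caption({len(video)} clips, {len(lower)} lower-level captions)"
```

Calling `recursive_captions(["c0", "c1", "c2", "c3", "c4", "c5", "c6"], toy_captioner)` yields seven clip-level captions, three segment-level captions, and one top-level caption. The point of the design is that the top level reads a handful of segment captions rather than every clip caption, which is one plausible reason a recursive scheme scales to long inputs.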

Taxonomy

Topics: Subtitles and Audiovisual Media · Video Analysis and Summarization · Multimodal Machine Learning Applications