Video ReCap: Recursive Captioning of Hour-Long Videos
Md Mohaiminul Islam, Ngan Ho, Xitong Yang, Tushar Nagarajan, Lorenzo Torresani, Gedas Bertasius

TL;DR
Video ReCap introduces a recursive hierarchical model capable of generating multi-level captions for videos ranging from seconds to hours, utilizing curriculum learning and a new long-video dataset.
Contribution
The paper presents a novel recursive captioning architecture for long videos, along with a curriculum learning scheme and the Ego4D-HCap dataset for hierarchical video summarization.
Findings
Effective processing of hour-long videos with hierarchical captions.
Improved performance on complex video understanding tasks.
Availability of a new dataset, Ego4D-HCap, for long-range video summarization.
Abstract
Most video captioning models are designed to process short video clips of a few seconds and output text describing low-level visual concepts (e.g., objects, scenes, atomic actions). However, most real-world videos last for minutes or hours and have a complex hierarchical structure spanning different temporal granularities. We propose Video ReCap, a recursive video captioning model that can process video inputs of dramatically different lengths (from 1 second to 2 hours) and output video captions at multiple hierarchy levels. The recursive video-language architecture exploits the synergy between different video hierarchies and can process hour-long videos efficiently. We utilize a curriculum learning training scheme to learn the hierarchical structure of videos, starting from clip-level captions describing atomic actions, then focusing on segment-level descriptions, and concluding with…
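The recursive idea in the abstract can be illustrated with a minimal sketch: captions produced at one level become the input to the next level up. This is not the authors' implementation. The function names (`chunk`, `recursive_captions`, `toy_caption`, `toy_summarize`), the `clips_per_segment` parameter, and the label for the top level are all assumptions made for illustration (the abstract above is truncated before it names the final level), and plain string functions stand in for the learned video-language model, which would also consume video features.

```python
from typing import Callable, Dict, List, Sequence


def chunk(items: Sequence[str], size: int) -> List[List[str]]:
    """Split items into consecutive groups of at most `size` elements."""
    return [list(items[i:i + size]) for i in range(0, len(items), size)]


def recursive_captions(
    clips: Sequence[str],
    caption_clip: Callable[[str], str],
    summarize: Callable[[List[str]], str],
    clips_per_segment: int = 2,
) -> Dict[str, object]:
    """Build a three-level caption hierarchy (hypothetical helper).

    Level 1 captions each clip, level 2 summarizes groups of clip
    captions, and level 3 summarizes the segment descriptions.
    """
    clip_captions = [caption_clip(clip) for clip in clips]
    segment_descriptions = [
        summarize(group) for group in chunk(clip_captions, clips_per_segment)
    ]
    video_summary = summarize(segment_descriptions)
    return {
        "clips": clip_captions,
        "segments": segment_descriptions,
        "video": video_summary,
    }


# Toy stand-ins for the learned captioner (assumptions, not the real model).
def toy_caption(clip: str) -> str:
    return f"person {clip}"


def toy_summarize(texts: List[str]) -> str:
    return "; ".join(texts)


result = recursive_captions(
    ["cuts onion", "stirs pot", "washes pan", "dries pan"],
    toy_caption,
    toy_summarize,
)
print(result["segments"])
print(result["video"])
```

Because each level only reads the compact output of the level below, the amount of input handled at the top level grows with the number of segments rather than the number of frames, which is one plausible reading of why a recursive design scales to hour-long videos.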
Taxonomy
Topics: Subtitles and Audiovisual Media · Video Analysis and Summarization · Multimodal Machine Learning Applications
