Hierarchical Video-Moment Retrieval and Step-Captioning
Abhay Zala, Jaemin Cho, Satwik Kottur, Xilun Chen, Barlas Oğuz, Yashar Mehdad, Mohit Bansal

TL;DR
This paper introduces HiREST, a new hierarchical benchmark and dataset for end-to-end video retrieval, moment segmentation, and step captioning in instructional videos, enabling integrated search and summarization.
Contribution
The paper presents a novel dataset and benchmark that unify video retrieval, moment segmentation, and step captioning tasks in instructional videos.
Findings
Baseline models show promising but limited performance.
There is large room for improvement in hierarchical video understanding.
End-to-end models can enhance search and summarization capabilities.
Abstract
There is growing interest in searching for information from large video corpora. Prior works have studied relevant tasks, such as text-based video retrieval, moment retrieval, video summarization, and video captioning in isolation, without an end-to-end setup that can jointly search from video corpora and generate summaries. Such an end-to-end setup would allow for many interesting applications, e.g., a text-based search that finds a relevant video from a video corpus, extracts the most relevant moment from that video, and segments the moment into important steps with captions. To address this, we present the HiREST (HIerarchical REtrieval and STep-captioning) dataset and propose a new benchmark that covers hierarchical information retrieval and visual/textual stepwise summarization from an instructional video corpus. HiREST consists of 3.4K text-video pairs from an instructional video…
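The hierarchy the abstract describes (retrieve a video, locate the relevant moment, then split that moment into captioned steps) can be illustrated with a small sketch. This is not the authors' code or the HiREST data format: the `VideoEntry` and `Step` classes, the `hierarchical_search` function, the toy corpus, and the word-overlap scoring are all hypothetical stand-ins chosen to keep the example self-contained.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Step:
    start: float   # step start time in seconds
    end: float     # step end time in seconds
    caption: str   # short text describing the step


@dataclass
class VideoEntry:
    video_id: str
    description: str               # text used for retrieval (assumption)
    moment: Tuple[float, float]    # span of the query-relevant moment
    steps: List[Step] = field(default_factory=list)


def token_overlap(query: str, text: str) -> int:
    """Toy relevance score: number of shared lowercase words."""
    return len(set(query.lower().split()) & set(text.lower().split()))


def hierarchical_search(query: str, corpus: List[VideoEntry]) -> dict:
    # Level 1: video retrieval (word overlap stands in for a learned model).
    best = max(corpus, key=lambda v: token_overlap(query, v.description))
    # Level 2: moment retrieval (here we simply read a stored span).
    start, end = best.moment
    # Level 3: keep only the steps that fall inside the moment.
    steps = [s for s in best.steps if s.start >= start and s.end <= end]
    return {"video_id": best.video_id, "moment": (start, end), "steps": steps}


# Hypothetical two-video corpus, invented for illustration only.
corpus = [
    VideoEntry(
        video_id="v1",
        description="make fluffy pancakes from scratch",
        moment=(30.0, 120.0),
        steps=[
            Step(0.0, 30.0, "channel intro"),
            Step(30.0, 60.0, "mix the batter"),
            Step(60.0, 120.0, "cook on a griddle"),
        ],
    ),
    VideoEntry(
        video_id="v2",
        description="change a flat bike tire",
        moment=(10.0, 90.0),
        steps=[Step(10.0, 90.0, "remove and replace the tube")],
    ),
]

result = hierarchical_search("how to make pancakes", corpus)
for step in result["steps"]:
    print(f"{step.start:>6.1f}-{step.end:<6.1f} {step.caption}")
```

In a real system each level would be a learned model rather than a lookup; the sketch only shows how the output of one level narrows the input of the next.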
Taxonomy
Topics: Multimodal Machine Learning Applications · Video Analysis and Summarization · Advanced Image and Video Retrieval Techniques
