MINOTAUR: Multi-task Video Grounding From Multimodal Queries
Raghav Goyal, Effrosyni Mavroudi, Xitong Yang, Sainbayar Sukhbaatar, Leonid Sigal, Matt Feiszli, Lorenzo Torresani, Du Tran

TL;DR
This paper introduces MINOTAUR, a unified multi-task model for video understanding that handles various query types and tasks, improving performance through cross-task learning and generalizing to unseen tasks.
Contribution
The paper presents a single, versatile model capable of addressing multiple video grounding tasks with diverse queries, leveraging cross-task learning for enhanced performance.
Findings
Improved accuracy on the Ego4D Episodic Memory benchmark tasks.
Cross-task training enhances generalization to unseen tasks.
Model supports diverse input modalities and output structures.
Abstract
Video understanding tasks take many forms, from action detection to visual query localization and spatio-temporal grounding of sentences. These tasks differ in the type of inputs (only video, or video-query pair where query is an image region or sentence) and outputs (temporal segments or spatio-temporal tubes). However, at their core they require the same fundamental understanding of the video, i.e., the actors and objects in it, their actions and interactions. So far these tasks have been tackled in isolation with individual, highly specialized architectures, which do not exploit the interplay between tasks. In contrast, in this paper, we present a single, unified model for tackling query-based video understanding in long-form videos. In particular, our model can address all three tasks of the Ego4D Episodic Memory benchmark which entail queries of three different forms: given an…
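The input and output forms listed in the abstract can be pictured as a small set of types. The sketch below is illustrative only: every class and function name is hypothetical, and the frame-to-seconds conversion is an assumption made for the example, not something taken from the paper or its code.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple, Union

# Hypothetical types mirroring the abstract's description of inputs and
# outputs. None of these names come from the MINOTAUR implementation.

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels


@dataclass
class ImageRegionQuery:
    """Query given as an image region (a box in some frame)."""
    frame_index: int
    box: Box


@dataclass
class SentenceQuery:
    """Query given as a natural-language sentence."""
    text: str


# A video-only task has no query, represented here as None.
Query = Optional[Union[ImageRegionQuery, SentenceQuery]]


@dataclass
class TemporalSegment:
    """Output form 1: a time interval in seconds."""
    start_s: float
    end_s: float


@dataclass
class SpatioTemporalTube:
    """Output form 2: one box per frame over a span of frames."""
    boxes: Dict[int, Box]  # frame index -> box


def tube_to_segment(tube: SpatioTemporalTube, fps: float) -> TemporalSegment:
    """Return the temporal extent covered by a tube.

    Assumes frame i spans [i / fps, (i + 1) / fps); this convention is
    chosen for the example only.
    """
    if not tube.boxes:
        raise ValueError("tube has no boxes")
    first, last = min(tube.boxes), max(tube.boxes)
    return TemporalSegment(start_s=first / fps, end_s=(last + 1) / fps)


def query_kind(query: Query) -> str:
    """Name the input form for a given query value."""
    if query is None:
        return "video only"
    if isinstance(query, ImageRegionQuery):
        return "video + image region"
    return "video + sentence"
```

Under these assumptions, a tube is a strictly richer output than a segment: dropping its boxes leaves the temporal extent, which is one way to see why a single model could share most of its computation across the two output forms.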
Taxonomy
Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Domain Adaptation and Few-Shot Learning
