MINOTAUR: Multi-task Video Grounding From Multimodal Queries

Raghav Goyal; Effrosyni Mavroudi; Xitong Yang; Sainbayar Sukhbaatar; Leonid Sigal; Matt Feiszli; Lorenzo Torresani; Du Tran

arXiv:2302.08063·cs.CV·March 21, 2023·1 cites

Open Access · 1 Repo

TL;DR

This paper introduces MINOTAUR, a unified multi-task model for video understanding that handles various query types and tasks, improving performance through cross-task learning and generalizing to unseen tasks.

Contribution

The paper presents a single, versatile model capable of addressing multiple video grounding tasks with diverse queries, leveraging cross-task learning for enhanced performance.

Findings

01. Improved accuracy on the Ego4D Episodic Memory benchmark tasks.
02. Cross-task training enhances generalization to unseen tasks.
03. Model supports diverse input modalities and output structures.

Abstract

Video understanding tasks take many forms, from action detection to visual query localization and spatio-temporal grounding of sentences. These tasks differ in the type of inputs (only video, or video-query pair where query is an image region or sentence) and outputs (temporal segments or spatio-temporal tubes). However, at their core they require the same fundamental understanding of the video, i.e., the actors and objects in it, their actions and interactions. So far these tasks have been tackled in isolation with individual, highly specialized architectures, which do not exploit the interplay between tasks. In contrast, in this paper, we present a single, unified model for tackling query-based video understanding in long-form videos. In particular, our model can address all three tasks of the Ego4D Episodic Memory benchmark which entail queries of three different forms: given an…
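The input and output categories listed in the abstract can be written down as plain data types. The sketch below is only an illustration of that taxonomy: every class, field, and function name is hypothetical and is not taken from the authors' repository, and the box format and inclusive frame indexing are assumptions.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple, Union

# Hypothetical types mirroring the abstract's input/output categories.
# None of these names come from the MINOTAUR code base.

Box = Tuple[float, float, float, float]  # assumed (x1, y1, x2, y2) layout


@dataclass
class ImageRegionQuery:
    """A query given as a region of an image (visual query)."""
    frame_index: int
    box: Box


@dataclass
class SentenceQuery:
    """A query given as a natural-language sentence."""
    text: str


# None stands for the "only video" case, where no query is supplied.
Query = Optional[Union[ImageRegionQuery, SentenceQuery]]


@dataclass
class TemporalSegment:
    """Output localized in time only."""
    start_frame: int
    end_frame: int  # assumed inclusive


@dataclass
class SpatioTemporalTube:
    """Output localized in time and space: one box per consecutive frame."""
    start_frame: int
    boxes: List[Box]

    def to_segment(self) -> TemporalSegment:
        # Dropping the boxes leaves the temporal extent of the tube.
        if not self.boxes:
            raise ValueError("a tube needs at least one box")
        return TemporalSegment(self.start_frame,
                               self.start_frame + len(self.boxes) - 1)


def describe_input(query: Query) -> str:
    """Name the input form for a given query."""
    if query is None:
        return "video only"
    if isinstance(query, ImageRegionQuery):
        return "video + image-region query"
    return "video + sentence query"
```

Under these assumptions, a tube is a temporal segment with extra spatial detail, which is one way to picture how a single model could emit both output structures.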

Code & Models

Repositories

raghavgoyal14/minotaur/tree/rgv/all-tasks
PyTorch · Official

Taxonomy

Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Domain Adaptation and Few-Shot Learning