Temporal2Seq: A Unified Framework for Temporal Video Understanding Tasks

Min Yang; Zichen Zhang; Limin Wang

arXiv:2409.18478·cs.CV·September 30, 2024

Open Access

TL;DR

Temporal2Seq introduces a unified sequence-based framework for multiple temporal video understanding tasks, enabling a single model to perform TAD, TAS, and GEBD with competitive results and strong generalization capabilities.

Contribution

Temporal2Seq is presented as the first framework to unify multiple temporal video understanding tasks in a single sequence-to-sequence model, trained on a comprehensive co-training dataset.

Findings

01. Temporal2Seq achieves competitive results across TAD, TAS, and GEBD tasks.

02. The unified model outperforms single-task models in multi-task settings.

03. The model demonstrates strong generalization to new datasets from different tasks.

Abstract

With the development of video understanding, tasks for clip-level temporal video analysis have proliferated, including temporal action detection (TAD), temporal action segmentation (TAS), and generic event boundary detection (GEBD). While task-specific video understanding models have exhibited outstanding performance on each task, a unified framework capable of addressing multiple tasks simultaneously is still lacking, and building one is a promising direction for the next generation of AI. To this end, we propose a single unified framework, coined Temporal2Seq, that formulates the output of these temporal video understanding tasks as a sequence of discrete tokens. With this unified token representation, Temporal2Seq can train a generalist model within a single architecture on different video understanding tasks. In the absence of multi-task learning (MTL)…
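
To make the idea of "output as a sequence of discrete tokens" concrete, the following is a minimal sketch of how TAD-style segments could be turned into integer tokens and back. It is an illustration under stated assumptions, not the paper's implementation: the bin count `NUM_BINS`, the token order (start, end, class), the class-token offset, and the helper names `quantize`, `segments_to_tokens`, and `tokens_to_segments` are all hypothetical, since the excerpt above does not specify the vocabulary.

```python
from typing import List, Tuple

# Hypothetical vocabulary layout (not taken from the paper):
# tokens 0 .. NUM_BINS-1 encode quantized timestamps,
# tokens NUM_BINS .. encode action class ids.
NUM_BINS = 100

Segment = Tuple[float, float, int]  # (start_sec, end_sec, class_id)


def quantize(t: float, duration: float, num_bins: int = NUM_BINS) -> int:
    """Map a timestamp in [0, duration] to an integer bin in [0, num_bins - 1]."""
    frac = min(max(t / duration, 0.0), 1.0)  # clamp to the clip
    return min(int(frac * num_bins), num_bins - 1)


def segments_to_tokens(segments: List[Segment], duration: float) -> List[int]:
    """Flatten segments into one token sequence: start, end, class, start, ..."""
    tokens: List[int] = []
    for start, end, label in segments:
        tokens.append(quantize(start, duration))
        tokens.append(quantize(end, duration))
        tokens.append(NUM_BINS + label)  # shift class ids past the time bins
    return tokens


def tokens_to_segments(tokens: List[int], duration: float) -> List[Segment]:
    """Invert segments_to_tokens, decoding each time bin to its centre."""
    segments: List[Segment] = []
    for i in range(0, len(tokens) - 2, 3):
        s_bin, e_bin, c_tok = tokens[i:i + 3]
        start = (s_bin + 0.5) / NUM_BINS * duration
        end = (e_bin + 0.5) / NUM_BINS * duration
        segments.append((start, end, c_tok - NUM_BINS))
    return segments


if __name__ == "__main__":
    clip_duration = 10.0
    gt = [(2.5, 5.0, 3), (7.5, 10.0, 0)]
    toks = segments_to_tokens(gt, clip_duration)
    print(toks)
    print(tokens_to_segments(toks, clip_duration))
```

The round trip is lossy: each decoded timestamp can differ from the original by up to one bin width (`duration / NUM_BINS`), so the bin count trades sequence vocabulary size against temporal precision. GEBD boundaries or TAS frame labels would need their own token layouts, which this sketch does not attempt to cover.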

Taxonomy

Topics: Human Pose and Action Recognition · Multimodal Machine Learning Applications · Explainable Artificial Intelligence (XAI)