Temporal2Seq: A Unified Framework for Temporal Video Understanding Tasks
Min Yang, Zichen Zhang, Limin Wang

TL;DR
Temporal2Seq is a unified sequence-based framework for temporal video understanding: a single model performs temporal action detection (TAD), temporal action segmentation (TAS), and generic event boundary detection (GEBD) with competitive results and generalizes to new datasets.
Contribution
Temporal2Seq is presented as the first framework to unify multiple temporal video understanding tasks in a single sequence-to-sequence model, trained on a comprehensive co-training dataset.
Findings
Temporal2Seq achieves competitive results across TAD, TAS, and GEBD tasks.
The jointly trained generalist model outperforms single-task training within the same framework.
The model demonstrates strong generalization to new datasets from different tasks.
Abstract
With the development of video understanding, there is a proliferation of tasks for clip-level temporal video analysis, including temporal action detection (TAD), temporal action segmentation (TAS), and generic event boundary detection (GEBD). While task-specific video understanding models have exhibited outstanding performance in each task, there is still no unified framework capable of addressing multiple tasks simultaneously, which is a promising direction for the next generation of AI. To this end, in this paper, we propose a single unified framework, coined Temporal2Seq, to formulate the output of these temporal video understanding tasks as a sequence of discrete tokens. With this unified token representation, Temporal2Seq can train a generalist model within a single architecture on different video understanding tasks. In the absence of multi-task learning (MTL)…
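To make the idea of "a sequence of discrete tokens" concrete, here is a minimal sketch of how temporal annotations could be serialized. It is not the authors' tokenization scheme: the bin count `NUM_BINS`, the class-token offset, and the helper names `quantize`, `segments_to_tokens`, and `boundaries_to_tokens` are all assumptions made for illustration, loosely following the common practice of quantizing continuous coordinates into integer bins.

```python
from typing import List, Tuple

NUM_BINS = 100  # hypothetical number of time bins (assumption, not from the paper)


def quantize(t: float, duration: float, num_bins: int = NUM_BINS) -> int:
    """Map a timestamp in [0, duration] to an integer bin in [0, num_bins - 1]."""
    frac = min(max(t / duration, 0.0), 1.0)  # clamp to the video extent
    return min(int(frac * num_bins), num_bins - 1)


def segments_to_tokens(
    segments: List[Tuple[float, float, int]],
    duration: float,
    num_bins: int = NUM_BINS,
) -> List[int]:
    """Serialize (start, end, class_id) segments as [start_bin, end_bin, class_token, ...].

    Class tokens are offset by num_bins so they never collide with time tokens
    (an illustrative vocabulary layout, assumed here).
    """
    tokens: List[int] = []
    for start, end, class_id in segments:
        tokens.append(quantize(start, duration, num_bins))
        tokens.append(quantize(end, duration, num_bins))
        tokens.append(num_bins + class_id)
    return tokens


def boundaries_to_tokens(
    boundaries: List[float],
    duration: float,
    num_bins: int = NUM_BINS,
) -> List[int]:
    """Serialize class-agnostic event boundaries as a list of time tokens."""
    return [quantize(b, duration, num_bins) for b in boundaries]
```

Under these assumptions, a TAD- or TAS-style annotation `segments_to_tokens([(2.5, 5.0, 3)], duration=10.0)` yields `[25, 50, 103]`, and a GEBD-style annotation `boundaries_to_tokens([0.0, 10.0], duration=10.0)` yields `[0, 99]`. Because both outputs share one integer vocabulary, a single sequence decoder could in principle be trained on all three tasks.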
Taxonomy
Topics: Human Pose and Action Recognition · Multimodal Machine Learning Applications · Explainable Artificial Intelligence (XAI)
