End-to-End Streaming Video Temporal Action Segmentation with Reinforce Learning

Jinrong Zhang; Wujun Wen; Shenglan Liu; Yunheng Li; Qifeng Li; Lin Feng

arXiv:2309.15683·cs.CV·May 24, 2024

TL;DR

This paper introduces SVTAS-RL, an end-to-end reinforcement learning-based model for streaming video temporal action segmentation, effectively addressing online segmentation challenges and outperforming existing methods on multiple datasets.

Contribution

The paper proposes a novel end-to-end streaming model with reinforcement learning to improve online temporal action segmentation performance.

Findings

01. SVTAS-RL significantly outperforms existing STAS models.

02. Achieves results competitive with state-of-the-art TAS models.

03. Demonstrates advantages on the ultra-long video dataset EGTEA.

Abstract

The streaming temporal action segmentation (STAS) task, a supplementary task of temporal action segmentation (TAS), has not received adequate attention in the field of video understanding. Existing TAS methods are constrained to offline scenarios because they rely heavily on multimodal features and complete contextual information. The STAS task requires the model to classify each frame of the entire untrimmed video sequence clip by clip, in temporal order, thereby extending the applicability of TAS methods to online scenarios. However, directly applying existing TAS methods to STAS tasks results in significantly poorer segmentation outcomes. In this paper, we thoroughly analyze the fundamental differences between STAS tasks and TAS tasks, and attribute the severe performance degradation observed when transferring models to model bias and optimization dilemmas. We introduce an end-to-end streaming video…
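To make the clip-by-clip setting described in the abstract concrete, the following is a minimal sketch of streaming inference, not the SVTAS-RL model itself. It assumes non-overlapping clips of a fixed length and a per-clip classifier that sees only the current clip; the names `iter_clips`, `stream_segment`, and `toy_classifier` are hypothetical and chosen for illustration only.

```python
from typing import Callable, Iterator, List, Sequence


def iter_clips(num_frames: int, clip_len: int) -> Iterator[range]:
    """Yield consecutive, non-overlapping frame-index ranges in temporal order."""
    if clip_len <= 0:
        raise ValueError("clip_len must be positive")
    for start in range(0, num_frames, clip_len):
        yield range(start, min(start + clip_len, num_frames))


def stream_segment(
    frames: Sequence[float],
    clip_len: int,
    classify_clip: Callable[[List[float]], List[str]],
) -> List[str]:
    """Label every frame of a video while only ever looking at one clip at a time."""
    labels: List[str] = []
    for indices in iter_clips(len(frames), clip_len):
        clip = [frames[i] for i in indices]
        clip_labels = classify_clip(clip)  # no access to future clips
        if len(clip_labels) != len(clip):
            raise ValueError("classifier must return one label per frame")
        labels.extend(clip_labels)
    return labels


def toy_classifier(clip: List[float]) -> List[str]:
    """Hypothetical stand-in for a learned model: threshold each frame value."""
    return ["high" if value >= 0.5 else "low" for value in clip]


# A 7-frame "video" of scalar features, processed in clips of 3 frames.
video = [0.1, 0.2, 0.9, 0.8, 0.7, 0.3, 0.4]
result = stream_segment(video, clip_len=3, classify_clip=toy_classifier)
print(result)
```

An offline TAS model would instead receive all frames at once; restricting the classifier to the current clip is what distinguishes the streaming setting in this sketch.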

Code & Models

Repositories

Thinksky5124/SVTAS
PyTorch · Official

Taxonomy

Topics: Human Pose and Action Recognition · Video Analysis and Summarization · Video Surveillance and Tracking Methods

Methods: Contrastive Language-Image Pre-training