End-to-End Streaming Video Temporal Action Segmentation with Reinforce Learning
Jinrong Zhang, Wujun Wen, Shenglan Liu, Yunheng Li, Qifeng Li, Lin Feng

TL;DR
This paper introduces SVTAS-RL, an end-to-end, reinforcement-learning-based model for streaming video temporal action segmentation. It targets the challenges of online segmentation and outperforms existing STAS models on multiple datasets.
Contribution
The paper proposes a novel end-to-end streaming model with reinforcement learning to improve online temporal action segmentation performance.
Findings
SVTAS-RL outperforms existing STAS models significantly.
Achieves competitive results with state-of-the-art TAS models.
Demonstrates advantages on the ultra-long video dataset EGTEA.
Abstract
The streaming temporal action segmentation (STAS) task, a supplementary task of temporal action segmentation (TAS), has not received adequate attention in the field of video understanding. Existing TAS methods are constrained to offline scenarios due to their heavy reliance on multimodal features and complete contextual information. The STAS task requires the model to classify each frame of the entire untrimmed video sequence clip by clip in time, thereby extending the applicability of TAS methods to online scenarios. However, directly applying existing TAS methods to STAS tasks results in significantly poor segmentation outcomes. In this paper, we thoroughly analyze the fundamental differences between STAS tasks and TAS tasks, attributing the severe performance degradation when transferring models to model bias and optimization dilemmas. We introduce an end-to-end streaming video…
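The clip-by-clip protocol the abstract describes can be illustrated with a minimal sketch. This is not the SVTAS-RL model: every name here (`iter_clips`, `stream_segment`, `toy_classifier`) is hypothetical, frames are stand-in integers, and the classifier is a toy rule. The sketch only shows the streaming constraint, namely that each clip is labeled using frames seen so far, and that per-frame labels are accumulated in temporal order.

```python
from typing import Callable, Iterator, List, Sequence

Frame = int  # hypothetical stand-in for a decoded video frame


def iter_clips(frames: Sequence[Frame], clip_len: int) -> Iterator[Sequence[Frame]]:
    """Yield consecutive, non-overlapping clips in temporal order."""
    for start in range(0, len(frames), clip_len):
        yield frames[start:start + clip_len]


def stream_segment(
    frames: Sequence[Frame],
    clip_len: int,
    classify_clip: Callable[[Sequence[Frame]], List[str]],
) -> List[str]:
    """Label every frame of a video, one clip at a time.

    The classifier only ever receives the current clip, never future
    frames, which is what separates the streaming setting from offline TAS.
    """
    labels: List[str] = []
    for clip in iter_clips(frames, clip_len):
        clip_labels = classify_clip(clip)
        if len(clip_labels) != len(clip):
            raise ValueError("classifier must return one label per frame")
        labels.extend(clip_labels)
    return labels


def toy_classifier(clip: Sequence[Frame]) -> List[str]:
    """Toy rule (an assumption for illustration): frames below 5 are action 'A'."""
    return ["A" if frame < 5 else "B" for frame in clip]


video = list(range(8))  # 8 fake frames
pred = stream_segment(video, clip_len=3, classify_clip=toy_classifier)
```

With 8 frames and a clip length of 3, the video is processed as clips of 3, 3, and 2 frames, and `pred` holds one label per frame. A real model would replace `toy_classifier` and would typically carry some state between clips.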
Taxonomy
Topics: Human Pose and Action Recognition · Video Analysis and Summarization · Video Surveillance and Tracking Methods
Methods: Contrastive Language-Image Pre-training
