How Much Temporal Long-Term Context is Needed for Action Segmentation?
Emad Bahrami, Gianpiero Francesca, Juergen Gall

TL;DR
This paper investigates the amount of long-term temporal context needed for effective action segmentation in videos, proposing a sparse attention transformer model that captures full video context and outperforms existing methods.
Contribution
Introduces a transformer-based model with sparse attention to efficiently capture full video context for action segmentation, demonstrating its superiority over local-window approaches.
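To make the contrast between local-window and sparse attention concrete, here is a minimal sketch in plain Python. It is not the authors' architecture: the summary above does not specify the sparsity pattern, so the pattern used here (a local window combined with strided links) is an assumption chosen only for illustration, and the names `local_window_mask`, `strided_mask`, and `masked_attention` are hypothetical.

```python
import math

def local_window_mask(T, window):
    """Frame i may attend to frame j only if they lie within window // 2 frames."""
    half = window // 2
    return [[abs(i - j) <= half for j in range(T)] for i in range(T)]

def strided_mask(T, stride):
    """Frame i may attend to every frame whose index differs by a multiple of stride."""
    return [[(i - j) % stride == 0 for j in range(T)] for i in range(T)]

def union(a, b):
    """Combine two boolean masks element-wise (logical OR)."""
    return [[x or y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def count_pairs(mask):
    """Number of (query, key) pairs the mask allows, a proxy for attention cost."""
    return sum(sum(row) for row in mask)

def masked_attention(q, k, v, mask):
    """Scaled dot-product attention evaluated only over the pairs allowed by mask."""
    d = len(q[0])
    out = []
    for i, qi in enumerate(q):
        allowed = [j for j, ok in enumerate(mask[i]) if ok]
        scores = [sum(a * b for a, b in zip(qi, k[j])) / math.sqrt(d) for j in allowed]
        top = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        out.append([
            sum(w * v[j][c] for w, j in zip(weights, allowed)) / total
            for c in range(len(v[0]))
        ])
    return out

# Illustrative sizes only (assumed, not taken from the paper).
T, d, window, stride = 64, 4, 8, 8
x = [[math.sin(0.1 * t * (c + 1)) for c in range(d)] for t in range(T)]

local = local_window_mask(T, window)
sparse = union(local, strided_mask(T, stride))
out = masked_attention(x, x, x, sparse)

print("full attention pairs:  ", T * T)
print("local window pairs:    ", count_pairs(local))
print("local + strided pairs: ", count_pairs(sparse))
```

In this toy pattern the local mask never links frame 0 to frame 56, while the combined mask does, at a fraction of the cost of attending to all T × T pairs. That is the general trade-off the contribution describes, independent of the exact pattern the paper uses.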
Findings
Full context modeling improves segmentation accuracy.
Sparse attention effectively captures long-term dependencies.
The model outperforms state-of-the-art methods on three datasets.
Abstract
Modeling long-term context in videos is crucial for many fine-grained tasks including temporal action segmentation. An interesting question that is still open is how much long-term temporal context is needed for optimal performance. While transformers can model the long-term context of a video, this becomes computationally prohibitive for long videos. Recent works on temporal action segmentation thus combine temporal convolutional networks with self-attentions that are computed only for a local temporal window. While these approaches show good results, their performance is limited by their inability to capture the full context of a video. In this work, we try to answer how much long-term temporal context is required for temporal action segmentation by introducing a transformer-based model that leverages sparse attention to capture the full context of a video. We compare our model with…
Videos
How Much Temporal Long-Term Context is Needed for Action Segmentation? (YouTube)
Taxonomy
Topics: Human Pose and Action Recognition · Video Analysis and Summarization · Multimodal Machine Learning Applications
