How Much Temporal Long-Term Context is Needed for Action Segmentation?

Emad Bahrami; Gianpiero Francesca; Juergen Gall

arXiv:2308.11358 · cs.CV · September 26, 2023 · 1 citation

TL;DR

This paper investigates the amount of long-term temporal context needed for effective action segmentation in videos, proposing a sparse attention transformer model that captures full video context and outperforms existing methods.

Contribution

Introduces a transformer-based model with sparse attention to efficiently capture full video context for action segmentation, demonstrating its superiority over local-window approaches.

Findings

01 Full context modeling improves segmentation accuracy.

02 Sparse attention effectively captures long-term dependencies.

03 Model outperforms state-of-the-art on three datasets.

Abstract

Modeling long-term context in videos is crucial for many fine-grained tasks including temporal action segmentation. An interesting question that is still open is how much long-term temporal context is needed for optimal performance. While transformers can model the long-term context of a video, this becomes computationally prohibitive for long videos. Recent works on temporal action segmentation thus combine temporal convolutional networks with self-attentions that are computed only for a local temporal window. While these approaches show good results, their performance is limited by their inability to capture the full context of a video. In this work, we try to answer how much long-term temporal context is required for temporal action segmentation by introducing a transformer-based model that leverages sparse attention to capture the full context of a video. We compare our model with…
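The contrast the abstract draws between attention restricted to a local temporal window and sparse attention that spans the whole video can be illustrated with a small sketch. This is not the authors' implementation: the strided pattern below is just one common way to make attention sparse, and the function names, the video length `T`, and the window/stride size `W` are hypothetical values chosen for illustration.

```python
import numpy as np


def local_window_mask(num_frames, window):
    """mask[i, j] is True when frames i and j lie in the same non-overlapping window."""
    block = np.arange(num_frames) // window
    return block[:, None] == block[None, :]


def strided_sparse_mask(num_frames, stride):
    """mask[i, j] is True when i and j share the same offset modulo `stride`,
    so every frame attends to every stride-th frame across the whole video."""
    offset = np.arange(num_frames) % stride
    return offset[:, None] == offset[None, :]


def masked_self_attention(x, mask):
    """Single-head scaled dot-product self-attention restricted to `mask`."""
    dim = x.shape[-1]
    scores = x @ x.T / np.sqrt(dim)
    scores = np.where(mask, scores, -np.inf)      # block disallowed pairs
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ x


def max_temporal_span(mask):
    """Largest frame distance |i - j| over all allowed attention pairs."""
    rows, cols = np.nonzero(mask)
    return int(np.abs(rows - cols).max())


# Hypothetical sizes: a 1024-frame video, window and stride of 64 frames.
T, W = 1024, 64
local = local_window_mask(T, W)
sparse = strided_sparse_mask(T, W)

rng = np.random.default_rng(0)
features = rng.standard_normal((T, 16))  # stand-in for per-frame features
output = masked_self_attention(features, sparse)

print("allowed pairs (local, sparse, full):", local.sum(), sparse.sum(), T * T)
print("max span (local, sparse):", max_temporal_span(local), max_temporal_span(sparse))
```

With these sizes both masks allow only a small fraction of the `T * T` pairs that full attention would compute, but the local mask never connects frames more than `W - 1` steps apart, whereas the strided mask connects frames near the start of the video to frames near the end.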

Code & Models

Repositories

ltcontext/ltcontext (PyTorch, official)

Videos

How Much Temporal Long-Term Context is Needed for Action Segmentation? (YouTube)

Taxonomy

Topics: Human Pose and Action Recognition · Video Analysis and Summarization · Multimodal Machine Learning Applications