Do we really need temporal convolutions in action segmentation?
Dazhao Du, Bing Su, Yu Li, Zhongang Qi, Lingyu Si, Ying Shan

TL;DR
This paper introduces a pure Transformer-based model called Temporal U-Transformer (TUT) for action segmentation in videos, addressing limitations of temporal convolutions by incorporating temporal sampling and a boundary-aware loss.
Contribution
The paper proposes a novel Transformer architecture for action segmentation that eliminates temporal convolutions and introduces a boundary-aware loss to improve boundary recognition.
Findings
TUT outperforms convolution-based models on benchmark datasets.
The boundary-aware loss improves boundary detection accuracy.
The U-shaped Transformer architecture reduces model complexity while maintaining performance.
Abstract
Action classification has made great progress, but segmenting and recognizing actions from long untrimmed videos remains a challenging problem. Most state-of-the-art methods focus on designing temporal convolution-based models, but the inflexibility of temporal convolutions and the difficulties in modeling long-term temporal dependencies restrict the potential of these models. Transformer-based models with adaptable and sequence modeling capabilities have recently been used in various tasks. However, the lack of inductive bias and the inefficiency of handling long video sequences limit the application of Transformer in action segmentation. In this paper, we design a pure Transformer-based model without temporal convolutions by incorporating temporal sampling, called Temporal U-Transformer (TUT). The U-Transformer architecture reduces complexity while introducing an inductive bias that…
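The abstract's claim that temporal sampling reduces complexity can be made concrete with a little arithmetic. The sketch below is illustrative only and is not the paper's implementation. It assumes full self-attention at every level, a sampling stride of 2, and a decoder that mirrors the encoder. The function names `attention_pairs` and `u_shaped_lengths` are hypothetical. It counts how many query-key pairs attention has to score when the sequence is shortened level by level and then restored, compared with a flat stack of the same depth.

```python
def attention_pairs(num_frames: int) -> int:
    """Query-key pairs scored by full self-attention over one sequence."""
    return num_frames * num_frames


def u_shaped_lengths(num_frames: int, depth: int, stride: int = 2) -> list[int]:
    """Sequence length seen by each encoder level, then each decoder level.

    Assumption: the encoder shortens the sequence by `stride` per level
    (ceil division) and the decoder mirrors it back to the original length.
    """
    encoder = [num_frames]
    for _ in range(depth):
        encoder.append(-(-encoder[-1] // stride))  # ceil division
    decoder = encoder[-2::-1]  # mirror of the encoder, bottleneck excluded
    return encoder + decoder


if __name__ == "__main__":
    lengths = u_shaped_lengths(num_frames=1024, depth=3)
    u_cost = sum(attention_pairs(n) for n in lengths)
    flat_cost = len(lengths) * attention_pairs(1024)
    print("per-level lengths:", lengths)
    print("U-shaped / flat attention cost:", round(u_cost / flat_cost, 3))
```

Under these assumptions, most of the attention cost sits in the two full-resolution levels, because each halving of the sequence cuts the pair count by a factor of four.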
Taxonomy
Topics: Human Pose and Action Recognition · Anomaly Detection Techniques and Applications · Advanced Neural Network Applications
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Softmax · Layer Normalization · Byte Pair Encoding · Dense Connections · Absolute Position Encodings · Dropout
