Do we really need temporal convolutions in action segmentation?

Dazhao Du; Bing Su; Yu Li; Zhongang Qi; Lingyu Si; Ying Shan

arXiv:2205.13425 · cs.CV · November 23, 2022

Open Access · 1 Repo

TL;DR

This paper introduces a pure Transformer-based model called Temporal U-Transformer (TUT) for action segmentation in videos, addressing limitations of temporal convolutions by incorporating temporal sampling and a boundary-aware loss.

Contribution

The paper proposes a novel Transformer architecture for action segmentation that eliminates temporal convolutions and introduces a boundary-aware loss to improve boundary recognition.

Findings

01. TUT outperforms convolution-based models on benchmark datasets.

02. The boundary-aware loss improves boundary detection accuracy.

03. The Transformer-based approach reduces model complexity while maintaining performance.

Abstract

Action classification has made great progress, but segmenting and recognizing actions from long untrimmed videos remains a challenging problem. Most state-of-the-art methods focus on designing temporal convolution-based models, but the inflexibility of temporal convolutions and the difficulties in modeling long-term temporal dependencies restrict the potential of these models. Transformer-based models with adaptable and sequence modeling capabilities have recently been used in various tasks. However, the lack of inductive bias and the inefficiency of handling long video sequences limit the application of Transformer in action segmentation. In this paper, we design a pure Transformer-based model without temporal convolutions by incorporating temporal sampling, called Temporal U-Transformer (TUT). The U-Transformer architecture reduces complexity while introducing an inductive bias that…
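The complexity claim in the abstract can be made concrete with a rough count. The sketch below is an illustration only, not the authors' implementation: it counts query-key score computations for full self-attention and compares a stack of layers at full temporal resolution with a U-shaped stack that shortens the sequence at each encoder level. The function names, the downsampling factor of 2, the depth of 4, and the one-layer-per-level layout are all assumptions chosen for the example; this summary does not state the paper's actual sampling factor, depth, or attention pattern.

```python
# Hypothetical illustration: count query-key score computations for
# full self-attention, with and without temporal downsampling.
# None of these names or settings come from the paper or its repository.


def attention_pairs(seq_len: int) -> int:
    """Number of query-key scores in one full self-attention layer."""
    return seq_len * seq_len


def flat_cost(seq_len: int, num_layers: int) -> int:
    """Cost when every layer attends over the full-length sequence."""
    return num_layers * attention_pairs(seq_len)


def u_shaped_cost(seq_len: int, depth: int, factor: int = 2) -> int:
    """Cost of a U-shaped stack with temporal downsampling.

    Assumed layout (not taken from the paper): one attention layer per
    encoder level, one bottleneck layer at the shortest length, and one
    attention layer per decoder level, mirroring the encoder lengths.
    """
    # Sequence length at each level: T, T/factor, T/factor^2, ...
    lengths = [max(1, seq_len // factor**level) for level in range(depth + 1)]
    encoder = sum(attention_pairs(n) for n in lengths[:-1])
    bottleneck = attention_pairs(lengths[-1])
    decoder = sum(attention_pairs(n) for n in reversed(lengths[:-1]))
    return encoder + bottleneck + decoder


if __name__ == "__main__":
    frames, depth = 4096, 4          # assumed video length and U depth
    layers = 2 * depth + 1           # same layer count as the U-shaped layout
    flat = flat_cost(frames, layers)
    u_shaped = u_shaped_cost(frames, depth)
    print(f"flat: {flat}, U-shaped: {u_shaped}, ratio: {u_shaped / flat:.3f}")
```

Under these assumptions the U-shaped layout needs well under half the score computations of the flat stack, because the quadratic cost shrinks by the square of the sampling factor at every level. Real savings depend on choices this summary does not specify.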

Code & Models

Repositories

ddz16/TUT
PyTorch · Official

Taxonomy

Topics: Human Pose and Action Recognition · Anomaly Detection Techniques and Applications · Advanced Neural Network Applications

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Softmax · Layer Normalization · Byte Pair Encoding · Dense Connections · Absolute Position Encodings · Dropout