Cross-Enhancement Transformer for Action Segmentation
Jiahui Wang, Zhenyou Wang, Shanna Zhuang, Hui Wang

TL;DR
This paper introduces a Cross-Enhancement Transformer with an encoder-decoder structure for action segmentation, effectively combining local and global information through self-attention to improve accuracy on multiple datasets.
Contribution
It proposes a novel Cross-Enhancement Transformer architecture with an interactive self-attention mechanism and a new loss function for better action segmentation.
Findings
Achieves state-of-the-art results on three challenging datasets
Effectively combines local and global features for frame recognition
Improves training with a new loss function to reduce over-segmentation errors
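To make "over-segmentation" concrete, here is a minimal sketch in which a prediction is mostly right frame by frame but splits two true actions into many short fragments. The labels and the helper `to_segments` are hypothetical and chosen only for illustration; this is not the paper's loss function, which this summary does not specify.

```python
from itertools import groupby

def to_segments(frame_labels):
    """Collapse a frame-wise label sequence into (label, length) segments."""
    return [(label, sum(1 for _ in run)) for label, run in groupby(frame_labels)]

# Toy data (assumed): two true actions, four frames each.
ground_truth = ["pour"] * 4 + ["stir"] * 4
# A prediction that flickers between the two labels.
prediction = ["pour", "pour", "stir", "pour", "stir", "stir", "pour", "stir"]

gt_segments = to_segments(ground_truth)
pred_segments = to_segments(prediction)

# Fraction of frames whose predicted label matches the ground truth.
frame_accuracy = sum(p == g for p, g in zip(prediction, ground_truth)) / len(ground_truth)

print(f"frame accuracy: {frame_accuracy:.2f}")
print(f"ground-truth segments: {len(gt_segments)}, predicted segments: {len(pred_segments)}")
```

Frame accuracy alone hides the fragmentation, which is why a training loss that penalizes such errors can help.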
Abstract
Temporal convolutions have been the paradigm of choice in action segmentation, enlarging the long-term receptive field by stacking more convolution layers. However, the higher layers lose the local information needed for frame recognition. To solve this problem, this paper proposes a novel encoder-decoder structure called the Cross-Enhancement Transformer. The approach learns temporal structure representations effectively through an interactive self-attention mechanism: the convolutional feature maps from each encoder layer are concatenated with a set of decoder features produced via self-attention. Local and global information are therefore used simultaneously across a series of frame actions. In addition, a new loss function that penalizes over-segmentation errors is proposed to enhance the training process. Experiments show that our framework achieves state-of-the-art performance on three…
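The idea of pairing convolutional (local) features with self-attention (global) features can be sketched as follows. This is a toy illustration under strong assumptions, not the authors' architecture: features are one-dimensional scalars, the attention has no learned projections, and the names `temporal_conv` and `self_attention` are hypothetical.

```python
import math

def temporal_conv(frames, kernel):
    """Local features: 1D convolution over time, zero-padded to keep the length."""
    half = len(kernel) // 2
    out = []
    for t in range(len(frames)):
        acc = 0.0
        for k, w in enumerate(kernel):
            idx = t + k - half
            if 0 <= idx < len(frames):
                acc += w * frames[idx]
        out.append(acc)
    return out

def self_attention(frames):
    """Global features: dot-product self-attention in which every frame attends to all frames."""
    out = []
    for q in frames:
        scores = [q * k for k in frames]
        m = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - m) for s in scores]
        total = sum(weights)
        out.append(sum(w / total * v for w, v in zip(weights, frames)))
    return out

# Toy per-frame features (assumed values).
frames = [0.1, 0.2, 0.9, 1.0, 0.8, 0.1]

local_feats = temporal_conv(frames, kernel=[0.25, 0.5, 0.25])
global_feats = self_attention(frames)

# Concatenate along the feature axis: each frame now carries both views.
combined = [[l, g] for l, g in zip(local_feats, global_feats)]
```

Each entry of `combined` holds a neighbourhood-based value and a sequence-wide value for the same frame, which is the sense in which local and global information are available simultaneously.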
Taxonomy
Topics: Human Pose and Action Recognition · Anomaly Detection Techniques and Applications · Gait Recognition and Analysis
Methods: Attention Is All You Need · Hierarchical Transferability Calibration Network · Linear Layer · Softmax · Dense Connections · Position-Wise Feed-Forward Layer · Adam · Absolute Position Encodings · Byte Pair Encoding · Residual Connection
