Improving action segmentation via explicit similarity measurement
Kamel Aouaidjia, Wenhao Zhang, Aofan Li, Chongsheng Zhang

TL;DR
This paper introduces ASESM, a novel action segmentation method that uses explicit similarity measurement and boundary correction to improve accuracy, outperforming existing approaches on multiple datasets.
Contribution
The paper proposes a new action segmentation framework that incorporates explicit similarity evaluation and a boundary correction algorithm, enhancing segmentation precision over prior methods.
Findings
Improved segmentation accuracy is demonstrated on three datasets.
Both the supervised and the unsupervised algorithms outperform existing methods.
Boundary correction and similarity voting significantly enhance boundary detection.
Abstract
Existing supervised action segmentation methods depend on the quality of frame-wise classification, using attention mechanisms or temporal convolutions to capture temporal dependencies. Even boundary detection-based methods depend primarily on the accuracy of an initial frame-wise classification, which can prevent precise identification of segments and boundaries when the prediction is of low quality. To address this problem, this paper proposes ASESM (Action Segmentation via Explicit Similarity Measurement), which enhances segmentation accuracy by incorporating explicit similarity evaluation across frames and predictions. Our supervised learning architecture feeds frame-level multi-resolution features into multiple Transformer encoders. The resulting frame-wise predictions are used for similarity voting to obtain a high-quality initial prediction. We apply a newly proposed…
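The abstract does not specify how similarity voting combines the multiple frame-wise predictions. As an illustration only, the sketch below assumes one plausible reading that is not taken from the paper: score each candidate prediction by its mean frame-level agreement with the other candidates and keep the most consistent one. The function names `frame_agreement` and `similarity_vote` are hypothetical.

```python
from typing import List, Sequence


def frame_agreement(a: Sequence[int], b: Sequence[int]) -> float:
    """Fraction of frames on which two frame-wise label sequences agree."""
    if len(a) != len(b):
        raise ValueError("predictions must cover the same number of frames")
    if len(a) == 0:
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / len(a)


def similarity_vote(predictions: List[Sequence[int]]) -> int:
    """Return the index of the prediction most similar to all the others.

    Hypothetical voting rule (an assumption, not the paper's method):
    each candidate's score is its mean agreement with every other
    candidate; ties go to the earliest candidate.
    """
    if not predictions:
        raise ValueError("need at least one prediction")
    if len(predictions) == 1:
        return 0
    scores = []
    for i, p in enumerate(predictions):
        sims = [frame_agreement(p, q) for j, q in enumerate(predictions) if j != i]
        scores.append(sum(sims) / len(sims))
    return max(range(len(predictions)), key=lambda i: scores[i])


# Three hypothetical encoder outputs over six frames (class ids per frame).
preds = [
    [0, 0, 1, 1, 2, 2],
    [0, 0, 1, 1, 1, 2],
    [0, 1, 1, 2, 2, 2],
]
best = similarity_vote(preds)  # candidate 0 has the highest mean agreement
```

Selecting a whole candidate keeps the chosen prediction temporally coherent. A per-frame majority vote is a simpler alternative, but it can stitch together fragments from different encoders.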
Taxonomy
Topics: Advanced Vision and Imaging · Human Pose and Action Recognition · Video Analysis and Summarization
Methods: Attention Is All You Need · Byte Pair Encoding · Layer Normalization · Residual Connection · Linear Layer · Dense Connections · Multi-Head Attention · Position-Wise Feed-Forward Layer · Adam · Softmax
