Improving action segmentation via explicit similarity measurement

Kamel Aouaidjia; Wenhao Zhang; Aofan Li; Chongsheng Zhang

arXiv:2502.10713 · cs.CV · February 18, 2025

Open Access

TL;DR

This paper introduces ASESM (Action Segmentation via Explicit Similarity Measurement), an action segmentation method that combines explicit similarity measurement with boundary correction and outperforms existing approaches on three datasets.

Contribution

The paper proposes a new action segmentation framework that incorporates explicit similarity evaluation and a boundary correction algorithm, enhancing segmentation precision over prior methods.

Findings

01. Segmentation accuracy improves on three datasets.

02. Both the supervised and the unsupervised algorithms outperform existing methods.

03. Boundary correction and similarity voting significantly improve boundary detection.

Abstract

Existing supervised action segmentation methods depend on the quality of frame-wise classification, using attention mechanisms or temporal convolutions to capture temporal dependencies. Even boundary detection-based methods depend primarily on the accuracy of an initial frame-wise classification, so they can miss the precise location of segments and boundaries when that prediction is of low quality. To address this problem, this paper proposes ASESM (Action Segmentation via Explicit Similarity Measurement), which improves segmentation accuracy by incorporating explicit similarity evaluation across frames and predictions. Our supervised learning architecture feeds frame-level multi-resolution features into multiple Transformer encoders. The resulting frame-wise predictions are combined by similarity voting to obtain a high-quality initial prediction. We apply a newly proposed…
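To make the idea of fusing several frame-wise predictions concrete, here is a minimal sketch in Python. It is not the paper's algorithm: the abstract does not specify how similarity voting is computed, so the agreement-weighted vote below is an assumed stand-in, and the names `agreement`, `similarity_vote`, and `to_segments` are hypothetical. Each prediction sequence is weighted by how often it agrees with the others, the per-frame label with the largest total weight wins, and the fused labels are then collapsed into segments.

```python
from collections import Counter
from itertools import groupby


def agreement(a, b):
    """Fraction of frames on which two equal-length label sequences agree."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def similarity_vote(predictions):
    """Fuse frame-wise label sequences with an agreement-weighted vote.

    Assumption (not from the paper): a sequence's weight is its mean
    agreement with every other sequence.
    """
    weights = []
    for i, p in enumerate(predictions):
        others = [agreement(p, q) for j, q in enumerate(predictions) if j != i]
        weights.append(sum(others) / len(others) if others else 1.0)

    fused = []
    for frame_labels in zip(*predictions):
        scores = Counter()
        for label, weight in zip(frame_labels, weights):
            scores[label] += weight
        # Pick the label with the largest accumulated weight for this frame.
        fused.append(max(scores, key=scores.get))
    return fused


def to_segments(labels):
    """Collapse frame-wise labels into (label, start, end_exclusive) runs."""
    segments, start = [], 0
    for label, run in groupby(labels):
        length = len(list(run))
        segments.append((label, start, start + length))
        start += length
    return segments


if __name__ == "__main__":
    # Three hypothetical frame-wise predictions for a five-frame clip.
    preds = [
        ["a", "a", "b", "b", "b"],
        ["a", "a", "a", "b", "b"],
        ["a", "b", "b", "b", "b"],
    ]
    fused = similarity_vote(preds)
    print(fused)
    print(to_segments(fused))
```

In this toy input the three sequences disagree only around the transition from `a` to `b`, so the vote settles the boundary at frame 2 and `to_segments` returns two segments, `("a", 0, 2)` and `("b", 2, 5)`.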

Taxonomy

Topics: Advanced Vision and Imaging · Human Pose and Action Recognition · Video Analysis and Summarization

Methods: Attention Is All You Need · Byte Pair Encoding · Layer Normalization · Residual Connection · Linear Layer · Dense Connections · Multi-Head Attention · Position-Wise Feed-Forward Layer · Adam · Softmax