Efficient Temporal Action Segmentation via Boundary-aware Query Voting
Peiyao Wang, Yuewei Lin, Erik Blasch, Jie Wei, Haibin Ling

TL;DR
BaFormer is a boundary-aware Transformer for efficient temporal action segmentation. It matches or exceeds the accuracy of previous state-of-the-art methods while using only 6% of the running time of DiffAct.
Contribution
The paper proposes BaFormer, a single-stage, boundary-aware Transformer that simplifies and accelerates temporal action segmentation. Instance queries and a global boundary query yield segment proposals, which a voting strategy classifies at inference time.
Findings
BaFormer reduces computational time to 6% of DiffAct.
It achieves comparable or better accuracy on benchmark datasets.
The method effectively balances efficiency and performance.
Abstract
Although the performance of Temporal Action Segmentation (TAS) has improved in recent years, achieving promising results often comes with a high computational cost due to dense inputs, complex model structures, and resource-intensive post-processing requirements. To improve the efficiency while keeping the performance, we present a novel perspective centered on per-segment classification. By harnessing the capabilities of Transformers, we tokenize each video segment as an instance token, endowed with intrinsic instance segmentation. To realize efficient action segmentation, we introduce BaFormer, a boundary-aware Transformer network. It employs instance queries for instance segmentation and a global query for class-agnostic boundary prediction, yielding continuous segment proposals. During inference, BaFormer employs a simple yet effective voting strategy to classify boundary-wise…
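The abstract is cut off before it details the voting step, so the following is only a minimal sketch of one plausible reading, not the authors' implementation. It assumes each frame has already been assigned to its best-matching instance query, that each query carries one predicted class, and that a global boundary score is available per frame. The names `segments_from_boundaries`, `vote_labels`, `frame_query`, `query_class`, `boundary_prob`, and the 0.5 threshold are all hypothetical.

```python
from collections import Counter
from typing import List


def segments_from_boundaries(boundary_prob: List[float],
                             threshold: float = 0.5) -> List[range]:
    """Split frames 0..T-1 into contiguous segments.

    Assumption: a new segment starts at every frame t > 0 whose
    class-agnostic boundary score reaches the threshold.
    """
    if not boundary_prob:
        return []
    starts = [0] + [t for t, p in enumerate(boundary_prob)
                    if t > 0 and p >= threshold]
    ends = starts[1:] + [len(boundary_prob)]
    return [range(s, e) for s, e in zip(starts, ends)]


def vote_labels(frame_query: List[int],
                query_class: List[str],
                boundary_prob: List[float],
                threshold: float = 0.5) -> List[str]:
    """Label every frame by majority vote inside its boundary-wise segment.

    frame_query[t] is the index of the instance query assigned to frame t
    (assumed to be computed beforehand, e.g. from per-query masks).
    query_class[q] is the class predicted for query q.
    """
    if len(frame_query) != len(boundary_prob):
        raise ValueError("frame_query and boundary_prob must have equal length")
    labels: List[str] = []
    for seg in segments_from_boundaries(boundary_prob, threshold):
        # Each frame in the segment casts one vote for its query.
        votes = Counter(frame_query[t] for t in seg)
        winner = votes.most_common(1)[0][0]
        # The whole segment takes the class of the winning query.
        labels.extend([query_class[winner]] * len(seg))
    return labels


# Toy example: 8 frames, 2 queries, one strong boundary at frame 4.
frame_query = [0, 0, 1, 0, 1, 1, 1, 0]
query_class = ["pour", "stir"]
boundary_prob = [0.0, 0.1, 0.2, 0.1, 0.9, 0.1, 0.0, 0.2]
print(vote_labels(frame_query, query_class, boundary_prob))
```

In this toy input the stray query assignments at frames 2 and 7 are outvoted inside their segments, so the first four frames come out as "pour" and the last four as "stir". A vote of this kind involves only counting, which may be why it is cheaper than the resource-intensive post-processing the abstract mentions.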
Taxonomy
Topics: Human Pose and Action Recognition · Video Analysis and Summarization · Anomaly Detection Techniques and Applications
Methods: Attention Is All You Need · Linear Layer · Byte Pair Encoding · Label Smoothing · Adam · Residual Connection · Position-Wise Feed-Forward Layer · Multi-Head Attention · Dropout · Dense Connections
