BIT: Bi-Level Temporal Modeling for Efficient Supervised Action Segmentation
Zijia Lu, Ehsan Elhamifar

TL;DR
The paper introduces BIT, an efficient bi-level temporal modeling framework for supervised action segmentation that captures long-range dependencies with lower computational cost and leverages textual transcripts when available.
Contribution
It proposes a novel bi-level architecture with explicit action tokens and cross-attention, enabling efficient long-range modeling and transcript integration for action segmentation.
Findings
Achieves state-of-the-art accuracy on four datasets.
Runs 30 times faster than existing transformer-based methods.
Effectively leverages textual transcripts to improve segmentation.
Abstract
We address the task of supervised action segmentation, which aims to partition a video into non-overlapping segments, each representing a different action. Recent works apply transformers to perform temporal modeling at the frame level, which incurs high computational cost and fails to capture action dependencies over long temporal horizons. To address these issues, we propose an efficient BI-level Temporal modeling (BIT) framework that learns explicit action tokens to represent action segments and performs temporal modeling on the frame and action levels in parallel, while maintaining a low computational cost. Our model contains (i) a frame branch that uses convolution to learn frame-level relationships, (ii) an action branch that uses a transformer to learn action-level dependencies with a small set of action tokens, and (iii) cross-attentions to allow communication between the two…
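The exchange between the frame branch and the action tokens described above can be illustrated with a minimal sketch of cross-attention. This is not the authors' implementation: the single-head attention without learned projections, the sizes `T`, `M`, and `D`, the synthetic features, and the helper `attention_score_count` are all assumptions chosen for illustration. The sketch only shows why attending between T frames and a small set of M tokens needs far fewer attention scores than frame-level self-attention over all T frames.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries, keys, values):
    """Scaled dot-product cross-attention (single head, no learned projections).

    Each query vector attends over all key vectors and returns a weighted
    average of the corresponding value vectors.
    """
    d = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[c] for w, v in zip(weights, values))
                    for c in range(len(values[0]))])
    return out

def attention_score_count(num_queries, num_keys):
    """Hypothetical cost proxy: number of query-key scores computed."""
    return num_queries * num_keys

# Assumed sizes: T frames, M action tokens, D feature channels.
T, M, D = 1000, 10, 4

# Synthetic stand-ins for frame features and action tokens.
frames = [[math.sin(0.01 * t + c) for c in range(D)] for t in range(T)]
tokens = [[math.cos(0.5 * m + c) for c in range(D)] for m in range(M)]

# Action tokens gather information from frames, and frames read it back.
tokens_updated = cross_attention(tokens, frames, frames)   # M x D
frames_updated = cross_attention(frames, tokens, tokens)   # T x D

# Compare the number of attention scores under this simple cost proxy.
frame_self_attention = attention_score_count(T, T)
bi_level_attention = (attention_score_count(M, T)    # tokens attend to frames
                      + attention_score_count(T, M)  # frames attend to tokens
                      + attention_score_count(M, M)) # tokens attend to tokens

print(frame_self_attention, bi_level_attention)
```

Under these assumed sizes the token-based scheme grows linearly in the number of frames, whereas frame-level self-attention grows quadratically. The actual speedup reported for BIT depends on details of the full model that are not reproduced here.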
Taxonomy
Topics: Human Pose and Action Recognition · Multimodal Machine Learning Applications · Video Analysis and Summarization
Methods: Convolution
