BIT: Bi-Level Temporal Modeling for Efficient Supervised Action Segmentation

Zijia Lu; Ehsan Elhamifar

arXiv:2308.14900·cs.CV·October 10, 2023·2 cites

TL;DR

The paper introduces BIT, an efficient bi-level temporal modeling framework for supervised action segmentation that captures long-range dependencies with lower computational cost and leverages textual transcripts when available.

Contribution

It proposes a novel bi-level architecture with explicit action tokens and cross-attention, enabling efficient long-range modeling and transcript integration for action segmentation.
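The efficiency argument can be illustrated with a rough count of attention score pairs. This is only a back-of-the-envelope sketch, not the paper's cost analysis: the function names, the frame count `T`, and the token count `M` are hypothetical values chosen for illustration, and the count ignores convolutions, projections, and everything else that affects measured runtime.

```python
def frame_self_attention_pairs(num_frames: int) -> int:
    """Query-key pairs when every frame attends to every other frame."""
    return num_frames * num_frames


def bi_level_attention_pairs(num_frames: int, num_tokens: int) -> int:
    """Query-key pairs when attention is restricted to action tokens.

    Assumed breakdown (illustrative only):
      - self-attention among the action tokens
      - cross-attention from tokens to frames and from frames to tokens
    """
    token_self = num_tokens * num_tokens
    cross_both_ways = 2 * num_frames * num_tokens
    return token_self + cross_both_ways


T = 6000  # hypothetical number of frames in a long video
M = 50    # hypothetical number of action tokens

frame_level = frame_self_attention_pairs(T)
bi_level = bi_level_attention_pairs(T, M)
ratio = frame_level / bi_level

print(f"frame-level pairs: {frame_level}")
print(f"bi-level pairs:    {bi_level}")
print(f"ratio:             {ratio:.1f}")
```

The frame-level count grows quadratically in the number of frames, while the bi-level count grows linearly once the number of action tokens is fixed, which is the intuition behind using a small token set for long videos.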

Findings

01

Achieves state-of-the-art accuracy on four datasets.

02

Runs 30 times faster than existing transformer-based methods.

03

Effectively leverages textual transcripts to improve segmentation.

Abstract

We address the task of supervised action segmentation, which aims to partition a video into non-overlapping segments, each representing a different action. Recent works apply transformers to perform temporal modeling at the frame level, which incurs high computational cost and fails to capture action dependencies over long temporal horizons. To address these issues, we propose an efficient BI-level Temporal modeling (BIT) framework that learns explicit action tokens to represent action segments and performs temporal modeling on the frame and action levels in parallel, while maintaining a low computational cost. Our model contains (i) a frame branch that uses convolution to learn frame-level relationships, (ii) an action branch that uses a transformer to learn action-level dependencies with a small set of action tokens, and (iii) cross-attentions to allow communication between the two…
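The communication between action tokens and frames described in (iii) can be sketched with plain scaled dot-product cross-attention. This is a minimal illustration written in dependency-free Python, not the authors' implementation: it omits learned projections, multiple heads, and the convolutional frame branch, and the toy `frames` and `tokens` values are made up.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys_values):
    """Scaled dot-product attention without learned projections.

    Each query vector is replaced by a softmax-weighted average of the
    key/value vectors (keys and values are the same vectors here).
    """
    dim = len(queries[0])
    outputs = []
    for q in queries:
        scores = [
            sum(qi * ki for qi, ki in zip(q, kv)) / math.sqrt(dim)
            for kv in keys_values
        ]
        weights = softmax(scores)
        outputs.append(
            [sum(w * kv[c] for w, kv in zip(weights, keys_values)) for c in range(dim)]
        )
    return outputs


# Toy data (hypothetical): four frame features and two action tokens.
frames = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
tokens = [[5.0, 0.0], [0.0, 5.0]]

# Tokens gather information from frames, then frames read back from tokens.
updated_tokens = cross_attention(tokens, frames)
updated_frames = cross_attention(frames, updated_tokens)
```

In this toy setup the first token ends up dominated by the first two frames and the second token by the last two, which mirrors the idea of a token summarizing one action segment.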

Taxonomy

Topics: Human Pose and Action Recognition · Multimodal Machine Learning Applications · Video Analysis and Summarization

Methods: Convolution