Enhancing Transformer Backbone for Egocentric Video Action Segmentation

Sakib Reza; Balaji Sundareshan; Mohsen Moghaddam; Octavia Camps

arXiv:2305.11365 · cs.CV · May 25, 2023 · 2 citations

TL;DR

This paper enhances the transformer backbone for egocentric video action segmentation with dual dilated attention and encoder-decoder cross-connections, improving performance on the GTEA and HOI4D benchmark datasets.

Contribution

It introduces a dual dilated attention mechanism and cross-connections between the transformer's encoder and decoder blocks, and it leverages visual-language features, to improve egocentric video action segmentation.

Findings

01 Outperforms state-of-the-art on GTEA and HOI4D datasets

02 Demonstrates effectiveness of dual dilated attention and cross-connections

03 Ablation studies validate component contributions

Abstract

Egocentric temporal action segmentation in videos is a crucial task in computer vision with applications in various fields such as mixed reality, human behavior analysis, and robotics. Although recent research has utilized advanced visual-language frameworks, transformers remain the backbone of action segmentation models. Therefore, it is necessary to improve transformers to enhance the robustness of action segmentation models. In this work, we propose two novel ideas to enhance the state-of-the-art transformer for action segmentation. First, we introduce a dual dilated attention mechanism to adaptively capture hierarchical representations in both local-to-global and global-to-local contexts. Second, we incorporate cross-connections between the encoder and decoder blocks to prevent the loss of local context by the decoder. We also utilize state-of-the-art visual-language representation…
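The abstract does not spell out how the dual dilated attention is computed, so the following is only an illustrative sketch of the general idea, not the authors' implementation. It assumes each frame attends to a three-frame window whose spacing is set by a dilation, and that one layer averages two such windows: one whose dilation grows with depth (local-to-global) and one whose dilation shrinks with depth (global-to-local). The function names, the powers-of-two schedule, the window width, and the averaging fusion are all assumptions made for this example.

```python
import numpy as np


def dilated_attention(x, dilation, half_width=1):
    """Windowed self-attention over time (illustrative, no learned weights).

    Frame t attends only to frames t + k * dilation for
    k in [-half_width, half_width]; out-of-range neighbours are masked.
    x has shape (T, d): T frames, d feature channels.
    """
    T, d = x.shape
    offsets = np.arange(-half_width, half_width + 1) * dilation
    idx = np.arange(T)[:, None] + offsets[None, :]        # (T, W) neighbour indices
    valid = (idx >= 0) & (idx < T)                        # mask for out-of-range frames
    keys = x[np.clip(idx, 0, T - 1)]                      # (T, W, d) gathered neighbours

    # Scaled dot-product scores between each frame and its neighbours.
    scores = np.einsum("td,twd->tw", x, keys) / np.sqrt(d)
    scores = np.where(valid, scores, -np.inf)

    # Numerically stable softmax; offset 0 is always valid, so the max is finite.
    scores = scores - scores.max(axis=1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)
    return np.einsum("tw,twd->td", weights, keys)


def dual_dilated_layer(x, layer, num_layers):
    """Hypothetical fusion of two dilation schedules in one layer.

    Assumption: the local-to-global branch uses dilation 2**layer and the
    global-to-local branch uses 2**(num_layers - 1 - layer); the two
    outputs are simply averaged.
    """
    local_to_global = dilated_attention(x, 2 ** layer)
    global_to_local = dilated_attention(x, 2 ** (num_layers - 1 - layer))
    return 0.5 * (local_to_global + global_to_local)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    frames = rng.standard_normal((16, 8))                 # 16 frames, 8 features
    out = dual_dilated_layer(frames, layer=0, num_layers=4)
    print(out.shape)
```

In this sketch the first layer pairs a dilation of 1 with a dilation of 8 (for four layers), so every layer sees both fine and coarse temporal context. A real model would add learned query, key, and value projections and a learned fusion, which are omitted here.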

Taxonomy

Topics: Human Pose and Action Recognition · Multimodal Machine Learning Applications · Anomaly Detection Techniques and Applications