Modality Mixer for Multi-modal Action Recognition

Sumin Lee, Sangmin Woo, Yeonju Park, Muhammad Adi Nugroho, and Changick Kim

arXiv:2208.11314 · cs.CV · February 22, 2023

TL;DR

The paper introduces M-Mixer, a novel multi-modal action recognition network that effectively leverages complementary modalities and temporal context, outperforming existing methods on multiple benchmark datasets.

Contribution

It proposes the M-Mixer network with the Multi-modal Contextualization Unit (MCU), a new recurrent component that encodes temporal and cross-modal information for improved action recognition.

Findings

01. Outperforms state-of-the-art methods on the NTU RGB+D 60, NTU RGB+D 120, and NW-UCLA datasets.

02. Demonstrates the effectiveness of the MCU in encoding multi-modal temporal features.

03. Provides comprehensive ablation studies validating the approach.

Abstract

In multi-modal action recognition, it is important to consider not only the complementary nature of different modalities but also global action content. In this paper, we propose a novel network, named Modality Mixer (M-Mixer) network, to leverage complementary information across modalities and temporal context of an action for multi-modal action recognition. We also introduce a simple yet effective recurrent unit, called Multi-modal Contextualization Unit (MCU), which is a core component of M-Mixer. Our MCU temporally encodes a sequence of one modality (e.g., RGB) with action content features of other modalities (e.g., depth, IR). This process encourages M-Mixer to exploit global action content and also to supplement complementary information of other modalities. As a result, our proposed method outperforms state-of-the-art methods on NTU RGB+D 60, NTU RGB+D 120, and NW-UCLA datasets.…
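
The abstract describes the MCU only in words, so the following is a rough illustration of the idea rather than the authors' formulation. It sketches a recurrent cell that encodes one modality's sequence frame by frame while conditioning every step on a fixed content vector built from the other modalities. Everything named here is an assumption made for illustration: the class `ToyContextualizationUnit`, the temporal mean pooling used to form the content vector, the GRU-style gate, and the random weights all stand in for details the abstract does not give.

```python
import math
import random


def mean_pool(seq):
    """Average a sequence of feature vectors over time (assumed pooling)."""
    steps = len(seq)
    return [sum(frame[i] for frame in seq) / steps for i in range(len(seq[0]))]


def matvec(weights, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in weights]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def init_matrix(rows, cols, rng, scale=0.1):
    """Small random weights; a trained model would learn these."""
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


class ToyContextualizationUnit:
    """Hypothetical GRU-style cell, NOT the paper's MCU equations."""

    def __init__(self, in_dim, ctx_dim, hid_dim, seed=0):
        rng = random.Random(seed)
        joint_dim = in_dim + ctx_dim + hid_dim
        self.hid_dim = hid_dim
        self.w_gate = init_matrix(hid_dim, joint_dim, rng)
        self.w_cand = init_matrix(hid_dim, joint_dim, rng)

    def step(self, x_t, ctx, h):
        # Each step sees the current frame, the cross-modal content
        # vector, and the previous hidden state.
        joint = x_t + ctx + h  # list concatenation
        gate = [sigmoid(a) for a in matvec(self.w_gate, joint)]
        cand = [math.tanh(a) for a in matvec(self.w_cand, joint)]
        return [(1 - g) * hi + g * c for g, hi, c in zip(gate, h, cand)]

    def encode(self, primary_seq, other_seqs):
        # Content vector: pooled features of every other modality, concatenated.
        ctx = []
        for seq in other_seqs:
            ctx += mean_pool(seq)
        h = [0.0] * self.hid_dim
        for x_t in primary_seq:
            h = self.step(x_t, ctx, h)
        return h


# Toy data: 5 frames of 4-dimensional features per modality.
data_rng = random.Random(1)
rgb = [[data_rng.random() for _ in range(4)] for _ in range(5)]
depth = [[data_rng.random() for _ in range(4)] for _ in range(5)]
ir = [[data_rng.random() for _ in range(4)] for _ in range(5)]

unit = ToyContextualizationUnit(in_dim=4, ctx_dim=8, hid_dim=3)
h = unit.encode(rgb, [depth, ir])
print(h)
```

In this sketch the RGB sequence drives the recurrence while depth and IR contribute only through the pooled content vector, which mirrors the abstract's description of encoding one modality with the action content features of the others. A full model in this style would presumably run one such unit per modality and combine the resulting states for classification, but the abstract does not specify that step, so it is left out here.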

Taxonomy

Topics: Human Pose and Action Recognition · Multimodal Machine Learning Applications · Hand Gesture Recognition Systems