Unified Contrastive Fusion Transformer for Multimodal Human Action Recognition

Kyoung Ok Yang; Junho Koh; Jun Won Choi

arXiv:2309.05032·cs.CV·September 12, 2023

Open Access

TL;DR

This paper introduces UCFFormer, a multimodal fusion transformer that combines contrastive learning with Factorized Time-Modality Attention to improve human action recognition from data captured by different sensors, reporting state-of-the-art results on the UTD-MHAD and NTU RGB+D datasets.

Contribution

The paper proposes a unified transformer architecture for multimodal data fusion in HAR that pairs factorized attention over the time and modality domains with contrastive learning to align features across modalities.

Findings

01

UCFFormer outperforms existing methods on UTD-MHAD and NTU RGB+D datasets.

02

The model effectively reduces modality discrepancy through contrastive learning.

03

The model achieves state-of-the-art accuracy in multimodal human action recognition.
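The idea behind finding 02, pulling the embeddings of the same sample from two modalities together while pushing mismatched pairs apart, can be illustrated with a generic symmetric InfoNCE-style loss. This is only a sketch under stated assumptions: the function name `info_nce`, the temperature value, and the use of cosine similarity are illustrative choices, not the loss defined in the paper.

```python
import numpy as np

def info_nce(za, zb, temperature=0.1):
    """Symmetric InfoNCE-style loss between two modality embeddings.

    za, zb: arrays of shape (N, D); row i of za and row i of zb are assumed
    to come from the same sample (a positive pair). Hypothetical helper,
    not the paper's exact formulation.
    """
    # L2-normalise so the dot product is cosine similarity.
    za = za / np.linalg.norm(za, axis=1, keepdims=True)
    zb = zb / np.linalg.norm(zb, axis=1, keepdims=True)
    logits = za @ zb.T / temperature  # (N, N) pairwise similarities

    def diagonal_cross_entropy(l):
        # Row-wise log-softmax, then pick the positive (diagonal) entries.
        l = l - l.max(axis=1, keepdims=True)
        log_prob = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_prob))

    # Average the a->b and b->a directions.
    return 0.5 * (diagonal_cross_entropy(logits) + diagonal_cross_entropy(logits.T))
```

With this kind of objective, well-aligned modality embeddings give a lower loss than embeddings whose pairing has been scrambled, which is the sense in which the loss reduces the discrepancy between modalities.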

Abstract

Various types of sensors have been considered to develop human action recognition (HAR) models. Robust HAR performance can be achieved by fusing multimodal data acquired by different sensors. In this paper, we introduce a new multimodal fusion architecture, referred to as Unified Contrastive Fusion Transformer (UCFFormer), designed to integrate data with diverse distributions to enhance HAR performance. Based on the embedding features extracted from each modality, UCFFormer employs the Unified Transformer to capture the inter-dependency among embeddings in both time and modality domains. We present the Factorized Time-Modality Attention to perform self-attention efficiently for the Unified Transformer. UCFFormer also incorporates contrastive learning to reduce the discrepancy in feature distributions across various modalities, thus generating semantically aligned features for information…
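One way to read "factorized" attention over time and modality is to attend along each axis separately instead of over all time-modality tokens at once. The sketch below is a minimal illustration under several assumptions that do not come from the abstract: single-head attention with no learned projections, time attention applied before modality attention, and residual connections after each step. The function names are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x):
    """Single-head scaled dot-product self-attention.

    x: (..., L, D); attention runs over the L axis, leading axes are batch.
    Queries, keys, and values are all x itself (no learned weights).
    """
    d = x.shape[-1]
    scores = x @ np.swapaxes(x, -1, -2) / np.sqrt(d)  # (..., L, L)
    return softmax(scores, axis=-1) @ x

def factorized_time_modality_attention(x):
    """x: (T, M, D) embeddings for T time steps and M modalities."""
    # Time attention: each modality attends over its own T time steps.
    xt = np.swapaxes(x, 0, 1)          # (M, T, D)
    xt = xt + self_attention(xt)       # residual (assumed)
    x = np.swapaxes(xt, 0, 1)          # back to (T, M, D)
    # Modality attention: each time step attends across the M modalities.
    x = x + self_attention(x)          # batch axis T, sequence axis M
    return x

def attention_score_counts(T, M):
    """Number of pairwise attention scores: joint vs. factorized."""
    joint = (T * M) ** 2
    factorized = M * T ** 2 + T * M ** 2
    return joint, factorized
```

For example, with T = 8 time steps and M = 3 modalities, joint attention over all 24 tokens computes 576 pairwise scores, while the factorized form computes 3·64 + 8·9 = 264, which is where the efficiency gain of factorization comes from.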

Taxonomy

Topics: Human Pose and Action Recognition · Anomaly Detection Techniques and Applications · Context-Aware Activity Recognition Systems

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Residual Connection · Adam · Byte Pair Encoding · Softmax · Dropout · Label Smoothing · Absolute Position Encodings