Unified Contrastive Fusion Transformer for Multimodal Human Action Recognition
Kyoung Ok Yang, Junho Koh, Jun Won Choi

TL;DR
This paper introduces UCFFormer, a multimodal fusion transformer that combines contrastive learning with Factorized Time-Modality Attention to improve human action recognition across diverse sensor data, achieving state-of-the-art results on UTD-MHAD and NTU RGB+D.
Contribution
The paper proposes a unified transformer architecture that combines factorized time-modality attention with contrastive learning to fuse multimodal data for human action recognition (HAR).
Findings
UCFFormer outperforms existing methods on UTD-MHAD and NTU RGB+D datasets.
The model effectively reduces modality discrepancy through contrastive learning.
UCFFormer achieves state-of-the-art accuracy in multimodal human action recognition.
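To make the contrastive-alignment finding concrete, here is a minimal NumPy sketch of how paired embeddings from two modalities can be pulled together. It assumes a symmetric InfoNCE-style loss with cosine similarity and a temperature of 0.1; this summary does not state the exact loss used in UCFFormer, so the function name `info_nce`, its parameters, and the loss form are illustrative assumptions rather than the authors' implementation.

```python
import numpy as np

def info_nce(a, b, temperature=0.1):
    """Symmetric InfoNCE-style loss (an assumed loss form, not taken from the paper).

    a, b: arrays of shape (N, D); row i of `a` and row i of `b` are embeddings
    of the same sample from two different modalities.
    """
    # L2-normalize so the dot product is cosine similarity.
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    logits = a @ b.T / temperature  # (N, N) similarity matrix

    def cross_entropy_on_diagonal(l):
        # Numerically stable log-softmax over each row; the positive pair
        # for row i sits at column i.
        l = l - l.max(axis=1, keepdims=True)
        log_prob = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_prob))

    # Average the a-to-b and b-to-a directions.
    return 0.5 * (cross_entropy_on_diagonal(logits)
                  + cross_entropy_on_diagonal(logits.T))
```

Under these assumptions, the loss is small when matching samples from the two modalities have similar embeddings and large when they do not, which is the sense in which contrastive training reduces modality discrepancy.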
Abstract
Various types of sensors have been considered to develop human action recognition (HAR) models. Robust HAR performance can be achieved by fusing multimodal data acquired by different sensors. In this paper, we introduce a new multimodal fusion architecture, referred to as Unified Contrastive Fusion Transformer (UCFFormer), designed to integrate data with diverse distributions to enhance HAR performance. Based on the embedding features extracted from each modality, UCFFormer employs the Unified Transformer to capture the inter-dependency among embeddings in both time and modality domains. We present the Factorized Time-Modality Attention to perform self-attention efficiently for the Unified Transformer. UCFFormer also incorporates contrastive learning to reduce the discrepancy in feature distributions across various modalities, thus generating semantically aligned features for information…
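The efficiency argument behind attention factorized over time and modality can be sketched as follows. This is a rough NumPy illustration, not the paper's Factorized Time-Modality Attention: it assumes a single head, identity query/key/value projections, residual connections, and a time-then-modality ordering, none of which are specified in the abstract. All function names are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x):
    """Single-head scaled dot-product self-attention.

    x: (..., L, D). Attention runs over the L axis; leading axes are batch.
    Identity projections are used for brevity (an assumption).
    """
    d = x.shape[-1]
    scores = x @ np.swapaxes(x, -1, -2) / np.sqrt(d)  # (..., L, L)
    return softmax(scores, axis=-1) @ x

def factorized_time_modality_attention(x):
    """x: (T, M, D) embeddings for T time steps and M modalities."""
    # Time attention: within each modality, tokens attend across T time steps.
    xt = np.swapaxes(x, 0, 1)      # (M, T, D)
    xt = xt + self_attention(xt)   # residual connection (assumed)
    x = np.swapaxes(xt, 0, 1)      # back to (T, M, D)
    # Modality attention: within each time step, tokens attend across M modalities.
    return x + self_attention(x)

def score_counts(T, M):
    """Attention scores computed by joint vs. factorized attention."""
    joint = (T * M) ** 2                 # every token attends to every token
    factorized = M * T**2 + T * M**2     # time pass plus modality pass
    return joint, factorized
```

For example, with T = 8 time steps and M = 3 modalities, `score_counts(8, 3)` returns `(576, 264)`: joint attention over all 24 tokens computes 576 scores, while the two factorized passes compute 264 in total. The gap widens as T grows.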
Taxonomy
Topics: Human Pose and Action Recognition · Anomaly Detection Techniques and Applications · Context-Aware Activity Recognition Systems
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Residual Connection · Adam · Byte Pair Encoding · Softmax · Dropout · Label Smoothing · Absolute Position Encodings
