Multimodal Foundation Model for Cross-Modal Retrieval and Activity Recognition Tasks

Koki Matsuishi; Kosuke Ukita; Tsuyoshi Okita

arXiv:2506.03174·cs.CV·June 5, 2025

TL;DR

This paper introduces AURA-MFM, a multimodal foundation model that integrates video, motion capture, IMU, and text data to improve detailed human activity analysis and recognition, especially in zero-shot scenarios.

Contribution

The paper presents a novel multimodal foundation model that combines four data modalities, including third-person video and motion capture, to enhance activity understanding beyond existing models.

Findings

01 Outperforms existing methods in retrieval and activity recognition tasks.

02 Achieves a zero-shot action recognition F1-score of 0.6226.

03 Achieves a zero-shot accuracy of 0.7320, significantly higher than previous approaches.

Abstract

In recent years, the widespread adoption of wearable devices has highlighted the growing importance of behavior analysis using IMU. While applications span diverse fields such as healthcare and robotics, recent studies have increasingly focused on multimodal analysis, in addition to unimodal analysis. Several studies have proposed multimodal foundation models that incorporate first-person video and text data; however, these models still fall short in providing a detailed analysis of full-body human activity. To address this limitation, we propose Activity Understanding and Representations Alignment - Multimodal Foundation Model (AURA-MFM), a foundational model integrating four modalities: third-person video, motion capture, IMU, and text. By incorporating third-person video and motion capture data, the model enables a detailed and multidimensional understanding of human activity, which…
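
The abstract is truncated before it describes how zero-shot recognition is performed. A common pattern in models that align several modalities in one embedding space is to embed each candidate activity label as text and pick the label whose embedding is closest to the sensor embedding. The sketch below illustrates that pattern only; it is an assumption, not the authors' confirmed procedure. The embeddings are hand-written stand-ins for encoder outputs, and `zero_shot_classify` is a hypothetical helper name.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so a dot product equals cosine similarity."""
    norm = np.linalg.norm(x, axis=axis, keepdims=True)
    return x / np.maximum(norm, eps)


def zero_shot_classify(sensor_emb, label_emb):
    """Return the index of the most similar label for each sensor embedding.

    sensor_emb: (n_samples, dim) array, e.g. outputs of an IMU encoder (assumed).
    label_emb:  (n_labels, dim) array, e.g. outputs of a text encoder (assumed).
    """
    sims = l2_normalize(sensor_emb) @ l2_normalize(label_emb).T
    return sims.argmax(axis=1), sims


# Hypothetical activity labels and toy embeddings; a real system would
# obtain these vectors from trained text and sensor encoders.
labels = ["walking", "sitting", "jumping"]
label_emb = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
])
sensor_emb = np.array([
    [0.9, 0.1, 0.0],   # closest to "walking"
    [0.1, 0.2, 2.0],   # closest to "jumping"
])

pred_idx, sims = zero_shot_classify(sensor_emb, label_emb)
predictions = [labels[i] for i in pred_idx]
print(predictions)
```

No labeled sensor data for the target classes is needed at inference time in this setup, which is what makes the classification zero-shot: only the label text and a shared embedding space are required.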

Taxonomy

Topics: Human Pose and Action Recognition · Context-Aware Activity Recognition Systems · Emotion and Mood Recognition