MU-MAE: Multimodal Masked Autoencoders-Based One-Shot Learning
Rex Liu, Xin Liu

TL;DR
Mu-MAE introduces a self-supervised multimodal autoencoder with a novel masking strategy, enabling effective one-shot human activity recognition from video and sensor data without external datasets.
Contribution
The paper proposes Mu-MAE, a multimodal masked autoencoder with synchronized masking for self-supervised pretraining and a cross-attention fusion layer for improved one-shot classification.
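As a rough illustration of what a cross-attention fusion layer does, the sketch below lets tokens from one modality attend over tokens from another using single-head scaled dot-product attention. This is not the paper's implementation: the function name `cross_attention`, the choice of video tokens as queries and sensor tokens as context, the token counts, the embedding size, and the random projection matrices are all assumptions made for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, context, w_q, w_k, w_v):
    """Single-head cross-attention (hypothetical helper, not from the paper).

    queries: (n_q, dim) tokens from one modality, e.g. video.
    context: (n_c, dim) tokens from the other modality, e.g. wearable sensors.
    Returns the fused tokens (n_q, dim) and the attention weights (n_q, n_c).
    """
    q = queries @ w_q
    k = context @ w_k
    v = context @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])  # scaled dot-product similarity
    weights = softmax(scores, axis=-1)       # each query row sums to 1
    return weights @ v, weights

# Illustrative sizes only: 5 video tokens, 3 sensor tokens, 8-dim embeddings.
rng = np.random.default_rng(0)
dim = 8
video_tokens = rng.normal(size=(5, dim))
sensor_tokens = rng.normal(size=(3, dim))
w_q, w_k, w_v = (rng.normal(size=(dim, dim)) for _ in range(3))

fused, weights = cross_attention(video_tokens, sensor_tokens, w_q, w_k, w_v)
print(fused.shape, weights.shape)
```

Each fused video token is a weighted mix of sensor-token values, so a downstream one-shot classifier sees features that already combine both modalities.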
Findings
Achieves up to 80.17% accuracy on MMAct one-shot classification
Outperforms existing approaches without external data
Effective spatiotemporal feature learning through novel masking strategy
Abstract
With the exponential growth of multimedia data, leveraging multimodal sensors presents a promising approach for improving accuracy in human activity recognition. Nevertheless, accurately identifying these activities using both video data and wearable sensor data presents challenges due to labor-intensive data annotation and reliance on external pretrained models or additional data. To address these challenges, we introduce Multimodal Masked Autoencoders-Based One-Shot Learning (Mu-MAE). Mu-MAE integrates a multimodal masked autoencoder with a synchronized masking strategy tailored for wearable sensors. This masking strategy compels the networks to capture more meaningful spatiotemporal features, which enables effective self-supervised pretraining without the need for external data. Furthermore, Mu-MAE leverages the representation extracted from multimodal masked autoencoders as…
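One plausible reading of a synchronized masking strategy for wearable sensors is that the same temporal patches are hidden in every sensor stream, so the model cannot reconstruct a masked segment by copying it from another sensor. The sketch below illustrates that reading only. The function name `synchronized_mask`, the number of sensors, the patch count, and the 75% mask ratio are assumptions for the example, not values taken from the paper.

```python
import numpy as np

def synchronized_mask(num_sensors, num_patches, mask_ratio, rng):
    """Build a boolean mask of shape (num_sensors, num_patches); True = masked.

    Hypothetical helper: the same randomly chosen temporal patches are masked
    in every sensor stream, so all rows of the mask are identical.
    """
    num_masked = int(round(mask_ratio * num_patches))
    masked_idx = rng.choice(num_patches, size=num_masked, replace=False)
    row = np.zeros(num_patches, dtype=bool)
    row[masked_idx] = True
    # Repeat the single temporal mask across all sensors.
    return np.tile(row, (num_sensors, 1))

# Illustrative sizes only: 4 sensor streams, 20 temporal patches each.
rng = np.random.default_rng(0)
mask = synchronized_mask(num_sensors=4, num_patches=20, mask_ratio=0.75, rng=rng)

# Patches that stay visible are the ones an encoder would receive.
visible_per_sensor = (~mask).sum(axis=1)
print(mask.shape, visible_per_sensor)
```

By contrast, drawing an independent mask per sensor would often leave a given time step visible in at least one stream, which makes reconstruction easier and may weaken the pretraining signal.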
Taxonomy
Topics: Domain Adaptation and Few-Shot Learning
Methods: Softmax · Attention Is All You Need
