Multimodal Fusion and Interpretability in Human Activity Recognition: A Reproducible Framework for Sensor-Based Modeling

Yiyao Yang; Yasemin Gulbahar

arXiv:2510.22410·stat.AP·May 5, 2026

TL;DR

This paper presents a reproducible, modular framework for multimodal human activity recognition that integrates sensor data preprocessing, fusion strategies, and interpretability analysis. In its comparison of fusion strategies, late fusion achieves the highest validation accuracy.

Contribution

The paper introduces a unified preprocessing workflow, compares early, late, and hybrid fusion methods, and evaluates interpretability, providing a transferable template for sensor-based activity modeling.

Findings

01

Late fusion achieves the highest validation accuracy.

02

Hybrid fusion outperforms early fusion.

03

RFID signals significantly improve recognition performance.
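
For readers less familiar with the fusion terms used in these findings, the following is a minimal schematic of the difference between early and late fusion. It is not the paper's implementation: the paper uses LSTM-based models, whereas this sketch substitutes random linear classifiers as stand-ins, and the feature sizes, the names `feats` and `linear_head`, and the choice to average probabilities in late fusion are all assumptions made for illustration. Hybrid fusion is omitted because its exact design is not described on this page.

```python
import numpy as np

def softmax(z):
    """Row-wise softmax with the usual max-subtraction for stability."""
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
n_samples, n_classes = 5, 3

# Hypothetical per-window feature vectors for each modality.
feats = {
    "video": rng.normal(size=(n_samples, 16)),
    "audio": rng.normal(size=(n_samples, 8)),
    "rfid": rng.normal(size=(n_samples, 4)),
}

def linear_head(in_dim):
    """Stand-in classifier weights (random, untrained); the paper uses LSTMs."""
    return rng.normal(size=(in_dim, n_classes))

# Early fusion: concatenate modality features, then apply ONE classifier.
x_early = np.concatenate(list(feats.values()), axis=1)   # (5, 28)
p_early = softmax(x_early @ linear_head(x_early.shape[1]))

# Late fusion: one classifier PER modality, then combine their predictions
# (here by averaging class probabilities, which is one common choice).
per_modality = [softmax(x @ linear_head(x.shape[1])) for x in feats.values()]
p_late = np.mean(per_modality, axis=0)

print(p_early.shape, p_late.shape)
```

Both outputs are per-sample class distributions of the same shape; the strategies differ only in where the modalities are combined, before the model (early) or after it (late).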

Abstract

The research introduces a reproducible framework for transforming raw, heterogeneous sensor streams into aligned, semantically meaningful representations for multimodal human activity recognition. Grounded in the Carnegie Mellon University Multi-Modal Activity Database (CMU-MMAC) and focused on the naturalistic Subject 07 Brownie session, the study traces the full pipeline from data ingestion to modeling and interpretation. In contrast to black-box preprocessing, the study proposes a unified workflow that temporally aligns video, audio, and RFID through resampling, grayscale conversion, sliding-window segmentation, and modality-specific normalization, producing standardized fused tensors suitable for downstream learning. Building on this foundation, the work systematically compares early, late, and hybrid fusion strategies using LSTM-based models implemented with PyTorch and…
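
The preprocessing steps named in the abstract (resampling to a common timeline, modality-specific normalization, and sliding-window segmentation) can be sketched roughly as follows. This is an illustrative sketch under stated assumptions, not the authors' pipeline: the synthetic arrays, feature dimensions, window and stride values, linear interpolation, and z-score normalization are all hypothetical choices, and grayscale conversion is assumed to have already produced the per-frame video features.

```python
import numpy as np

def resample_to_length(x, n_out):
    """Linearly interpolate each column of x (T, F) onto n_out evenly spaced steps."""
    t_in = np.linspace(0.0, 1.0, num=x.shape[0])
    t_out = np.linspace(0.0, 1.0, num=n_out)
    cols = [np.interp(t_out, t_in, x[:, j]) for j in range(x.shape[1])]
    return np.stack(cols, axis=1)

def zscore(x, eps=1e-8):
    """Normalize each feature column to zero mean and (near) unit variance."""
    return (x - x.mean(axis=0)) / (x.std(axis=0) + eps)

def sliding_windows(x, window, stride):
    """Cut x (T, F) into overlapping windows, returning an (N, window, F) array."""
    starts = range(0, x.shape[0] - window + 1, stride)
    return np.stack([x[s:s + window] for s in starts])

# Synthetic stand-ins for three streams sampled at different rates (assumed shapes).
rng = np.random.default_rng(0)
video = rng.normal(size=(300, 16))    # e.g. features from grayscale frames
audio = rng.normal(size=(1000, 8))
rfid = rng.integers(0, 2, size=(50, 4)).astype(float)  # binary tag reads

T = 300  # common number of time steps after alignment (assumed)
# Note: linear interpolation makes the binary RFID reads fractional.
aligned = [zscore(resample_to_length(m, T)) for m in (video, audio, rfid)]

# Concatenate along the feature axis, then segment into windows.
fused = np.concatenate(aligned, axis=1)                 # (300, 28)
windows = sliding_windows(fused, window=30, stride=15)
print(windows.shape)  # (19, 30, 28)
```

Each window is a `(time, features)` slice of the fused stream, which is the kind of tensor a sequence model such as an LSTM consumes. Normalizing each modality separately before concatenation keeps one stream's scale from dominating the others.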
