Multimodal Fusion and Interpretability in Human Activity Recognition: A Reproducible Framework for Sensor-Based Modeling
Yiyao Yang, Yasemin Gulbahar

TL;DR
This paper presents a reproducible, modular framework for multimodal human activity recognition that integrates sensor data preprocessing, fusion strategies, and interpretability analysis, demonstrating improved accuracy and insights.
Contribution
It introduces a unified preprocessing workflow, compares fusion methods, and evaluates interpretability, providing a transferable template for sensor-based activity modeling.
Findings
Late fusion achieves the highest validation accuracy.
Hybrid fusion outperforms early fusion.
RFID signals significantly improve recognition performance.
Abstract
The research introduces a reproducible framework for transforming raw, heterogeneous sensor streams into aligned, semantically meaningful representations for multimodal human activity recognition. Grounded in the Carnegie Mellon University Multi-Modal Activity (CMU-MMAC) database and focused on the naturalistic Subject 07 Brownie session, the study traces the full pipeline from data ingestion to modeling and interpretation. In contrast to black-box preprocessing, a unified preprocessing workflow is proposed that temporally aligns video, audio, and RFID through resampling, grayscale conversion, sliding-window segmentation, and modality-specific normalization, producing standardized fused tensors suitable for downstream learning. Building on this foundation, the work systematically compares early, late, and hybrid fusion strategies using LSTM-based models implemented with PyTorch and…
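To make the preprocessing steps named in the abstract more concrete, the following is a minimal sketch of sliding-window segmentation, per-modality normalization, and feature-level (early) concatenation. It is not the authors' implementation: the function names (`sliding_windows`, `zscore`, `fuse_early`), the choice of z-score normalization, and the window and step sizes are all assumptions made for illustration, and the streams are assumed to have already been resampled to a common rate. It uses only the Python standard library rather than the PyTorch tensors the paper works with.

```python
from statistics import mean, pstdev


def sliding_windows(samples, size, step):
    """Split a sequence of timesteps into fixed-size, possibly overlapping windows."""
    return [samples[i:i + size] for i in range(0, len(samples) - size + 1, step)]


def zscore(samples):
    """Scale each feature column of one modality to zero mean and unit variance.

    Assumption: z-scoring stands in for the paper's modality-specific
    normalization, whose exact form is not given in the abstract.
    """
    columns = list(zip(*samples))
    mus = [mean(c) for c in columns]
    # A constant column has zero deviation; fall back to 1.0 to avoid dividing by zero.
    sigmas = [pstdev(c) or 1.0 for c in columns]
    return [[(v - m) / s for v, m, s in zip(row, mus, sigmas)] for row in samples]


def fuse_early(modalities, size, step):
    """Normalize each modality separately, concatenate features per timestep, then window.

    Assumption: every modality is already resampled to the same rate, so
    index t refers to the same instant in each stream.
    """
    normalized = [zscore(m) for m in modalities]
    length = min(len(m) for m in normalized)
    fused = [sum((m[t] for m in normalized), []) for t in range(length)]
    return sliding_windows(fused, size, step)


# Toy streams (hypothetical values): 10 timesteps each.
video = [[float(t), float(2 * t)] for t in range(10)]            # 2 features
audio = [[float(t % 3)] for t in range(10)]                      # 1 feature
rfid = [[1.0 if t in (4, 5) else 0.0] for t in range(10)]        # 1 binary feature

windows = fuse_early([video, audio, rfid], size=4, step=2)
print(len(windows), len(windows[0]), len(windows[0][0]))
```

Each resulting window is a list of timesteps with the concatenated feature vector of all modalities, which is the shape an LSTM-style sequence model would consume under early fusion. Late fusion would instead window and model each modality separately and combine the per-modality predictions afterwards.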
