MMTSA: Multimodal Temporal Segment Attention Network for Efficient Human Activity Recognition
Ziqi Gao, Yuntao Wang, Jianguo Chen, Junliang Xing, Shwetak Patel, Xin Liu, Yuanchun Shi

TL;DR
This paper introduces MMTSA, an efficient multimodal neural network for human activity recognition that fuses RGB and IMU data, achieving higher accuracy and lower computational cost on public datasets.
Contribution
The paper presents MMTSA, a novel neural architecture that transforms IMU data into images, employs sparse sampling, and uses inter-segment attention for improved multimodal fusion in HAR.
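The sparse sampling idea can be sketched as follows. The summary here does not spell out the sampling procedure, so this sketch assumes a segment-based scheme: split a window into equal segments and draw one index from each, so every modality is represented by a few snippets instead of the full stream. The function name `sample_segment_indices` and its parameters are hypothetical and are not taken from the paper.

```python
import random


def sample_segment_indices(num_items, num_segments, rng=None):
    """Pick one index from each of `num_segments` contiguous segments.

    Assumption: segment-based sparse sampling; the paper's exact
    procedure may differ from this sketch.
    """
    if num_segments <= 0 or num_items < num_segments:
        raise ValueError("need at least one item per segment")
    rng = rng or random.Random()
    # Segment k covers indices bounds[k] .. bounds[k + 1] - 1.
    bounds = [k * num_items // num_segments for k in range(num_segments + 1)]
    return [rng.randrange(bounds[k], bounds[k + 1]) for k in range(num_segments)]


# Example: 30 video frames reduced to 3 frames, one per segment.
frame_indices = sample_segment_indices(30, 3, random.Random(0))
```

Applying the same call to the IMU stream, with `num_items` set to its own length, would yield one IMU snippet per segment alongside the matching video frame.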
Findings
Achieved an 11.13% higher cross-subject F1-score on the MMAct dataset.
Demonstrated superior efficiency, with lower computational load and latency.
Demonstrated the effectiveness of multimodal fusion and sparse sampling in HAR.
Abstract
Multimodal sensors provide complementary information to develop accurate machine-learning methods for human activity recognition (HAR), but introduce significantly higher computational load, which reduces efficiency. This paper proposes an efficient multimodal neural architecture for HAR using an RGB camera and inertial measurement units (IMUs) called Multimodal Temporal Segment Attention Network (MMTSA). MMTSA first transforms IMU sensor data into a temporal and structure-preserving gray-scale image using the Gramian Angular Field (GAF), representing the inherent properties of human activities. MMTSA then applies a multimodal sparse sampling method to reduce data redundancy. Lastly, MMTSA adopts an inter-segment attention module for efficient multimodal fusion. Using three well-established public datasets, we evaluated MMTSA's effectiveness and efficiency in HAR. Results show that our…
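The Gramian Angular Field step mentioned in the abstract can be illustrated with a minimal sketch. It uses the summation variant of GAF: rescale a series to [-1, 1], take the arccosine of each value as an angle, and fill an image with the cosine of pairwise angle sums. Whether MMTSA uses the summation or difference variant, and how it maps values to gray levels, is not stated in this summary, so both choices below are assumptions; the function names are hypothetical.

```python
import numpy as np


def gramian_angular_field(series):
    """Return the Gramian Angular Summation Field of a 1-D series.

    Assumption: summation variant, G[i, j] = cos(phi_i + phi_j).
    """
    x = np.asarray(series, dtype=np.float64)
    lo, hi = x.min(), x.max()
    if hi == lo:
        # A constant series carries no shape information; map it to zeros.
        scaled = np.zeros_like(x)
    else:
        scaled = 2.0 * (x - lo) / (hi - lo) - 1.0
    # Guard against rounding drift outside the arccos domain.
    scaled = np.clip(scaled, -1.0, 1.0)
    phi = np.arccos(scaled)
    return np.cos(phi[:, None] + phi[None, :])


def to_grayscale(gaf):
    """Map GAF values in [-1, 1] to 8-bit gray levels (an assumed mapping)."""
    return np.round((gaf + 1.0) * 127.5).astype(np.uint8)


# Example: one axis of an accelerometer window (made-up values).
window = [0.0, 1.0, 2.0]
image = to_grayscale(gramian_angular_field(window))
```

A series of length N yields an N x N image whose diagonal encodes the original values and whose off-diagonal entries encode pairwise temporal relations, which is why the representation is described as temporal and structure-preserving.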
Taxonomy
Topics: Context-Aware Activity Recognition Systems · Human Pose and Action Recognition · Advanced Technologies in Various Fields
