SNN-Driven Multimodal Human Action Recognition via Sparse Spatial-Temporal Data Fusion
Naichuan Zheng, Hailun Xia, Zeyu Liang, Yuchen Du

TL;DR
This paper introduces a novel SNN-based framework for multimodal human action recognition using event camera and skeleton data, achieving high accuracy and energy efficiency suitable for resource-limited scenarios.
Contribution
The paper presents a new SNN architecture with modality-specific backbones and a discretized information bottleneck for efficient multimodal data fusion, advancing resource-efficient action recognition.
Findings
Achieves higher recognition accuracy than existing SNN-based methods.
Demonstrates significant energy efficiency improvements.
Validates effectiveness on a newly constructed multimodal dataset.
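Energy-efficiency claims for SNNs are commonly backed by an operation-count estimate rather than a hardware measurement. The sketch below shows that style of estimate. It is not taken from this paper: the per-operation energies (4.6 pJ per multiply-accumulate, 0.9 pJ per accumulate, figures often quoted for 45 nm CMOS), the function names, and the example workload are all assumptions chosen for illustration.

```python
# Per-operation energies often assumed for 45 nm CMOS.
# These are illustrative assumptions, not figures reported by the paper.
E_MAC_PJ = 4.6  # 32-bit multiply-accumulate, in picojoules
E_AC_PJ = 0.9   # 32-bit accumulate, in picojoules

PJ_TO_MJ = 1e-9  # 1 pJ = 1e-12 J = 1e-9 mJ


def ann_energy_mj(flops: float) -> float:
    """Estimated energy (mJ) when every operation is a dense MAC."""
    return flops * E_MAC_PJ * PJ_TO_MJ


def snn_energy_mj(flops: float, firing_rate: float, timesteps: int) -> float:
    """Estimated energy (mJ) when only spike-triggered accumulates are paid for.

    The synaptic operation count is approximated as
    flops * firing_rate * timesteps.
    """
    synaptic_ops = flops * firing_rate * timesteps
    return synaptic_ops * E_AC_PJ * PJ_TO_MJ


# Hypothetical workload: 2 GFLOPs, 20% average firing rate, 4 timesteps.
ann = ann_energy_mj(2e9)
snn = snn_energy_mj(2e9, firing_rate=0.2, timesteps=4)
print(f"ANN estimate: {ann:.2f} mJ, SNN estimate: {snn:.2f} mJ")
```

Under this model the saving depends on the product of firing rate and timesteps, so a dense or long-running SNN can lose its advantage over an ANN with a similar FLOP count.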
Abstract
Multimodal human action recognition based on RGB and skeleton data fusion, while effective, is constrained by significant limitations such as high computational complexity, excessive memory consumption, and substantial energy demands, particularly when implemented with Artificial Neural Networks (ANNs). These limitations restrict its applicability in resource-constrained scenarios. To address these challenges, we propose a novel Spiking Neural Network (SNN)-driven framework for multimodal human action recognition, utilizing event camera and skeleton data. Our framework is centered on two key innovations: (1) a novel multimodal SNN architecture that employs distinct backbone networks for each modality, namely an SNN-based Mamba for event camera data and a Spiking Graph Convolutional Network (SGN) for skeleton data, combined with a spiking semantic extraction module to capture deep semantic…
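As background for the spiking components named in the abstract, the basic unit of an SNN can be sketched as a leaky integrate-and-fire (LIF) neuron. This is a generic textbook model, not the authors' implementation; the function name and the parameters `tau`, `v_threshold`, and `v_reset` are assumptions made for illustration.

```python
def lif_run(inputs, tau=2.0, v_threshold=1.0, v_reset=0.0):
    """Simulate one LIF neuron over a sequence of input currents.

    Returns a list of binary spikes (0 or 1), one per timestep.
    """
    v = v_reset
    spikes = []
    for x in inputs:
        # Leaky integration: the membrane potential moves toward the
        # input current at a rate set by the time constant tau.
        v = v + (x - (v - v_reset)) / tau
        if v >= v_threshold:
            spikes.append(1)
            v = v_reset  # hard reset after firing
        else:
            spikes.append(0)
    return spikes


# A constant input of 1.5 makes the neuron fire on every second step.
print(lif_run([1.5, 1.5, 1.5, 1.5]))
# A constant input of 1.0 never reaches the threshold, so the output is silent.
print(lif_run([1.0, 1.0, 1.0]))
```

Because the output is binary and often zero, downstream layers only need to accumulate weights where a spike occurred, which is the source of the sparsity that SNN-based fusion relies on.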
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
1. The paper introduces the first multimodal SNN framework for human action recognition, representing a novel direction in neuromorphic computing. The use of event and skeleton modalities is well-motivated, as both are well-suited to low-power, energy-efficient computation on edge devices. 2. The paper is technically thorough and clearly presented. 3. Achieves state-of-the-art SNN accuracy with drastically reduced energy consumption compared to the ANN baseline.
1. My main concern lies in the degree of technical novelty. Each component (Mamba, SNNs, and the Information Bottleneck) appears to be based on existing techniques, and the overall contribution could be viewed as a careful integration rather than a fundamentally new design. Could the authors clarify what specific aspects of the proposed framework go beyond a modular combination of known components? 2. Although the paper reports improved fusion accuracy and energy efficiency, it provides limite
1. Event/skeleton are both sparse temporal modalities; an SNN‑native fusion is a coherent direction. 2. Module‑wise gains are cleanly reported, and the DIB variants are systematically explored.
1. The highest Xs achieved by your model on NRD/NRD-120 is 85.0/74.6, which is substantially lower than the best-performing ANNs, such as VPN at 93.5/86.3 and MMNet at 94.2/92.9. 2. ANN models operating at the same order of magnitude of computational cost also perform better, e.g., CTR-GCN at 89.9/84.9 with 1.97 G FLOPs and Shift-GCN at 87.8/80.9 with 2.5 G FLOPs. The efficiency gain claim is weak.
--Quality: Strong empirical results, thorough ablation, and theoretical grounding. --Significance: Demonstrates a practical pathway for low-power multimodal recognition on edge devices. --Clarity: The overall pipeline and experimental section are well-structured and described.
Motivation: The introduction does not convincingly establish a strong "why now" or "why this way" for the proposed method. The limitations of prior ANN and SNN works are stated but not used to build a powerful narrative for the current approach. Originality: The architectural innovations (SCM, DIB) feel more like competent engineering integrations of existing ideas (cross-attention, Mamba, IB) into the SNN domain, rather than a fundamental conceptual breakthrough. Presentation: Inconsistent re
1. First work to explore SNN-based multimodal fusion for action recognition, combining event and skeleton modalities. 2. Comprehensive and competitive results. Most of the experiments achieve higher performance than previous works with iso-parameter architectures. Furthermore, the authors conduct extensive ablation studies and analysis. 3. Appendix A rigorously analyzes why classical Gaussian IB fails for SNNs and justifies the DIB formulation with discrete KL divergence and cosine surrogates.
1. Novelty - While the authors claim this as the first SNN-based multimodal action recognition framework, the novelty is questionable. Except for the DIB module, all components are directly adopted from prior works with minimal modification (Spiking Mamba, SGN, etc.). The contribution essentially reduces to replacing activation functions with spiking neurons and introducing DIB. This appears more like an ad-hoc engineering integration rather than a fundamental methodological advance. 2. Pseudo
Taxonomy
Topics: Privacy-Preserving Technologies in Data · Age of Information Optimization · Advanced Memory and Neural Computing
Methods: Mamba: Linear-Time Sequence Modeling with Selective State Spaces · Spiking Neural Networks
