# A Hybrid Millimeter-Wave Radar–Ultrasonic Fusion System for Robust Human Activity Recognition with Attention-Enhanced Deep Learning

**Authors:** Liping Yao, Kwok L. Chung, Luxin Tang, Tao Ye, Shiquan Wang, Pingchuan Xu, Yuhao Bi, Yaowen Wu

PMC · DOI: 10.3390/s26031057 · Sensors (Basel, Switzerland) · 2026-02-06

## TL;DR

A new system combining radar and ultrasound with deep learning accurately recognizes human activities like standing, sitting, walking, and falling without invading privacy.

## Contribution

The hybrid radar-ultrasonic system, built on an Attention-CNN-BiLSTM architecture, achieves 98.6% mean class accuracy in human activity recognition, overcoming single-sensor limitations.

## Key findings

- Fusing mmWave radar and ultrasound with wavelet/STFT features enables recognition of four human behaviors (standing, sitting, walking, falling) with 98.6% mean class accuracy.
- The Attention-CNN-BiLSTM model outperforms single-sensor and traditional deep learning baselines.
- The system is privacy-preserving, lighting-agnostic, and suitable for smart homes and healthcare.

## Abstract

What are the main findings?

Fusing 77 GHz millimeter-wave radar and 40 kHz ultrasonic signals (with wavelet transform for radar and STFT for ultrasound) overcomes the range-vs-accuracy tradeoff of single-sensor systems, enabling 98.6% accurate recognition of four core human behaviors (standing, sitting, walking, falling) in a privacy-preserving, lighting-agnostic manner.

The proposed Attention-CNN-BiLSTM architecture—integrating CNN (local spatial features), BiLSTM (bidirectional temporal dependencies), and attention (salient cue enhancement)—outperforms single-sensor baselines and traditional deep learning models, providing a robust technical solution for contactless human behavior recognition.
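The attention stage can be illustrated with a minimal sketch. It assumes dot-product scoring against a single hypothetical query vector. The paper's actual attention formulation, layer sizes, and training setup are not given in this summary, so `attention_pool`, `query`, and the toy values below are illustrative only.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(hidden_states, query):
    """Collapse T per-time-step feature vectors into one context vector.

    hidden_states: list of T vectors of length D (stand-ins for BiLSTM outputs).
    query: hypothetical scoring vector of length D (learned in a real model).
    """
    # Score each time step by its dot product with the query vector.
    scores = [sum(h_d * q_d for h_d, q_d in zip(h, query)) for h in hidden_states]
    weights = softmax(scores)
    dim = len(hidden_states[0])
    # Weighted sum over time: salient steps contribute most to the context.
    context = [
        sum(w * h[d] for w, h in zip(weights, hidden_states)) for d in range(dim)
    ]
    return context, weights


# Toy sequence: the third time step carries the strongest response.
states = [[0.1, 0.0], [0.2, 0.1], [2.0, 1.5], [0.1, 0.2]]
query = [1.0, 1.0]
context, weights = attention_pool(states, query)
```

In this toy run the largest weight falls on the third time step, which is the behavior "salient cue enhancement" refers to: frames that score highly dominate the pooled representation passed to the classifier.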

What are the implications of the main findings?

The mmWave radar-ultrasonic fusion paradigm provides a privacy-preserving, environment-robust solution for contactless human behavior recognition, with direct implications for advancing smart home monitoring, elderly healthcare, and privacy-sensitive surveillance systems.

The integration of targeted time–frequency feature extraction (wavelet/STFT) and the Attention-CNN-BiLSTM architecture offers a scalable methodological framework for addressing the range-vs-accuracy tradeoff in multi-modal sensing, informing future research on contactless activity recognition.
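As a rough illustration of the STFT side of the time–frequency feature extraction, the sketch below computes a magnitude spectrogram with a Hann window and a direct DFT (chosen for readability, not speed). The frame length, hop size, sampling rate, and test tone are assumptions made for the example. The summary does not state the paper's actual STFT parameters, and a real 40 kHz ultrasonic capture would need a much higher sampling rate.

```python
import cmath
import math


def stft_magnitude(signal, frame_len=64, hop=32):
    """Return one magnitude spectrum (bins 0..frame_len//2) per frame."""
    # Periodic Hann window to reduce spectral leakage between bins.
    window = [
        0.5 - 0.5 * math.cos(2 * math.pi * n / frame_len) for n in range(frame_len)
    ]
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        seg = [signal[start + n] * window[n] for n in range(frame_len)]
        spectrum = []
        for k in range(frame_len // 2 + 1):
            # Direct DFT of the windowed frame at bin k.
            acc = sum(
                seg[n] * cmath.exp(-2j * math.pi * k * n / frame_len)
                for n in range(frame_len)
            )
            spectrum.append(abs(acc))
        frames.append(spectrum)
    return frames


# Hypothetical test tone: 128 Hz sampled at 1024 Hz (bin width 16 Hz).
fs = 1024
tone_hz = 128
signal = [math.sin(2 * math.pi * tone_hz * n / fs) for n in range(256)]
spec = stft_magnitude(signal)
peak_bins = [frame.index(max(frame)) for frame in spec]
```

Each row of `spec` is one time slice, so stacking the rows gives the time–frequency image that a CNN front end could consume. A production pipeline would normally use an FFT-based routine instead of the direct DFT shown here.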

To address the tradeoff between environmental robustness and fine-grained accuracy in single-sensor human behavior recognition, this paper proposes a non-contact system that fuses a 77 GHz millimeter-wave (mmWave) radar with a 40 kHz ultrasonic array. The system leverages radar’s long-range penetration and low-visibility adaptability, paired with ultrasound’s centimeter-level short-range precision and electromagnetic clutter immunity. A synchronized data acquisition platform ensures multi-modal signal consistency, while wavelet transform (for radar) and STFT (for ultrasound) extract complementary time–frequency features. The proposed Attention-CNN-BiLSTM architecture integrates local spatial feature extraction, bidirectional temporal dependency modeling, and salient cue enhancement. Experimental results on 1600 synchronized sequences (four behaviors: standing, sitting, walking, falling) show a 98.6% mean class accuracy with subject-wise generalization, outperforming single-sensor baselines and traditional deep learning models. As a privacy-preserving, lighting-agnostic solution, it offers promising applications in smart homes, healthcare monitoring, and intelligent surveillance, providing a robust technical foundation for contactless behavior recognition.
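To make the reported metric concrete, the sketch below computes mean class accuracy under the assumption that it means the unweighted average of per-class accuracies (macro-averaged recall). The summary does not spell out the paper's exact definition. The labels and predictions are made-up toy data, not results from the study.

```python
from collections import defaultdict


def mean_class_accuracy(y_true, y_pred):
    """Average the per-class accuracies so every class counts equally."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for truth, pred in zip(y_true, y_pred):
        total[truth] += 1
        if truth == pred:
            correct[truth] += 1
    per_class = {label: correct[label] / total[label] for label in total}
    return sum(per_class.values()) / len(per_class), per_class


# Toy labels for the four behaviors; one "sitting" sample is misclassified.
y_true = ["standing", "standing", "sitting", "sitting",
          "walking", "walking", "falling", "falling"]
y_pred = ["standing", "standing", "sitting", "standing",
          "walking", "walking", "falling", "falling"]
mca, per_class = mean_class_accuracy(y_true, y_pred)
```

Averaging per class keeps a rare but critical behavior such as falling from being masked by more frequent ones. For a subject-wise evaluation, the same function would be applied to predictions on subjects held out of training.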

## Full-text entities

- **Species:** Homo sapiens (human, species) [taxon 9606]

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12899820/full.md

## Figures

10 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12899820/full.md

## References

29 references — full list in the complete paper: https://tomesphere.com/paper/PMC12899820/full.md

---
Source: https://tomesphere.com/paper/PMC12899820