# A Lightweight Radar–Camera Fusion Deep Learning Model for Human Activity Recognition

**Authors:** Minkyung Jeon, Sungmin Woo

PMC · DOI: 10.3390/s26030894 · Sensors (Basel, Switzerland) · 2026-01-29

## TL;DR

This paper introduces a privacy-preserving model that fuses FMCW radar data with ultra-low-resolution (4×5-pixel) camera frames to recognize human activities in indoor settings.

## Contribution

A lightweight radar–camera fusion model that encodes each modality with its own Transformer encoder, concatenates the two feature vectors, and classifies the result with a small fully connected layer.

## Key findings

- The fusion model achieves 98.74% classification accuracy across 15 activity classes.
- The entire model requires only 11 million floating-point operations (11 MFLOPs), which makes it suitable for embedded and edge devices.
- The fusion model significantly outperforms the radar-only and camera-only baselines.

## Abstract

Human activity recognition in privacy-sensitive indoor environments requires sensing modalities that remain robust under illumination variation and background clutter while preserving user anonymity. To this end, this study proposes a lightweight radar–camera fusion deep learning model that integrates motion signatures from FMCW radar with coarse spatial cues from ultra-low-resolution camera frames. The radar stream is processed as a Range–Doppler–Time cube, where each frame is flattened and sequentially encoded using a Transformer-based temporal model to capture fine-grained micro-Doppler patterns. The visual stream employs a privacy-preserving 4×5-pixel camera input, from which a temporal sequence of difference frames is extracted and modeled with a dedicated camera Transformer encoder. The two modality-specific feature vectors—each representing the temporal dynamics of motion—are concatenated and passed through a lightweight fully connected classifier to predict human activity categories. A multimodal dataset of synchronized radar cubes and ultra-low-resolution camera sequences across 15 activity classes was constructed for evaluation. Experimental results show that the proposed fusion model achieves 98.74% classification accuracy, significantly outperforming single-modality baselines (single-radar and single-camera). Despite its performance, the entire model requires only 11 million floating-point operations (11 MFLOPs), making it highly efficient for deployment on embedded or edge devices.
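
The data flow described in the abstract can be sketched roughly as follows. This is a minimal NumPy illustration with random, untrained weights, not the authors' implementation. Only the 4×5 camera resolution and the 15 classes come from the abstract. The sequence length, the range and Doppler bin counts, the feature width, and the names `TinyEncoder` and `predict` are assumptions made for illustration. A single self-attention layer with mean pooling stands in for each Transformer encoder, positional encoding is omitted, and a single linear layer stands in for the fully connected classifier.

```python
import numpy as np

rng = np.random.default_rng(0)

# Values taken from the abstract.
NUM_CLASSES = 15
CAM_H, CAM_W = 4, 5
# Assumed values (not given in the abstract), chosen only for illustration.
T, RANGE_BINS, DOPPLER_BINS = 16, 8, 8
D_MODEL = 32


def flatten_radar(cube):
    """(T, range, doppler) cube -> (T, range*doppler): one token per frame."""
    return cube.reshape(cube.shape[0], -1)


def difference_frames(frames):
    """(T, H, W) frames -> (T-1, H*W) flattened frame-to-frame differences."""
    diffs = frames[1:] - frames[:-1]
    return diffs.reshape(diffs.shape[0], -1)


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


class TinyEncoder:
    """Hypothetical stand-in for a Transformer encoder:
    input projection, one single-head self-attention layer with a
    residual connection, then mean pooling over time."""

    def __init__(self, in_dim, d_model):
        self.d_model = d_model
        self.w_in = rng.normal(0.0, 1.0 / np.sqrt(in_dim), (in_dim, d_model))
        scale = 1.0 / np.sqrt(d_model)
        self.w_q = rng.normal(0.0, scale, (d_model, d_model))
        self.w_k = rng.normal(0.0, scale, (d_model, d_model))
        self.w_v = rng.normal(0.0, scale, (d_model, d_model))

    def __call__(self, tokens):
        x = tokens @ self.w_in                            # (T, d_model)
        q, k, v = x @ self.w_q, x @ self.w_k, x @ self.w_v
        attn = softmax(q @ k.T / np.sqrt(self.d_model))   # (T, T)
        x = x + attn @ v                                  # residual connection
        return x.mean(axis=0)                             # (d_model,)


radar_encoder = TinyEncoder(RANGE_BINS * DOPPLER_BINS, D_MODEL)
camera_encoder = TinyEncoder(CAM_H * CAM_W, D_MODEL)
w_cls = rng.normal(0.0, 0.1, (2 * D_MODEL, NUM_CLASSES))  # linear classifier


def predict(radar_cube, camera_frames):
    """Encode each modality, concatenate the features, classify."""
    radar_feat = radar_encoder(flatten_radar(radar_cube))
    camera_feat = camera_encoder(difference_frames(camera_frames))
    fused = np.concatenate([radar_feat, camera_feat])     # (2 * d_model,)
    return softmax(fused @ w_cls)                         # (NUM_CLASSES,)


# Random inputs stand in for one synchronized radar/camera sample.
probs = predict(
    rng.normal(size=(T, RANGE_BINS, DOPPLER_BINS)),
    rng.random((T, CAM_H, CAM_W)),
)
```

In this sketch the two modalities interact only at the concatenation step, which matches the late-fusion design the abstract describes: each stream summarizes its own temporal dynamics before the classifier sees them together.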

## Full-text entities

- **Species:** Homo sapiens (human, species) [taxon 9606]

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12899899/full.md

## Figures

10 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12899899/full.md

## References

46 references — full list in the complete paper: https://tomesphere.com/paper/PMC12899899/full.md

---
Source: https://tomesphere.com/paper/PMC12899899