# Multimodal image fusion for enhanced vehicle identification in intelligent transport

**Authors:** Naif Al Mudawi, Muhammad Waqas Ahmed, Haifa F. Alhasson, Naif S. Alshassari, Abdulwahab Alazeb, Mohammed Alshehri, Bayan Alabdullah

PMC · DOI: 10.7717/peerj-cs.3270 · 2025-10-30

## TL;DR

This paper introduces a deep learning-based multimodal image fusion method to improve vehicle detection in aerial imagery for intelligent transport systems.

## Contribution

An attention-based depth map generation method, combined with a hybrid HOG and BRISK feature extraction technique, for improved detection of vehicles in aerial imagery.

## Key findings

- The proposed method achieved 98.4% precision on the Roundabout Aerial dataset for vehicle detection.
- Hybrid feature extraction that combines HOG and BRISK descriptors within the Vision Transformer (ViT) framework improved detection performance over existing methods.
- The model outperformed state-of-the-art approaches on three benchmark aerial datasets (Roundabout Aerial, AU-Air, and VAID).
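For readers less familiar with the metric, the precision figures above follow the standard definition: true positives divided by all detections the model made. The sketch below is only an illustration of that formula. The function name and the counts are hypothetical and are not taken from the paper.

```python
def precision(true_positives: int, false_positives: int) -> float:
    """Fraction of reported detections that are correct: TP / (TP + FP)."""
    detections = true_positives + false_positives
    if detections == 0:
        # No detections at all; return 0.0 by convention to avoid dividing by zero.
        return 0.0
    return true_positives / detections


# Hypothetical counts for illustration only (not results from the paper):
# 90 correct vehicle detections and 10 spurious ones.
p = precision(90, 10)
print(f"{p:.1%}")  # 90.0%
```

Precision says nothing about missed vehicles; that is captured by recall, which this sketch does not compute.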

## Abstract

Target detection in remote sensing is essential for applications such as law enforcement, military surveillance, and search-and-rescue. With advancements in computational power, deep learning methods have excelled in processing unimodal aerial imagery. The availability of diverse imaging modalities, including infrared, hyperspectral, multispectral, synthetic aperture radar, and Light Detection and Ranging (LiDAR), allows researchers to leverage complementary data sources. Integrating these multimodal datasets has significantly enhanced detection performance, making these technologies more effective in real-world scenarios. In this work, we propose a novel approach that employs a deep learning-based attention mechanism to generate depth maps from aerial images. These depth maps are fused with RGB images to achieve enhanced feature representation. For image segmentation, we use Markov Random Fields (MRF), and for object detection, we adopt the You Only Look Once (YOLOv4) framework. Furthermore, we introduce a hybrid feature extraction technique that combines Histogram of Oriented Gradients (HOG) and Binary Robust Invariant Scalable Keypoints (BRISK) descriptors within the Vision Transformer (ViT) framework. Finally, a Residual Network with 18 layers (ResNet-18) is used for classification. Our model is evaluated on three benchmark datasets, namely Roundabout Aerial, AU-Air, and the Vehicle Aerial Imagery Dataset (VAID), achieving precision scores of 98.4%, 96.2%, and 97.4%, respectively, for object detection. Experimental results demonstrate that our approach outperforms existing state-of-the-art methods in vehicle detection and classification for aerial imagery.
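The abstract states that generated depth maps are fused with RGB images but does not say which fusion operator is used. As a rough illustration only, the sketch below assumes the simplest option, early fusion by channel concatenation, in which a min-max normalized depth map is appended to the RGB image as a fourth channel. The function name `fuse_rgb_depth`, the 8-bit RGB input, and the random test arrays are all assumptions made for this example and may differ from the authors' method.

```python
import numpy as np


def fuse_rgb_depth(rgb: np.ndarray, depth: np.ndarray) -> np.ndarray:
    """Append a normalized depth map to an 8-bit RGB image as a 4th channel.

    Assumption: early fusion by concatenation; the paper may fuse differently.
    """
    if rgb.ndim != 3 or rgb.shape[2] != 3:
        raise ValueError("rgb must have shape (H, W, 3)")
    if depth.shape != rgb.shape[:2]:
        raise ValueError("depth must have shape (H, W) matching rgb")

    # Scale 8-bit colour values to [0, 1].
    rgb_f = rgb.astype(np.float32) / 255.0

    # Min-max normalize depth to [0, 1]; a constant map becomes all zeros.
    d = depth.astype(np.float32)
    span = d.max() - d.min()
    d_norm = (d - d.min()) / span if span > 0 else np.zeros_like(d)

    # Stack along the channel axis: (H, W, 3) + (H, W, 1) -> (H, W, 4).
    return np.concatenate([rgb_f, d_norm[..., None]], axis=-1)


# Synthetic stand-ins for an aerial RGB tile and its estimated depth map.
rng = np.random.default_rng(0)
rgb = rng.integers(0, 256, size=(4, 6, 3), dtype=np.uint8)
depth = rng.random((4, 6)).astype(np.float32) * 50.0

fused = fuse_rgb_depth(rgb, depth)
print(fused.shape)  # (4, 6, 4)
```

A four-channel tensor like this would require a downstream detector whose first layer accepts four input channels; whether the paper's pipeline does this, or fuses at a later feature level, is not specified in the abstract.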

## Figures

50 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12594218/full.md

---
Source: https://tomesphere.com/paper/PMC12594218