# The evolution of object detection from CNNs to transformers and multi-modal fusion

**Authors:** Zeran Wang, Yuan Chen, Yuhao Gu, Jian Liu, Xudong Zhu, Mianwang He

PMC · DOI: 10.1038/s41598-026-37052-6 · Scientific Reports · 2026-02-05

## TL;DR

This survey compares CNNs and Transformers for object detection, highlights their strengths, and explores multi-modal fusion techniques.

## Contribution

The survey introduces a novel taxonomy of multi-modal fusion strategies and synthesizes recent benchmarks from 2024 to 2025 across diverse datasets.

## Key findings

- Transformers outperform CNNs in global context modeling but lag in real-time performance.
- Multi-modal fusion improves detection accuracy by integrating RGB, LiDAR, and language embeddings.
- Real-time detectors achieve over 100 FPS with competitive accuracy.
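The contrast between local feature extraction and global context modeling can be illustrated with a toy example. The sketch below is not taken from the paper: `conv1d_same` and `self_attention` are hypothetical helper names, the inputs are scalar sequences rather than image features, and the attention uses identity projections (query = key = value) purely for brevity. It only aims to show that a small convolution kernel ignores a change far away in the input, while a self-attention output reacts to it.

```python
import math

def conv1d_same(x, kernel):
    """Zero-padded 1D cross-correlation; output i only sees inputs near i."""
    half = len(kernel) // 2
    out = []
    for i in range(len(x)):
        acc = 0.0
        for k, w in enumerate(kernel):
            j = i + k - half
            if 0 <= j < len(x):  # positions outside the sequence count as zero
                acc += w * x[j]
        out.append(acc)
    return out

def self_attention(x):
    """Toy single-head self-attention on scalars (assumption: query = key = value = x)."""
    out = []
    for q in x:
        scores = [q * k for k in x]               # dot-product scores against every position
        m = max(scores)                           # subtract the max for numerical stability
        weights = [math.exp(s - m) for s in scores]
        total = sum(weights)
        out.append(sum(w * v for w, v in zip(weights, x)) / total)
    return out

# Two sequences that differ only in their last element.
seq_a = [1.0, 0.5, 0.0, 0.0, 0.0]
seq_b = [1.0, 0.5, 0.0, 0.0, 2.0]
kernel = [0.25, 0.5, 0.25]

conv_first_a = conv1d_same(seq_a, kernel)[0]
conv_first_b = conv1d_same(seq_b, kernel)[0]
attn_first_a = self_attention(seq_a)[0]
attn_first_b = self_attention(seq_b)[0]

print("conv first output unchanged:", conv_first_a == conv_first_b)
print("attention first output changed:", attn_first_a != attn_first_b)
```

The same property underlies the efficiency trade-off noted above: each attention output touches every position, so its cost grows with the square of the sequence length, whereas the convolution cost per output is fixed by the kernel size.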

## Abstract

Object detection, a cornerstone of computer vision, aims to localize and classify objects within images. This comprehensive survey reviews modern object detection methods, focusing on two dominant paradigms: Convolutional Neural Networks (CNNs) and Transformer-based architectures. It provides a structured comparison of the two paradigms, highlighting their complementary strengths and trade-offs. CNNs demonstrate advantages in local feature extraction and computational efficiency, whereas Transformers excel at capturing global context through self-attention mechanisms. We also analyze multi-modal fusion techniques that integrate Red-Green-Blue (RGB), Light Detection and Ranging (LiDAR), and language embeddings. Benchmark results from representative models include the following: Real-Time Detection Transformer (RT-DETR) achieves 53.1% mean Average Precision (mAP) at Intersection over Union (IoU) thresholds of 0.5:0.95, You Only Look Once version 8 (YOLOv8) achieves 50.2% mAP at 0.5:0.95, real-time detectors exceed 100 frames per second (FPS) with competitive accuracy, and specialized infrared methods achieve a 92.45% F-measure on the NUAA-SIRST dataset. The work introduces a novel taxonomy of multi-modal fusion strategies, documents field-wide and review-specific limitations, and synthesizes recent 2024 to 2025 benchmarks across diverse datasets. Despite these advances, significant challenges remain in handling scale variation, occlusion effects, and domain adaptation. This survey outlines these persistent obstacles and promising research directions, providing a structured reference for researchers and practitioners.
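The "mAP at 0.5:0.95" figures in the abstract rely on box IoU and a sweep of IoU thresholds. The sketch below assumes the common COCO-style convention of ten thresholds from 0.50 to 0.95 in steps of 0.05; the abstract does not state the step, and the function names `box_iou`, `coco_iou_thresholds`, and `thresholds_passed` are hypothetical. It shows only the matching criterion for a single predicted box, not a full mAP computation, which additionally ranks detections by confidence and averages precision over recall levels and classes.

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)  # zero when the boxes do not overlap
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def coco_iou_thresholds():
    """Assumed sweep: 0.50, 0.55, ..., 0.95 (ten thresholds)."""
    return [round(0.50 + 0.05 * i, 2) for i in range(10)]

def thresholds_passed(pred, gt):
    """Thresholds at which `pred` would count as a match for `gt`."""
    iou = box_iou(pred, gt)
    return [t for t in coco_iou_thresholds() if iou >= t]

# A prediction that covers 80% of the ground-truth box and nothing else.
gt_box = (0, 0, 10, 10)
pred_box = (0, 0, 10, 8)
print(box_iou(pred_box, gt_box))
print(thresholds_passed(pred_box, gt_box))
```

Under this convention, a detector is rewarded for tight localization: a box that passes at 0.5 but fails at 0.75 contributes to the average at the looser thresholds only, which is why mAP at 0.5:0.95 is typically lower than mAP at 0.5 alone.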

## Full-text entities

- **Diseases:** RPN (MESH:D020918), aortic dissection (MESH:D000784), anatomical abnormalities (MESH:D020763)
- **Chemicals:** COCO (-)
- **Species:** Homo sapiens (human, species) [taxon 9606]

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12932818/full.md

## Figures

10 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12932818/full.md

## References

54 references — full list in the complete paper: https://tomesphere.com/paper/PMC12932818/full.md

---
Source: https://tomesphere.com/paper/PMC12932818