All You Need for Object Detection: From Pixels, Points, and Prompts to Next-Gen Fusion and Multimodal LLMs/VLMs in Autonomous Vehicles

Sayed Pedram Haeri Boroujeni; Niloufar Mehrabi; Hazim Alzorgan; Mahlagha Fazeli; Abolfazl Razi

arXiv:2510.26641·cs.CV·April 13, 2026

TL;DR

This survey analyzes the latest advances in object detection for autonomous vehicles, emphasizing multimodal sensor fusion, emerging AI paradigms like VLMs and LLMs, and future research directions.

Contribution

It provides a comprehensive review of sensor fusion techniques, datasets, and cutting-edge detection methods, focusing on integrating recent AI models in autonomous vehicle perception.

Findings

01

Sensor fusion strategies differ in their capabilities and limitations, particularly in dynamic environments.

02

Emerging transformer-based detection approaches leverage Vision Transformers and multimodal models.

03

The survey identifies open challenges and future opportunities in multimodal perception for AVs.

Abstract

Autonomous Vehicles (AVs) are transforming the future of transportation through advances in intelligent perception, decision-making, and control systems. However, their success is tied to one core capability: reliable object detection in complex and multimodal environments. While recent breakthroughs in Computer Vision (CV) and Artificial Intelligence (AI) have driven remarkable progress, the field still faces a critical challenge, as knowledge remains fragmented across multimodal perception, contextual reasoning, and cooperative intelligence. This survey bridges that gap by delivering a forward-looking analysis of object detection in AVs, emphasizing emerging paradigms such as Vision-Language Models (VLMs), Large Language Models (LLMs), and Generative AI rather than re-examining outdated techniques. We begin by systematically reviewing the fundamental spectrum of AV sensors (camera,…
