Object Detection with Multimodal Large Vision-Language Models: An In-depth Review

Ranjan Sapkota; Manoj Karkee

arXiv:2508.19294·cs.CV·October 1, 2025

TL;DR

This review comprehensively explores how large vision-language models (LVLMs) are transforming object detection by integrating visual and textual data, highlighting recent innovations, challenges, and future prospects in the field.

Contribution

The review systematically analyzes recent LVLM architectures and training methods for object detection, giving a structured overview of advances and identifying future research directions.

Findings

01. LVLMs enhance object detection accuracy and contextual understanding.

02. They demonstrate superior adaptability and real-time performance compared to traditional methods.

03. The review identifies key limitations and proposes potential solutions for LVLM development.

Abstract

The fusion of language and vision in large vision-language models (LVLMs) has revolutionized deep learning-based object detection by enhancing adaptability, contextual reasoning, and generalization beyond traditional architectures. This in-depth review presents a structured exploration of the state of the art in LVLMs, organized through a three-step research review process. First, we discuss how vision-language models (VLMs) perform object detection, describing how they combine natural language processing (NLP) and computer vision (CV) techniques for detection and localization. We then explain the architectural innovations, training paradigms, and output flexibility of recent LVLMs, highlighting how they achieve advanced contextual understanding for object detection. The review thoroughly examines the approaches…
