Intelligent Image Search Algorithms Fusing Visual Large Models

Kehan Wang; Tingqiong Cui; Yang Zhang; Yu Chen; Shifeng Wu; Zhenzhang Li

arXiv:2511.19920·cs.CV·November 26, 2025

TL;DR

This paper introduces DetVLM, a two-stage image retrieval framework that combines a YOLO object detector with Visual Large Models (VLMs) to support zero-shot and state-specific image search, reporting 94.82% retrieval accuracy on a vehicle component dataset.

Contribution

The paper presents a novel two-stage framework that fuses YOLO detection with VLMs for enhanced fine-grained image retrieval, including zero-shot and state-specific search capabilities.

Findings

01  Achieves 94.82% retrieval accuracy on a vehicle component dataset
02  Attains 94.95% accuracy in zero-shot driver mask detection
03  Exceeds 90% accuracy in state-specific search tasks

Abstract

Fine-grained image retrieval, which aims to find images containing specific object components and assess their detailed states, is critical in fields like security and industrial inspection. However, conventional methods face significant limitations: manual features (e.g., SIFT) lack robustness; deep learning-based detectors (e.g., YOLO) can identify component presence but cannot perform state-specific retrieval or zero-shot search; Visual Large Models (VLMs) offer semantic and zero-shot capabilities but suffer from poor spatial grounding and high computational cost, making them inefficient for direct retrieval. To bridge these gaps, this paper proposes DetVLM, a novel intelligent image search framework that synergistically fuses object detection with VLMs. The framework pioneers a search-enhancement paradigm via a two-stage pipeline: a YOLO detector first conducts efficient,…
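
The two-stage idea described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract is truncated before the second stage is described, so the sketch assumes that a cheap detector filters the image pool first and that the costlier VLM is queried only on the surviving candidates. All names here (`Detection`, `two_stage_search`, `fake_detect`, `fake_vlm_judge`, `min_conf`) and the toy data are hypothetical stand-ins, and no real YOLO or VLM library is called.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List


@dataclass(frozen=True)
class Detection:
    """One detector output: a class label and a confidence score."""
    label: str
    confidence: float


# Hypothetical interfaces (assumptions, not taken from the paper):
# a detector maps an image id to its detections, and a VLM judge
# answers a yes/no question about an image.
Detector = Callable[[str], List[Detection]]
VlmJudge = Callable[[str, str], bool]


def two_stage_search(
    image_ids: Iterable[str],
    component: str,
    state_question: str,
    detect: Detector,
    vlm_judge: VlmJudge,
    min_conf: float = 0.5,
) -> List[str]:
    """Return ids of images that contain `component` and pass the VLM check."""
    matches = []
    for image_id in image_ids:
        # Stage 1: cheap filter. Keep the image only if the detector
        # reports the component with enough confidence.
        found = any(
            d.label == component and d.confidence >= min_conf
            for d in detect(image_id)
        )
        if not found:
            continue
        # Stage 2: the expensive VLM call runs only on stage-1 candidates.
        if vlm_judge(image_id, state_question):
            matches.append(image_id)
    return matches


# Toy stand-ins for a real detector and a real VLM (illustrative data only).
FAKE_DETECTIONS: Dict[str, List[Detection]] = {
    "img_a": [Detection("wheel", 0.91)],
    "img_b": [Detection("door", 0.88)],
    "img_c": [Detection("wheel", 0.30)],
    "img_d": [Detection("wheel", 0.77)],
}
FAKE_VLM_ANSWERS: Dict[str, bool] = {"img_a": True, "img_d": False}

vlm_calls: List[str] = []  # records which images reached stage 2


def fake_detect(image_id: str) -> List[Detection]:
    return FAKE_DETECTIONS.get(image_id, [])


def fake_vlm_judge(image_id: str, question: str) -> bool:
    vlm_calls.append(image_id)
    return FAKE_VLM_ANSWERS.get(image_id, False)


matches = two_stage_search(
    ["img_a", "img_b", "img_c", "img_d"],
    component="wheel",
    state_question="Is the wheel damaged?",
    detect=fake_detect,
    vlm_judge=fake_vlm_judge,
)
```

In this toy run only `img_a` and `img_d` reach the VLM stage, which is the point of the design under the stated assumption: the detector bounds how many VLM calls a query costs, while the VLM supplies the state judgment a detector alone cannot give.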

Taxonomy

Topics: Advanced Image and Video Retrieval Techniques · Advanced Neural Network Applications · Multimodal Machine Learning Applications