Intelligent Image Search Algorithms Fusing Visual Large Models
Kehan Wang, Tingqiong Cui, Yang Zhang, Yu Chen, Shifeng Wu, Zhenzhang Li

TL;DR
This paper introduces DetVLM, a two-stage image retrieval framework combining object detection and visual large models to enable accurate, zero-shot, and state-specific image search, significantly improving retrieval accuracy in fine-grained tasks.
Contribution
The paper presents a novel two-stage framework that fuses YOLO detection with VLMs for enhanced fine-grained image retrieval, including zero-shot and state-specific search capabilities.
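To make the two-stage idea concrete, here is a minimal sketch of one plausible reading of such a pipeline. It is not the authors' implementation. It assumes that a detector pass narrows the image set to candidates containing the queried component, and that a VLM is then asked a yes/no question about the component's state on those candidates only. The names `Query`, `two_stage_search`, `fake_detect`, and `fake_ask_vlm` are hypothetical, and the toy lookup tables stand in for real models.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Optional, Set


@dataclass(frozen=True)
class Query:
    component: str               # e.g. "headlight"
    state: Optional[str] = None  # e.g. "damaged"; None means presence-only search


# Hypothetical interfaces, not taken from the paper or from any library.
Detector = Callable[[str], Set[str]]   # image id -> detected component labels
VlmJudge = Callable[[str, str], bool]  # (image id, yes/no question) -> answer


def two_stage_search(
    images: Iterable[str],
    query: Query,
    detect: Detector,
    ask_vlm: VlmJudge,
) -> List[str]:
    """Return ids of images matching the query (assumed two-stage flow)."""
    # Stage 1 (assumed): a cheap detector pass keeps only images that
    # contain the queried component.
    candidates = [img for img in images if query.component in detect(img)]
    if query.state is None:
        return candidates
    # Stage 2 (assumed): the costlier VLM checks the state on candidates only.
    question = f"Is the {query.component} in this image {query.state}?"
    return [img for img in candidates if ask_vlm(img, question)]


# Toy stand-ins so the sketch runs without any model.
_FAKE_DETECTIONS = {
    "a.jpg": {"headlight", "wheel"},
    "b.jpg": {"wheel"},
    "c.jpg": {"headlight"},
}
_FAKE_STATES = {"a.jpg": "damaged", "c.jpg": "intact"}


def fake_detect(image_id: str) -> Set[str]:
    return _FAKE_DETECTIONS.get(image_id, set())


def fake_ask_vlm(image_id: str, question: str) -> bool:
    # Answers "yes" only when the image's recorded state appears in the question.
    state = _FAKE_STATES.get(image_id)
    return state is not None and state in question
```

Under these assumptions the VLM is called only on images that pass the detector stage, which is where the efficiency gain of such a design would come from.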
Findings
Achieves 94.82% retrieval accuracy on a vehicle component dataset
Attains 94.95% accuracy in zero-shot driver mask detection
Exceeds 90% accuracy in state-specific search tasks
Abstract
Fine-grained image retrieval, which aims to find images containing specific object components and assess their detailed states, is critical in fields like security and industrial inspection. However, conventional methods face significant limitations: manual features (e.g., SIFT) lack robustness; deep learning-based detectors (e.g., YOLO) can identify component presence but cannot perform state-specific retrieval or zero-shot search; Visual Large Models (VLMs) offer semantic and zero-shot capabilities but suffer from poor spatial grounding and high computational cost, making them inefficient for direct retrieval. To bridge these gaps, this paper proposes DetVLM, a novel intelligent image search framework that synergistically fuses object detection with VLMs. The framework pioneers a search-enhancement paradigm via a two-stage pipeline: a YOLO detector first conducts efficient,…
Taxonomy
Topics: Advanced Image and Video Retrieval Techniques · Advanced Neural Network Applications · Multimodal Machine Learning Applications
