A Training-Free Guess What Vision Language Model from Snippets to Open-Vocabulary Object Detection

Guiying Zhu; Bowen Yang; Yin Zhuang; Tong Zhang; Guanqun Wang; Zhihao Che; He Chen; Lianlin Li

arXiv:2601.11910·cs.CV·January 22, 2026

Open Access

TL;DR

This paper introduces GW-VLM, a training-free "Guess What" vision-language model for open-vocabulary object detection. It combines multi-scale visual-language searching with contextual concept prompts and relies only on pre-trained models.

Contribution

The paper proposes a training-free approach to OVOD built on Multi-Scale Visual Language Searching (MS-VLS) and Contextual Concept Prompt (CCP), aiming at universal object understanding using only pre-trained models.

Findings

01 Achieves superior OVOD performance on multiple datasets

02 Operates without any training step

03 Aligns vision and language effectively across multiple scales

Abstract

Open-Vocabulary Object Detection (OVOD) aims to develop the capability to detect anything. Although myriad large-scale pre-training efforts have built versatile foundation models whose impressive zero-shot capabilities facilitate OVOD, the need to form a universal understanding for object cognition on top of already pre-trained foundation models is usually overlooked. Therefore, this paper proposes a training-free Guess What Vision Language Model, called GW-VLM, which forms a universal understanding paradigm based on a carefully designed Multi-Scale Visual Language Searching (MS-VLS) coupled with Contextual Concept Prompt (CCP) for OVOD. This approach engages a pre-trained Vision Language Model (VLM) and a Large Language Model (LLM) in a game of "guess what". Specifically, MS-VLS leverages multi-scale visual-language soft-alignment for VLM to generate…
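
The abstract is cut off before it describes MS-VLS in detail, so the following is only a minimal sketch of what a multi-scale visual-language soft-alignment could look like, not the authors' implementation. It assumes that a region is embedded at several context scales, that each embedding is compared to text embeddings of candidate concepts by cosine similarity, and that the per-scale softmax scores are averaged. The function names (soft_align, multi_scale_soft_align), the temperature value, the averaging rule, and the random stand-in embeddings are all hypothetical.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so dot products become cosine similarities."""
    return x / np.maximum(np.linalg.norm(x, axis=axis, keepdims=True), eps)


def soft_align(region_emb, concept_embs, temperature=0.07):
    """Softmax over cosine similarities between one region and all concepts.

    The temperature is an assumed value, not one taken from the paper.
    """
    sims = l2_normalize(concept_embs) @ l2_normalize(region_emb)
    logits = sims / temperature
    logits = logits - logits.max()  # subtract the max for numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()


def multi_scale_soft_align(region_embs_by_scale, concept_embs, temperature=0.07):
    """Average the per-scale soft alignments (an assumed fusion rule)."""
    per_scale = np.stack(
        [soft_align(e, concept_embs, temperature) for e in region_embs_by_scale]
    )
    return per_scale.mean(axis=0)


# Toy data: random vectors stand in for the outputs of a pre-trained VLM.
rng = np.random.default_rng(0)
concepts = ["airplane", "ship", "vehicle"]
concept_embs = rng.normal(size=(3, 16))

# Pretend the same region was embedded at three context scales and that each
# embedding lies close to the "ship" text embedding.
region_embs_by_scale = [
    concept_embs[1] + 0.1 * rng.normal(size=16) for _ in range(3)
]

scores = multi_scale_soft_align(region_embs_by_scale, concept_embs)
best = concepts[int(np.argmax(scores))]
print(dict(zip(concepts, np.round(scores, 3))), "->", best)
```

In a real pipeline the random vectors would be replaced by image-crop and text embeddings from a pre-trained VLM, and the resulting soft scores could serve as the "snippets" that an LLM is prompted with. That last step is likewise an assumption here, since the source text does not spell it out.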

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Neural Network Applications · Advanced Image and Video Retrieval Techniques