LLM-Optic: Unveiling the Capabilities of Large Language Models for Universal Visual Grounding

Haoyu Zhao; Wenhang Ge; Ying-cong Chen

arXiv:2405.17104 · cs.CV · May 29, 2024 · 1 citation

Open Access

TL;DR

LLM-Optic leverages large language and multimodal models to significantly improve zero-shot visual grounding of complex queries without additional training, achieving state-of-the-art results.

Contribution

Introduces LLM-Optic, a novel framework that enhances visual grounding models with LLMs and LMMs for complex query understanding without extra training.

Findings

01. Achieves state-of-the-art zero-shot visual grounding performance.

02. Effectively interprets complex text queries involving multiple objects and spatial relationships.

03. Does not require additional training or fine-tuning.

Abstract

Visual grounding is an essential tool that links user-provided text queries with query-specific regions within an image. Despite advancements in visual grounding models, their ability to comprehend complex queries remains limited. To overcome this limitation, we introduce LLM-Optic, an innovative method that utilizes Large Language Models (LLMs) as an optical lens to enhance existing visual grounding models in comprehending complex text queries involving intricate text structures, multiple objects, or object spatial relationships, situations that current models struggle with. LLM-Optic first employs an LLM as a Text Grounder to interpret complex text queries and accurately identify objects the user intends to locate. Then a pre-trained visual grounding model is used to generate candidate bounding boxes given the refined query by the Text Grounder. After that, LLM-Optic annotates the…
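The first two stages described in the abstract (an LLM Text Grounder that refines the query, then a pre-trained visual grounding model that proposes candidate boxes) can be sketched roughly as below. This is a minimal illustration, not the authors' implementation: `propose_candidates`, `Candidate`, the `(x_min, y_min, x_max, y_max)` box convention, and both toy stand-in functions are hypothetical names and assumptions introduced here. The later stages, which are cut off in the abstract above, are deliberately not sketched.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Assumed box convention for this sketch: (x_min, y_min, x_max, y_max) in pixels.
Box = Tuple[float, float, float, float]


@dataclass
class Candidate:
    """One candidate region returned by the visual grounding model."""
    box: Box
    score: float


def propose_candidates(
    query: str,
    image: object,
    text_grounder: Callable[[str], str],
    visual_grounder: Callable[[object, str], List[Candidate]],
) -> Tuple[str, List[Candidate]]:
    """Run the two stages named in the abstract.

    Stage 1: the text grounder (an LLM in the paper) reduces a complex
             query to the object the user intends to locate.
    Stage 2: the visual grounding model proposes candidate boxes for
             that refined query.
    Both components are passed in as callables, so no specific model
    or API is assumed.
    """
    refined_query = text_grounder(query).strip()
    candidates = visual_grounder(image, refined_query)
    # Sort by descending score so later stages see the strongest boxes first.
    ranked = sorted(candidates, key=lambda c: c.score, reverse=True)
    return refined_query, ranked


# --- Hypothetical stand-ins so the sketch runs without any model ---

def toy_text_grounder(query: str) -> str:
    # Stand-in for an LLM call: picks out the intended target object.
    return "dog" if "dog" in query else query


def toy_visual_grounder(image: object, phrase: str) -> List[Candidate]:
    # Stand-in for a pre-trained detector: returns fixed boxes and scores.
    return [
        Candidate(box=(10.0, 20.0, 110.0, 220.0), score=0.62),
        Candidate(box=(200.0, 40.0, 320.0, 260.0), score=0.81),
    ]


refined, candidates = propose_candidates(
    "the dog to the left of the person holding an umbrella",
    image=None,  # placeholder; a real pipeline would pass image data
    text_grounder=toy_text_grounder,
    visual_grounder=toy_visual_grounder,
)
```

Passing the two models in as callables mirrors the training-free framing in the abstract: either component could be swapped for a different pre-trained model without changing the surrounding logic.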

Taxonomy

Topics: Multimodal Machine Learning Applications