LLM-Optic: Unveiling the Capabilities of Large Language Models for Universal Visual Grounding
Haoyu Zhao, Wenhang Ge, Ying-Cong Chen

TL;DR
LLM-Optic uses large language models (LLMs) and large multimodal models (LMMs) to improve zero-shot visual grounding of complex queries without additional training, achieving state-of-the-art results.
Contribution
Introduces LLM-Optic, a novel framework that enhances visual grounding models with LLMs and LMMs for complex query understanding without extra training.
Findings
Achieves state-of-the-art zero-shot visual grounding performance.
Effectively interprets complex text queries involving multiple objects and spatial relationships.
Does not require additional training or fine-tuning.
Abstract
Visual grounding is an essential tool that links user-provided text queries with query-specific regions within an image. Despite advancements in visual grounding models, their ability to comprehend complex queries remains limited. To overcome this limitation, we introduce LLM-Optic, an innovative method that utilizes Large Language Models (LLMs) as an optical lens to enhance existing visual grounding models in comprehending complex text queries involving intricate text structures, multiple objects, or object spatial relationships, situations that current models struggle with. LLM-Optic first employs an LLM as a Text Grounder to interpret complex text queries and accurately identify objects the user intends to locate. Then a pre-trained visual grounding model is used to generate candidate bounding boxes given the refined query by the Text Grounder. After that, LLM-Optic annotates the…
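The first two stages described in the abstract (a Text Grounder that reduces a complex query to the object the user wants, followed by a pre-trained visual grounding model that proposes candidate boxes for that object) can be sketched as a small pipeline. This is a minimal illustration, not the authors' implementation: `TextGrounder`, `VisualGrounder`, `Candidate`, `propose_candidates`, and the two toy stand-ins are hypothetical names, and a real system would replace the stand-ins with an LLM call and a grounding model. The sketch stops at candidate generation, which is as far as the description above goes.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Tuple

# (x_min, y_min, x_max, y_max) in pixels; the box format is an assumption.
Box = Tuple[float, float, float, float]


@dataclass
class Candidate:
    box: Box
    score: float  # confidence reported by the grounding model


# Hypothetical interfaces for the two stages.
TextGrounder = Callable[[str], str]                      # complex query -> target phrase
VisualGrounder = Callable[[Any, str], List[Candidate]]   # (image, phrase) -> candidates


def propose_candidates(
    image: Any,
    query: str,
    text_grounder: TextGrounder,
    visual_grounder: VisualGrounder,
) -> Tuple[str, List[Candidate]]:
    """Refine the query to its target object, then collect candidate boxes."""
    target = text_grounder(query).strip()
    if not target:
        raise ValueError("text grounder returned an empty target phrase")
    candidates = visual_grounder(image, target)
    # Highest-confidence candidates first.
    return target, sorted(candidates, key=lambda c: c.score, reverse=True)


# Toy stand-ins so the sketch runs without any model (illustrative only).
def toy_text_grounder(query: str) -> str:
    # A real Text Grounder would prompt an LLM; here we pick a known noun.
    return "dog" if "dog" in query else query


def toy_visual_grounder(image: Any, phrase: str) -> List[Candidate]:
    # A real grounder would run a detector on the image; these boxes are made up.
    if phrase == "dog":
        return [
            Candidate((10.0, 20.0, 110.0, 220.0), 0.62),
            Candidate((200.0, 40.0, 320.0, 260.0), 0.81),
        ]
    return []


target, boxes = propose_candidates(
    image=None,
    query="the dog to the left of the child",
    text_grounder=toy_text_grounder,
    visual_grounder=toy_visual_grounder,
)
```

Passing the two stages in as callables keeps the pipeline independent of any particular LLM or grounding model, which matches the training-free, plug-in framing of the method.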
Taxonomy
Topics: Multimodal Machine Learning Applications
