Griffon: Spelling out All Object Locations at Any Granularity with Large Language Models

Yufei Zhan; Yousong Zhu; Zhiyang Chen; Fan Yang; Ming Tang; Jinqiao Wang

arXiv:2311.14552 · cs.CV · October 10, 2024 · 1 citation

Open Access · 1 Repo · 1 Dataset

TL;DR

Griffon is a purely LVLM-based model that locates objects at any granularity from free-form text without specialized modules. It reaches state-of-the-art results on RefCOCO and Flickr30K Entities and approaches Faster R-CNN on MSCOCO detection.

Contribution

Introduces Griffon, a novel LVLM-based approach that unifies data formats and is trained end-to-end, enabling precise object localization without additional detection modules or expert models.

Findings

01. Achieves state-of-the-art results on RefCOCO and Flickr30K Entities.

02. Approaches Faster R-CNN performance on MSCOCO detection.

03. Demonstrates that LVLMs have basic object perception capabilities.

Abstract

Replicating the innate human ability to detect all objects based on free-form texts at any granularity remains a formidable challenge for Large Vision Language Models (LVLMs). Current LVLMs are predominantly constrained to locate a single, pre-existing object. This limitation leads to a compromise in model design, necessitating the introduction of visual expert models or customized head structures. Beyond these constraints, our research uncovers LVLMs' capability for basic object perception, allowing them to accurately identify and locate objects of interest. Building on this insight, we introduce a novel Language-prompted Localization Dataset to fully unleash the capabilities of LVLMs in fine-grained object perception and precise location awareness. More importantly, we present Griffon, a purely LVLM-based baseline, which does not introduce any special tokens, expert models, or…

Code & Models

Repositories

jefferyzhan/griffon
PyTorch · Official

Datasets

JefferyZhan/Language-prompted-Localization-Dataset
dataset · 53 downloads

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Image and Video Retrieval Techniques