Ferret: Refer and Ground Anything Anywhere at Any Granularity
Haoxuan You, Haotian Zhang, Zhe Gan, Xianzhi Du, Bowen Zhang, Zirui Wang, Liangliang Cao, Shih-Fu Chang, Yinfei Yang

TL;DR
Ferret is a multimodal large language model that understands and grounds spatial references of any shape or granularity within images, advancing multimodal understanding and grounding capabilities.
Contribution
Ferret introduces a hybrid region representation and a spatial-aware visual sampler, enabling flexible region inputs and improved grounding performance in an LLM framework.
Findings
Outperforms existing models in referring and grounding tasks.
Significantly reduces object hallucination.
Enhances multimodal chatting with detailed image descriptions.
Abstract
We introduce Ferret, a new Multimodal Large Language Model (MLLM) capable of understanding spatial referring of any shape or granularity within an image and accurately grounding open-vocabulary descriptions. To unify referring and grounding in the LLM paradigm, Ferret employs a novel and powerful hybrid region representation that integrates discrete coordinates and continuous features jointly to represent a region in the image. To extract the continuous features of versatile regions, we propose a spatial-aware visual sampler, adept at handling varying sparsity across different shapes. Consequently, Ferret can accept diverse region inputs, such as points, bounding boxes, and free-form shapes. To bolster the desired capability of Ferret, we curate GRIT, a comprehensive refer-and-ground instruction tuning dataset including 1.1M samples that contain rich hierarchical spatial knowledge, with…
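The hybrid region representation described in the abstract pairs discrete coordinates with a continuous feature for each region. The sketch below illustrates that pairing only and is not the authors' implementation. The function name `hybrid_region_representation`, the bin count `num_bins`, and the sample count `num_points` are hypothetical. Plain mean pooling over randomly sampled points stands in for the paper's spatial-aware visual sampler, whose details are not given here.

```python
import numpy as np


def hybrid_region_representation(feature_map, mask, num_bins=1000, num_points=32, seed=0):
    """Return (discrete box coordinates, pooled region feature) for one region.

    feature_map: float array of shape (H, W, C), e.g. from an image encoder.
    mask: bool array of shape (H, W) marking the region (point, box, or free-form).
    num_bins, num_points: illustrative assumptions, not values from the paper.
    """
    height, width, _ = feature_map.shape
    ys, xs = np.nonzero(mask)
    if ys.size == 0:
        raise ValueError("mask must contain at least one pixel")

    def quantize(value, size):
        # Map a pixel index in [0, size - 1] to an integer bin in [0, num_bins - 1].
        return int(round(value / max(size - 1, 1) * (num_bins - 1)))

    # Discrete part: the region's bounding box, quantized to integer bins.
    coords = (
        quantize(xs.min(), width),
        quantize(ys.min(), height),
        quantize(xs.max(), width),
        quantize(ys.max(), height),
    )

    # Continuous part: sample points inside the mask and average their features.
    # (Assumption: simple mean pooling replaces the spatial-aware sampler.)
    rng = np.random.default_rng(seed)
    k = min(num_points, ys.size)
    idx = rng.choice(ys.size, size=k, replace=False)
    region_feature = feature_map[ys[idx], xs[idx]].mean(axis=0)

    return coords, region_feature


# Toy usage: a 4x4 feature map with 2 channels and a 2x2 square region.
toy_features = np.arange(4 * 4 * 2, dtype=float).reshape(4, 4, 2)
toy_mask = np.zeros((4, 4), dtype=bool)
toy_mask[1:3, 1:3] = True
toy_coords, toy_feature = hybrid_region_representation(toy_features, toy_mask)
```

Because every region type is first converted to a mask, the same function handles a point (a single-pixel mask), a box, or a free-form shape. That mirrors the flexibility the abstract claims, though the pooling here is far simpler than the paper's sampler.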
Peer Reviews
Decision: ICLR 2024 spotlight
S1. This seems to be one of the first MLLMs to support a variety of visual reference types, such as point, box, scribble, polygons, and masks. S2. The authors provide a curated dataset called GRIT that consists of existing datasets and newly collected data for training MLLMs with visual referring and grounding capabilities. S3. The authors provide a new benchmark, Ferret-Bench, which covers two new types of evaluation task for visual referencing (description and reasoning) in addition to the
W1. The paper omits any discussion of the limitations or potential failure scenarios of the proposed method. W2. The significance of the proposed Spatial-Aware Visual Sampler is minimal. The idea of sampling the visual features over the grid is in the same spirit as the Visual Sampler in SEEM (Zou et al., 2023), although the details of how the point features are aggregated and pooled are different. Performance-wise, the Spatial-Aware Visual Sampler is shown to be only marginally better than th
1. The paper is presented very well. 2. The paper gives a reasonable motivation: humans inherently learn from one task and generalize to another between referring and grounding, which underscores the need to unify the two. 3. The hybrid region representation and spatial-aware visual sampler make the framework flexible enough to take different forms of region definition. 4. The framework shows a good way of utilization of Large Language Model
1. The code and dataset are not open source. I would raise the soundness score if the code and dataset were released, either attached in the supplementary material or published in a public repo. 2. The hierarchy of the dataset is a bit complicated. This may not be practical for a custom dataset. 3. A very engineering-heavy paper with extensive work, but not much scientific novelty.
1. The proposed GRIT dataset is meaningful to vision and language research. 2. The proposed Spatial-Aware Visual Sampler and Hybrid Region Representation are well-motivated. 3. The experimental results show the stronger capabilities of the trained model on multiple referring and grounding tasks and validate the effectiveness of the spatial-aware visual sampler module.
1. The ablation on the hybrid region representation is missing. 2. Not a strong weakness, but whether the model performs well on tasks other than referring and grounding needs more validation (e.g., VQA_v2, MME, general captioning). Also, the caption evaluation does not seem to be as good as InstructBLIP's.
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Speech and dialogue systems
