GPT4RoI: Instruction Tuning Large Language Model on Region-of-Interest

Shilong Zhang; Peize Sun; Shoufa Chen; Min Xiao; Wenqi Shao; Wenwei Zhang; Yu Liu; Kai Chen; Ping Luo

arXiv:2307.03601·cs.CV·June 13, 2025·32 cites

TL;DR

GPT4RoI enhances vision-language models by incorporating region-of-interest references through spatial instruction tuning, enabling interactive, fine-grained multimodal understanding and reasoning, significantly improving performance on visual commonsense tasks.

Contribution

Introduces spatial instruction tuning with region-of-interest references, enabling flexible interaction and detailed attribute reasoning in large language models.

Findings

01

Achieves 81.6% accuracy on the VCR dataset, surpassing existing models.

02

Enables interaction via language and drawing bounding boxes.

03

Models can reason about multiple RoIs using common sense.

Abstract

Visual instruction tuning of large language models (LLMs) on image-text pairs has achieved general-purpose vision-language abilities. However, the lack of region-text pairs limits their advancement toward fine-grained multimodal understanding. In this paper, we propose spatial instruction tuning, which introduces references to the region-of-interest (RoI) in the instruction. Before being sent to the LLM, each reference is replaced by RoI features and interleaved with language embeddings as a sequence. Our model, GPT4RoI, trained on 7 region-text pair datasets, brings an unprecedented interactive and conversational experience compared to previous image-level models. (1) Interaction beyond language: Users can interact with our model through both language and drawing bounding boxes to flexibly adjust the referring granularity. (2) Versatile multimodal abilities: A variety of attribute information within each…
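
The replacement step described in the abstract can be sketched roughly as follows. This is a toy illustration under stated assumptions, not the authors' implementation: the placeholder name `<region1>`, the function `interleave_roi_features`, and the 2-dimensional vectors are all hypothetical, and in a real model the RoI features would presumably come from a visual encoder and match the LLM's embedding size.

```python
from typing import Dict, List, Sequence

Vector = List[float]


def interleave_roi_features(
    tokens: Sequence[str],
    token_embeddings: Dict[str, Vector],
    roi_features: Dict[str, Vector],
) -> List[Vector]:
    """Build the input sequence, swapping each region reference for its RoI feature.

    Hypothetical helper: a token found in `roi_features` is treated as a
    region reference; every other token uses its language embedding.
    """
    sequence: List[Vector] = []
    for token in tokens:
        if token in roi_features:
            # Region reference: insert the region feature in place of the token.
            sequence.append(roi_features[token])
        else:
            # Ordinary word: use its language embedding.
            sequence.append(token_embeddings[token])
    return sequence


# Toy instruction containing one region reference (all values are made up).
tokens = ["what", "is", "<region1>", "doing", "?"]
token_embeddings = {
    "what": [0.1, 0.2],
    "is": [0.3, 0.4],
    "doing": [0.5, 0.6],
    "?": [0.7, 0.8],
}
roi_features = {"<region1>": [9.0, 9.0]}

sequence = interleave_roi_features(tokens, token_embeddings, roi_features)
```

The resulting `sequence` has one vector per input token, with the region's feature sitting at the position where the reference appeared in the instruction, so word order around the reference is preserved.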

Peer Reviews

Decision·Submitted to ICLR 2024

Reviewer 01 · Rating 5 · marginally below the acceptance threshold · Confidence 4

Strengths

1. Both qualitative and quantitative results demonstrate that the model now has a sense of location.
2. The presentation is clear.
3. The figures are easy to read.

Weaknesses

1. It is unclear which part of the model design leads to positional awareness. The authors have "five lightweight scale shuffle modules", "ROI Align", "add feature coordinates (Liu et al., 2018) for each level (positional embedding)", and "extract region-level features with the output size of 14×14"; which part really makes the model work? There is no ablation study.
2. Fine-tuning on a specific dataset can cause the model to forget all other knowledge. For example, fine-tuning on the multic

Reviewer 02 · Rating 3 · reject, not good enough · Confidence 3

Strengths

- Fine-grained multimodal understanding: GPT4RoI enables region-level alignment and understanding by incorporating references to RoIs in instructions, allowing for more detailed analysis and reasoning.
- Interactive user experience: Users can interact with GPT4RoI through both language input and drawing bounding boxes.
- GPT4RoI achieves remarkable accuracy on the VCR dataset, surpassing existing models by a significant margin.

Weaknesses

- Expanding from image-level to region-level instruction tuning seems like a natural progression, and the approach is straightforward without providing a fresh perspective. Other papers also explore region-level large language models [1], but the performance comparison is lacking.
- Although this paper uses more datasets for training, the improvement in results appears relatively marginal, as shown in Table 5.
- This work lacks a comparison of parameters. The current models seem

Reviewer 03 · Rating 6 · marginally above the acceptance threshold · Confidence 4

Strengths

- The paper proposes a novel method for vision-language tasks and provides a detailed methodology for it.
- A comprehensive discussion of experiments and results, with figures of good quality and readability.
- The benchmark methods are of good quality.

Weaknesses

After Rebuttal: Upon re-reviewing the manuscript and checking other fellow reviewers' comments, I have identified several major concerns that I previously overlooked.
- My biggest concern: the model is evaluated on the Visual Genome, Visual-7W, and VCR datasets. However, if I understand correctly, the model has been pre-trained on the Visual Genome and VCR datasets. I am therefore concerned that the model's strong performance on these two datasets is a case of overfitting, especially given that

Taxonomy

Topics: Multimodal Machine Learning Applications · Natural Language Processing Techniques · Domain Adaptation and Few-Shot Learning