GPT4RoI: Instruction Tuning Large Language Model on Region-of-Interest
Shilong Zhang, Peize Sun, Shoufa Chen, Min Xiao, Wenqi Shao, Wenwei Zhang, Yu Liu, Kai Chen, Ping Luo

TL;DR
GPT4RoI enhances vision-language models by incorporating region-of-interest references through spatial instruction tuning, enabling interactive, fine-grained multimodal understanding and reasoning, significantly improving performance on visual commonsense tasks.
Contribution
Introduces spatial instruction tuning with region-of-interest references, enabling flexible interaction and detailed attribute reasoning in large language models.
Findings
Achieves 81.6% accuracy on the VCR dataset, surpassing existing models.
Enables interaction via language and drawing bounding boxes.
Models can reason about multiple RoIs using common sense.
Abstract
Visual instruction tuning of large language models (LLMs) on image-text pairs has achieved general-purpose vision-language abilities. However, the lack of region-text pairs limits their advancement toward fine-grained multimodal understanding. In this paper, we propose spatial instruction tuning, which introduces a reference to the region-of-interest (RoI) in the instruction. Before being sent to the LLM, the reference is replaced by RoI features and interleaved with language embeddings as a sequence. Our model GPT4RoI, trained on 7 region-text pair datasets, brings an unprecedented interactive and conversational experience compared to previous image-level models. (1) Interaction beyond language: Users can interact with our model through both language and drawn bounding boxes to flexibly adjust the referring granularity. (2) Versatile multimodal abilities: A variety of attribute information within each…
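The replacement step described in the abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration, not the authors' implementation: the `<regionK>` placeholder syntax, the `embed_token` stand-in for the LLM's embedding lookup, the toy embedding width, and the whitespace tokenizer are all hypothetical. The sketch only shows the idea of swapping each region reference for that region's feature vector so the LLM receives one interleaved sequence.

```python
import re

EMBED_DIM = 4  # toy embedding width (assumption; real models use thousands)


def embed_token(token):
    """Hypothetical stand-in for an LLM embedding lookup.

    Returns a deterministic toy vector derived from the token's characters.
    """
    base = sum(ord(ch) for ch in token)
    return [float((base + i) % 7) for i in range(EMBED_DIM)]


def interleave(instruction, roi_features):
    """Build the input sequence for the LLM.

    Each whitespace-separated token is embedded as language, except tokens
    of the (assumed) form <regionK>, which are replaced by the feature
    vector of region K taken from roi_features.
    """
    sequence = []
    for token in instruction.split():
        match = re.fullmatch(r"<region(\d+)>", token)
        if match:
            feature = roi_features[int(match.group(1))]
            if len(feature) != EMBED_DIM:
                raise ValueError("RoI feature width must match the embedding width")
            sequence.append(feature)
        else:
            sequence.append(embed_token(token))
    return sequence


# Toy RoI features keyed by region index; in a real system these would come
# from a region feature extractor applied to the user's bounding boxes.
rois = {1: [0.1] * EMBED_DIM, 2: [0.9] * EMBED_DIM}
seq = interleave("What is <region1> holding near <region2> ?", rois)
print(len(seq))  # one vector per token, region references included
```

In this sketch the RoI features must already have the same width as the language embeddings, which is why the length check is there; how the features are projected to that width is outside what the abstract states.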
Peer Reviews
Decision: Submitted to ICLR 2024
1. Both qualitative and quantitative results demonstrate that now the model can have a sense of location. 2. The presentation is clear. 3. The figures are easy to read.
1. Which part of the model design leads to positional awareness is unclear. The authors use "five lightweight scale shuffle modules", "ROI Align", "add feature coordinates (Liu et al., 2018) for each level (positional embedding)", and "extract region-level features with the output size of 14×14"; which part really makes the model work? There is no ablation study. 2. Fine-tuning on a specific dataset can lead to the model forgetting all other knowledge. For example, fine-tuning on the multic
- Fine-grained multimodal understanding: GPT4RoI enables region-level alignment and understanding by incorporating references to RoIs in instructions, allowing for more detailed analysis and reasoning. - Interactive user experience: Users can interact with GPT4RoI through both language input and drawing bounding boxes. - GPT4RoI achieves remarkable accuracy on the VCR dataset, surpassing existing models by a significant margin.
- Expanding from image-level to region-level instruction tuning seems like a natural progression, and the approach is straightforward without providing a fresh perspective. Some other papers also explore the region-level large language models [1] but lack the performance comparison. - It appears that while this paper utilized more datasets for training, the improvement in results is relatively marginal, as shown in Table 5. - This work lacks a comparison of parameters. The current models seem
- The paper proposes a novel method for vision-language tasks and provides a detailed methodology for it. - A comprehensive discussion of experiments and results, with figures of good quality and readability. - The benchmark methods are of good quality.
After Rebuttal: Upon re-reviewing the manuscript and checking other fellow reviewers' comments, I have identified several major concerns that I previously overlooked. - My biggest concern: the model is evaluated on the Visual Genome, Visual-7W, and VCR datasets. However, if I understand correctly, the model has been pre-trained on the Visual Genome and VCR datasets. I am therefore concerned that the model's strong performance on these two datasets is a case of overfitting, especially given that
Taxonomy
Topics: Multimodal Machine Learning Applications · Natural Language Processing Techniques · Domain Adaptation and Few-Shot Learning
