MG-LLaVA: Towards Multi-Granularity Visual Instruction Tuning
Xiangyu Zhao, Xiangtai Li, Haodong Duan, Haian Huang, Yining Li, Kai Chen, Hua Yang

TL;DR
MG-LLaVA introduces a multi-granularity visual instruction tuning approach for multi-modal large language models, enhancing detailed visual perception by integrating multi-resolution features and object-centric information, trained solely on public data.
Contribution
It proposes a novel multi-granularity vision flow with high-resolution and object-level features, improving visual understanding in large language models.
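As a rough illustration of what an object-level feature could look like, the sketch below averages the cells of a visual feature map that fall inside one detector bounding box. This is only a minimal sketch: the name `pool_box_feature`, the normalized box format, and the use of plain average pooling are assumptions made for illustration, not details confirmed by this listing.

```python
import math
from typing import List, Tuple

FeatureMap = List[List[List[float]]]     # indexed as [row][col][channel]
Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1), normalized to [0, 1]


def pool_box_feature(feature_map: FeatureMap, box: Box) -> List[float]:
    """Average the feature vectors of all grid cells covered by one box.

    Hypothetical helper: plain average pooling is an assumption here.
    """
    height, width = len(feature_map), len(feature_map[0])
    channels = len(feature_map[0][0])
    x0, y0, x1, y1 = box
    # Map normalized coordinates to grid indices; always keep at least one cell.
    col0 = min(int(x0 * width), width - 1)
    row0 = min(int(y0 * height), height - 1)
    col1 = max(col0 + 1, min(math.ceil(x1 * width), width))
    row1 = max(row0 + 1, min(math.ceil(y1 * height), height))
    total = [0.0] * channels
    for r in range(row0, row1):
        for c in range(col0, col1):
            for k in range(channels):
                total[k] += feature_map[r][c][k]
    count = (row1 - row0) * (col1 - col0)
    return [value / count for value in total]


if __name__ == "__main__":
    # A toy 2x2 feature map with a single channel.
    fmap = [[[1.0], [2.0]], [[3.0], [4.0]]]
    print(pool_box_feature(fmap, (0.0, 0.0, 1.0, 1.0)))  # whole image
    print(pool_box_feature(fmap, (0.5, 0.0, 1.0, 0.5)))  # top-right cell only
```

Each pooled vector would then act as one object token alongside the image tokens. How such tokens are projected and ordered before reaching the language model is not covered by this sketch.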
Findings
Outperforms existing models of similar size on multiple benchmarks.
Effectively captures fine-grained visual details and object recognition.
Demonstrates strong perception skills without proprietary data.
Abstract
Multi-modal large language models (MLLMs) have made significant strides in various visual understanding tasks. However, the majority of these models are constrained to process low-resolution images, which limits their effectiveness in perception tasks that necessitate detailed visual information. In our study, we present MG-LLaVA, an innovative MLLM that enhances the model's visual processing capabilities by incorporating a multi-granularity vision flow, which includes low-resolution, high-resolution, and object-centric features. We propose the integration of an additional high-resolution visual encoder to capture fine-grained details, which are then fused with base visual features through a Conv-Gate fusion network. To further refine the model's object recognition abilities, we incorporate object-level features derived from bounding boxes identified by offline detectors. Being trained…
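The abstract names a Conv-Gate fusion network but, as excerpted here, does not give its equations. The sketch below shows one common form of gated fusion for a single token, under stated assumptions: the high-resolution feature is taken to be already aligned to the low-resolution token (the convolutional alignment step is omitted), the gate is a sigmoid over a linear map of the concatenated features, and the output is `low + gate * high`. The names `gated_fuse`, `gate_weights`, and `gate_bias` are hypothetical.

```python
import math
from typing import List

Vector = List[float]


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def gated_fuse(low: Vector, high: Vector,
               gate_weights: List[Vector], gate_bias: Vector) -> Vector:
    """Fuse one low-res token with its (assumed pre-aligned) high-res token.

    gate  = sigmoid(W @ concat(low, high) + b)   # one gate value per channel
    fused = low + gate * high                    # assumed residual form
    """
    joint = low + high  # list concatenation, length 2 * channels
    fused = []
    for i, (l, h) in enumerate(zip(low, high)):
        logit = sum(w * x for w, x in zip(gate_weights[i], joint)) + gate_bias[i]
        fused.append(l + sigmoid(logit) * h)
    return fused


if __name__ == "__main__":
    low, high = [1.0, 2.0], [4.0, 6.0]
    zero_w = [[0.0] * 4 for _ in range(2)]
    # With zero weights and bias every gate is 0.5, so half of `high` is added.
    print(gated_fuse(low, high, zero_w, [0.0, 0.0]))
```

With a strongly negative bias the gate closes and the output falls back to the low-resolution feature, which is the usual motivation for a residual gate: the high-resolution branch can only add detail to the base features, never replace them.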
Peer Reviews
Decision: ICLR 2025 Conference Withdrawn Submission
- The paper is well-organized. The proposed model is evaluated on multiple tasks including general visual understanding benchmarks, VQA, and video datasets. Ablation study and runtime evaluation are also provided.
- The technical contribution of the paper is not very significant. The paper claims the main contribution is combining low, high-resolution, and object-level features. But the design of combining low and high-resolution features mainly comes from mini-Gemini and some modifications on the fusion module are proposed in the paper. The introduction of object-level features requires extra models and makes the base architecture more complex.
- I am not convinced about the necessity of introducing th
1. The key claim for the paper that multi-granularity features with low-res, high-res, and object features can improve detailed understanding and object recognition skills is reasonable. The authors design the conv-gated fusing module and demonstrate its effectiveness through complete ablation studies. 2. The series of models and benchmarks are clear and complete. The authors train the variants for MG-LLaVA based on Phi, Vicuna, LLaMA3, and Yi1.5 and conduct experiments on various multi-modal be
1. The idea of fusing multi-granularity features is not novel, as integrating low-resolution and high-resolution images has been demonstrated to be effective by a range of works, including LLaVA-NeXt, LLaVA-HR, Mini-Gemini, LLaVA-UHD, etc. The difference in MG-LLaVA lies in the usage of detected objects. However, the detection operation introduces extra computational costs and external models with extra information, which is not an optimal solution. 2. The performance comparisons against existing MLLMs
1. The integration and fusion of multi-granularity features with object-centric features is novel for MLLMs. 2. Experimental results demonstrate the effectiveness of the proposed pipeline. 3. The paper is well-written and clearly presented.
1. The performance improvement on similarly sized LLMs in Table 2 and Table 3 appears modest. 2. The ablation study would benefit from visual comparisons to illustrate the impact of each component, such as case studies or visualizations of feature-level effects. 3. Some failure cases should be shown to provide insights into the method’s limitations. 4. It is unclear if the method can handle larger images, such as 1024p or 2k resolutions.
The goal of this paper is to unlock the power of MLLMs on fine-grained tasks. A high-resolution visual encoder is introduced to compensate for the limitations of previous work, and fusion and compression strategies are introduced to ease the computational burden. In addition, the article demonstrates that the new framework achieves significantly higher scores with MLLMs at several scales, which demonstrates the effectiveness of the method. Moreover, this is the first approach to i
1. As mentioned in the article itself, introducing multi-granularity and multi-scale features to enhance model performance is a common approach in convolutional networks, and merely migrating this approach to the field of MLLMs is hardly an innovative contribution. The object detection algorithms used in the article only provide some information enhancement on the input side, while many MLLMs can already accomplish the object detection task by themselves. 2. The scores achieved on
1. The structure of the paper is simple and easy to read, and the model implementation is very easy to follow. 2. The idea is very straightforward, and the experiments are solid. It is reasonable to introduce multi-granularity object-level features to enhance the perceptual capabilities of Multimodal Large Language Models (MLLMs).
1. The idea appears incremental, as it simply integrates high-resolution image interpretation with region-level image understanding, resembling a trick. 2. Experimental evaluations and fair comparisons are notably lacking. Given that multi-granularity features are utilized to augment the model's perceptual abilities, evaluations should be conducted on fine-grained perception datasets. General VQA is inadequate for assessing the fine-grained perceptual capabilities of MLLMs. 3. Excessive reliance
Taxonomy
Topics: Video Analysis and Summarization
Methods: Balanced Selection
