HiRes-LLaVA: Restoring Fragmentation Input in High-Resolution Large Vision-Language Models
Runhui Huang, Xinpeng Ding, Chunwei Wang, Jianhua Han, Yulong Liu, Hengshuang Zhao, Hang Xu, Lu Hou, Wei Zhang, Xiaodan Liang

TL;DR
HiRes-LLaVA introduces a novel framework that effectively restores fragmented high-resolution visual inputs in vision-language models, improving performance on detailed and position-sensitive tasks without increasing training costs.
Contribution
The paper proposes HiRes-LLaVA, a new method with SliceRestore and Self-Mining components to process high-resolution images without losing contextual information, addressing input fragmentation issues.
Findings
Outperforms existing models on public benchmarks.
Achieves state-of-the-art results on EntityGrid-QA.
Effectively handles document-oriented and position-sensitive tasks.
Abstract
High-resolution inputs enable Large Vision-Language Models (LVLMs) to discern finer visual details, enhancing their comprehension capabilities. To reduce the training and computation costs caused by high-resolution input, one promising direction is to use sliding windows to slice the input into uniform patches, each matching the input size of the well-trained vision encoder. Although efficient, this slicing strategy leads to the fragmentation of original input, i.e., the continuity of contextual information and spatial geometry is lost across patches, adversely affecting performance in cross-patch context perception and position-specific tasks. To overcome these shortcomings, we introduce HiRes-LLaVA, a novel framework designed to efficiently process any size of high-resolution input without altering the original contextual and geometric information. HiRes-LLaVA comprises two innovative…
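The slicing step described in the abstract, and the fragmentation it causes, can be illustrated with a small sketch. This is not the authors' implementation: the function names `slice_boxes` and `slices_touched` are hypothetical, the 224-pixel window is an assumed encoder input size, and the sketch assumes non-overlapping windows over an image padded to a multiple of the window size.

```python
from math import ceil


def slice_boxes(height, width, window):
    """Return (top, left, bottom, right) boxes that tile the image.

    Assumption: windows do not overlap (stride == window), and the image
    is padded up to a multiple of `window` before slicing.
    """
    rows = ceil(height / window)
    cols = ceil(width / window)
    return [
        (r * window, c * window, (r + 1) * window, (c + 1) * window)
        for r in range(rows)
        for c in range(cols)
    ]


def slices_touched(box, window):
    """Return the (row, col) indices of every slice an object box overlaps."""
    top, left, bottom, right = box
    rows = range(top // window, (bottom - 1) // window + 1)
    cols = range(left // window, (right - 1) // window + 1)
    return {(r, c) for r in rows for c in cols}


# A 448x672 image cut into 224x224 slices (assumed encoder input size).
boxes = slice_boxes(448, 672, 224)

# A hypothetical object sitting on the corner shared by four slices.
obj = (200, 200, 250, 250)
touched = slices_touched(obj, 224)
```

Here the object is split across four slices, so a vision encoder that processes each slice independently never sees it whole. This is the cross-patch fragmentation that the abstract describes.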
Peer Reviews
Decision: ICLR 2025 Conference Withdrawn Submission
1. The problem of high-resolution image encoding is very important for Large Vision-Language Models. A new method and benchmark are proposed to improve accuracy and enrich the evaluation datasets, which would be of interest to the LVLM community. 2. The two new modules, the SliceRestore adapter and the Self-Mining Sampler, are both well motivated and clearly designed to improve the vision encoding of cross-patch regions and the training efficiency of visual token compression. 3. The propo
The major concern is the novelty of the proposed modules and the insufficient literature review of similar approaches. (1) The key idea of the SliceRestore adapter is to introduce local and global fusion operations across multiple patches. However, this problem has been well studied in earlier vision-transformer papers, such as Swin Transformer [1], PVT [2], Twins [3], and many others. The key of these works is also how to improve the communication among multiple small patches across multiple layers of feat
1. The idea is sound and the paper is easy to follow. 2. Figures 2 and 3 are helpful for understanding. 3. Resolving the fragmentation problem in the current dynamic slicing strategy for high-resolution input is important and valuable.
1. Table 3 does not provide performance metrics for SRA alone or for SMS + SRA (L). 2. The main experiments are based on LLaVA-224, which is somewhat outdated. It would be better and more convincing to use LLaVA-1.5-336 as the baseline. 3. There is no analysis of computation cost, parameters, or FLOPs. It would be more convincing to provide a comparison of inference time, accuracy, and number of tokens across methods. 4. Please refer to my questions.
1. The proposed self-mining sampler demonstrates higher data efficiency than Q-Former and enhances the model's performance. 2. The newly introduced EntityGrid-QA serves as a valuable synthetic benchmark, offering a more comprehensive evaluation of MLLMs in handling information at the fragmentation boundary. 3. The model achieves state-of-the-art performance on various tasks, including document-related and science-related tasks.
1. My primary concern lies with the contribution of the SliceRestore adapter (SRA), which is intended to address the fragmentation issue. Based on the ablation study, it appears that the main performance improvement stems from the self-mining sampler (SMS) rather than the SRA. This raises uncertainty about the actual utility of the SRA; if it proves ineffective, the fragmentation issue that this paper aims to address remains unresolved. 2. My second concern is the fairness of the comparison bet
1. Theoretically, the method proposed in this paper can handle outputs of any resolution, and it shows fast convergence in training and significant performance improvement, outperforming the baseline on 8 benchmark results. 2. A new and simple benchmark to evaluate the model for handling fragmented inputs is proposed, demonstrating the effectiveness of the SRA module in this paper. 3. The ablation experiments are comprehensive, showing the effectiveness of the proposed method across multiple model
1. The description of the experimental setup is unclear. Section 4.1 mentions a 3-stage training process, but Section 4.3 states that LoRA fine-tuning is applied and also claims that the model is trained from scratch in the comparison with Monkey.
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Medical Imaging and Analysis · Natural Language Processing Techniques
Methods: Adapter · High-resolution input · Convolution · Fragmentation
