LLaVA-UHD v2: an MLLM Integrating High-Resolution Semantic Pyramid via Hierarchical Window Transformer
Yipeng Zhang, Yifan Liu, Zonghao Guo, Yidan Zhang, Xuesong Yang, Xiaoying Zhang, Chi Chen, Jun Song, Bo Zheng, Yuan Yao, Zhiyuan Liu, Tat-Seng Chua, Maosong Sun

TL;DR
LLaVA-UHD v2 introduces a hierarchical window transformer with a semantic pyramid to improve fine-grained visual perception in multimodal large language models, significantly enhancing performance across multiple benchmarks.
Contribution
The paper proposes the Hierarchical window (Hiwin) transformer, a vision-language projector that incorporates a high-resolution semantic pyramid so that an MLLM can integrate visual features at multiple scales, from low-level details to high-level semantics.
Findings
Outperforms baseline models on 14 benchmarks with an average boost of 3.7%.
Achieves a 9.3% improvement on DocVQA.
Demonstrates enhanced fine-grained visual perception capabilities.
Abstract
Vision transformers (ViTs) are widely employed in multimodal large language models (MLLMs) for visual encoding. However, they exhibit inferior performance on tasks regarding fine-grained visual perception. We attribute this to the limitations of ViTs in capturing diverse multi-modal visual levels, such as low-level details. To address this issue, we present LLaVA-UHD v2, an MLLM with advanced perception abilities by introducing a well-designed vision-language projector, the Hierarchical window (Hiwin) transformer. Hiwin transformer enhances MLLM's ability to capture diverse multi-modal visual granularities, by incorporating our constructed high-resolution semantic pyramid. Specifically, Hiwin transformer comprises two key modules: (i) a visual detail injection module, which progressively injects low-level visual details into high-level language-aligned semantics features, thereby…
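The idea of progressively injecting low-level detail into high-level semantic features can be illustrated with a toy sketch. This is not the paper's module, whose internals the abstract does not specify: the function names (`upsample2x`, `inject_detail`, `build_pyramid`), the nearest-neighbour upsampling, the additive fusion, and the `alpha` weight are all assumptions chosen for illustration, and feature maps are reduced to single-channel grids of floats.

```python
from typing import List

Grid = List[List[float]]  # toy single-channel feature map (rows x cols)


def upsample2x(feat: Grid) -> Grid:
    """Nearest-neighbour 2x upsampling: every cell becomes a 2x2 block."""
    out: Grid = []
    for row in feat:
        wide = [v for v in row for _ in range(2)]  # duplicate each column
        out.append(wide)
        out.append(list(wide))  # duplicate the row
    return out


def inject_detail(semantic: Grid, detail: Grid, alpha: float = 0.5) -> Grid:
    """Upsample the semantic map 2x and add a weighted detail map.

    Additive fusion with weight `alpha` is an assumption, not the paper's rule.
    """
    up = upsample2x(semantic)
    if len(up) != len(detail) or len(up[0]) != len(detail[0]):
        raise ValueError("detail map must be exactly 2x the semantic map size")
    return [
        [s + alpha * d for s, d in zip(s_row, d_row)]
        for s_row, d_row in zip(up, detail)
    ]


def build_pyramid(semantic: Grid, details: List[Grid], alpha: float = 0.5) -> List[Grid]:
    """Return levels from coarsest to finest, injecting one detail map per step."""
    levels = [semantic]
    for detail in details:
        levels.append(inject_detail(levels[-1], detail, alpha))
    return levels


# Hypothetical 1x1 semantic map and one 2x2 detail map.
pyramid = build_pyramid([[1.0]], [[[0.0, 2.0], [4.0, 6.0]]], alpha=0.5)
# pyramid[1] -> [[1.0, 2.0], [3.0, 4.0]]
```

Each level doubles the spatial resolution of the previous one, so the finest level keeps the upsampled semantic content while carrying the added detail signal. A real projector would operate on multi-channel tensors with learned fusion rather than a fixed weight.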
Taxonomy
Topics: Computational Physics and Python Applications · Anomaly Detection Techniques and Applications · Time Series Analysis and Forecasting
Methods: Sparse Evolutionary Training
