LLaVA-UHD: an LMM Perceiving Any Aspect Ratio and High-Resolution Images
Ruyi Xu, Yuan Yao, Zonghao Guo, Junbo Cui, Zanlin Ni, Chunjiang Ge, Tat-Seng Chua, Zhiyuan Liu, Maosong Sun, Gao Huang

TL;DR
LLaVA-UHD is a large multimodal model capable of perceiving images at any aspect ratio and high resolution efficiently, overcoming limitations of fixed-size processing in existing models.
Contribution
The paper introduces LLaVA-UHD, which combines image modularization, token compression, and a spatial organization schema so that multimodal models can understand high-resolution images at flexible aspect ratios.
Findings
- Outperforms larger models on 9 benchmarks
- Supports 6 times larger images with less computation
- Achieves 6.4% improvement on TextVQA
Abstract
Visual encoding constitutes the basis of large multimodal models (LMMs) in understanding the visual world. Conventional LMMs process images in fixed sizes and limited resolutions, while recent explorations in this direction are limited in adaptivity, efficiency, and even correctness. In this work, we first take GPT-4V and LLaVA-1.5 as representative examples and expose systematic flaws rooted in their visual encoding strategy. To address the challenges, we present LLaVA-UHD, a large multimodal model that can efficiently perceive images in any aspect ratio and high resolution. LLaVA-UHD includes three key components: (1) An image modularization strategy that divides native-resolution images into smaller variable-sized slices for efficient and extensible encoding, (2) a compression module that further condenses image tokens from visual encoders, and (3) a spatial schema to organize slice…
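The image modularization idea in component (1) can be illustrated with a small sketch. This is not the authors' implementation: it assumes a square encoder input of 336 pixels, a hypothetical cap of 6 slices, and a simple rule that picks the grid whose slices are closest to a square aspect ratio. The function names `choose_grid` and `slice_boxes` are made up for illustration.

```python
import math


def choose_grid(width, height, patch=336, max_slices=6):
    """Return (cols, rows) whose slices are closest to a square aspect ratio.

    `patch` and `max_slices` are illustrative assumptions, not values
    taken from the abstract.
    """
    # Ideal slice count: image area divided by the encoder input area, capped.
    ideal = min(math.ceil(width * height / (patch * patch)), max_slices)
    best, best_score = (1, 1), float("inf")
    for n in (ideal - 1, ideal, ideal + 1):
        if not 1 <= n <= max_slices:
            continue
        # Try every factorization n = cols * rows.
        for cols in range(1, n + 1):
            if n % cols:
                continue
            rows = n // cols
            # Log-distance of the slice aspect ratio from 1:1.
            score = abs(math.log((width / cols) / (height / rows)))
            if score < best_score:
                best, best_score = (cols, rows), score
    return best


def slice_boxes(width, height, patch=336, max_slices=6):
    """Return (left, top, right, bottom) boxes in row-major order."""
    cols, rows = choose_grid(width, height, patch, max_slices)
    xs = [round(c * width / cols) for c in range(cols + 1)]
    ys = [round(r * height / rows) for r in range(rows + 1)]
    return [
        (xs[c], ys[r], xs[c + 1], ys[r + 1])
        for r in range(rows)
        for c in range(cols)
    ]


if __name__ == "__main__":
    print(choose_grid(672, 1008))
    print(slice_boxes(672, 336))
```

Under these assumptions, a wide image is split into side-by-side slices and a tall one into stacked slices, so each slice stays near the shape the encoder expects instead of being padded or stretched to a fixed size.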
Code & Models
- 🤗 openbmb/MiniCPM-V-4_5 (model) · 106k downloads · ♡ 1076
- 🤗 openbmb/OmniLMM-12B (model) · 127 downloads · ♡ 72
- 🤗 Liuyq1995/test0320 (model) · 2 downloads
- 🤗 openbmb/MiniCPM-V-2 (model) · 78k downloads · ♡ 495
- 🤗 DeclanBracken/MiniCPM-Llama3-V-2.5-Transcriptor (model) · 3 downloads
- 🤗 SwordElucidator/MiniCPM-Llama3-V-2_5 (model) · 5 downloads
- 🤗 DeclanBracken/MiniCPM-Llama3-V-2_5-Transcriptor-V3 (model) · 1 download
- 🤗 seanlong/MiniCPM-Llama3-V-2_5 (model) · 2 downloads
- 🤗 mao1207/MiniCPM-V-2-clone (model) · 5 downloads
- 🤗 compling/MiniCPM-V-2 (model) · 29 downloads · ♡ 1
Taxonomy
Topics: Medical Imaging Techniques and Applications · Medical Image Segmentation Techniques · CCD and CMOS Imaging Sensors
Methods: 1-Dimensional Convolutional Neural Networks
