LLaVA-UHD: an LMM Perceiving Any Aspect Ratio and High-Resolution Images

Ruyi Xu; Yuan Yao; Zonghao Guo; Junbo Cui; Zanlin Ni; Chunjiang Ge; Tat-Seng Chua; Zhiyuan Liu; Maosong Sun; Gao Huang

arXiv:2403.11703 · cs.CV · March 19, 2024 · 3 cites

Open Access · 1 Repo · 10 Models

TL;DR

LLaVA-UHD is a large multimodal model that efficiently perceives high-resolution images of any aspect ratio, overcoming the fixed-size, limited-resolution processing of existing models.

Contribution

The paper introduces LLaVA-UHD, which combines an image modularization strategy, a token compression module, and a spatial organization schema to enable high-resolution image understanding at any aspect ratio in multimodal models.

Findings

01

Outperforms established LMMs trained with 2-3 orders of magnitude more data on 9 benchmarks

02

Supports images at 6 times higher resolution (672×1088, versus the 336×336 input of its LLaVA-1.5 base) using only 94% of the inference computation

03

Achieves a 6.4-point accuracy improvement on TextVQA
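
The "6 times larger" figure in finding 02 can be checked with a short calculation. This is a sketch under stated assumptions: the 336×336 base resolution and the 672×1088 target resolution are taken from the paper's full abstract, not from this page, and the variable names are illustrative.

```python
# Assumed resolutions (from the paper's full abstract, not from this listing).
base_w, base_h = 336, 336        # LLaVA-1.5 visual encoder input
target_w, target_h = 672, 1088   # resolution reported for LLaVA-UHD

# Ratio of pixel areas between the two resolutions.
area_ratio = (target_w * target_h) / (base_w * base_h)
print(f"{area_ratio:.2f}")  # prints 6.48
```

Under these assumptions the ratio is roughly 6.5, which matches the rounded "6 times" claim.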

Abstract

Visual encoding constitutes the basis of large multimodal models (LMMs) in understanding the visual world. Conventional LMMs process images in fixed sizes and limited resolutions, while recent explorations in this direction are limited in adaptivity, efficiency, and even correctness. In this work, we first take GPT-4V and LLaVA-1.5 as representative examples and expose systematic flaws rooted in their visual encoding strategy. To address the challenges, we present LLaVA-UHD, a large multimodal model that can efficiently perceive images in any aspect ratio and high resolution. LLaVA-UHD includes three key components: (1) An image modularization strategy that divides native-resolution images into smaller variable-sized slices for efficient and extensible encoding, (2) a compression module that further condenses image tokens from visual encoders, and (3) a spatial schema to organize slice…
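
The image modularization idea in component (1) can be illustrated with a short sketch. This is not the authors' implementation: it is a simplified illustration that assumes a square 336×336 encoder input and a hypothetical cap of 6 slices, and the function names `choose_grid` and `slice_boxes` are invented for this example. It picks a column-by-row grid whose slice shape stays close to the encoder's square input, then computes the pixel boxes of the slices.

```python
import math


def choose_grid(width, height, base=336, max_slices=6):
    """Pick a (cols, rows) grid for an image of the given size.

    `base` and `max_slices` are assumptions made for this sketch.
    """
    # Number of encoder-sized tiles needed to cover the image area.
    ideal = min(max_slices, max(1, math.ceil(width * height / (base * base))))
    log_aspect = math.log(width / height)
    best = None
    # Consider slice counts near the ideal, within [1, max_slices].
    for n in {max(1, ideal - 1), ideal, min(max_slices, ideal + 1)}:
        for cols in range(1, n + 1):
            if n % cols:
                continue  # only exact factorizations n = cols * rows
            rows = n // cols
            # Zero when the grid's aspect ratio equals the image's,
            # which makes each slice square.
            score = abs(log_aspect - math.log(cols / rows))
            key = (score, abs(n - ideal), cols)
            if best is None or key < best[0]:
                best = (key, (cols, rows))
    return best[1]


def slice_boxes(width, height, cols, rows):
    """Return (left, top, right, bottom) pixel boxes in row-major order."""
    boxes = []
    for r in range(rows):
        for c in range(cols):
            boxes.append((
                c * width // cols,
                r * height // rows,
                (c + 1) * width // cols,
                (r + 1) * height // rows,
            ))
    return boxes


cols, rows = choose_grid(672, 1088)
print(cols, rows)                       # 2 3
print(slice_boxes(672, 1088, cols, rows)[0])  # (0, 0, 336, 362)
```

For a 672×1088 image the sketch selects a 2×3 grid, so each slice is about 336×362 pixels and differs only slightly from the encoder's assumed input shape. The row-major order of the boxes is one simple way to preserve slice positions for a downstream spatial schema.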

Code & Models

Repositories

thunlp/llava-uhd (PyTorch, official)

Taxonomy

Topics: Medical Imaging Techniques and Applications · Medical Image Segmentation Techniques · CCD and CMOS Imaging Sensors

Methods: 1-Dimensional Convolutional Neural Networks