AdaptVision: Dynamic Input Scaling in MLLMs for Versatile Scene Understanding

Yonghui Wang; Wengang Zhou; Hao Feng; Houqiang Li

arXiv:2408.16986·cs.CV·September 2, 2024

TL;DR

AdaptVision introduces a dynamic input scaling method for multimodal large language models. By adjusting the number of visual tokens to each image's resolution and content, it supports efficient and accurate scene understanding across diverse image resolutions and content types.

Contribution

We propose a novel dynamic image partitioning module that adjusts the number of visual tokens according to image size and content, improving versatility and performance in vision-language tasks.
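
The partitioning idea can be illustrated with a minimal sketch. It is not the authors' implementation: the 336-pixel tile side, the 3-tiles-per-side cap, and the function names `choose_grid`, `canvas_size`, and `tile_boxes` are assumptions chosen for illustration so that the largest canvas matches the 1008x1008 maximum listed in the findings. The sketch also scales with image size only; the content-dependent part of the method is not modeled here.

```python
import math

# Assumed values for illustration only (not taken from the paper):
TILE = 336               # hypothetical tile side in pixels; 3 * 336 = 1008
MAX_TILES_PER_SIDE = 3   # caps the canvas at 1008x1008


def choose_grid(width: int, height: int) -> tuple:
    """Return (cols, rows): the tiles needed to cover the image, capped per side."""
    if width <= 0 or height <= 0:
        raise ValueError("image dimensions must be positive")
    cols = min(MAX_TILES_PER_SIDE, math.ceil(width / TILE))
    rows = min(MAX_TILES_PER_SIDE, math.ceil(height / TILE))
    return cols, rows


def canvas_size(width: int, height: int) -> tuple:
    """Return the (width, height) of the canvas the image would be placed on."""
    cols, rows = choose_grid(width, height)
    return cols * TILE, rows * TILE


def tile_boxes(width: int, height: int) -> list:
    """Return (left, upper, right, lower) boxes for each tile, in row-major order."""
    cols, rows = choose_grid(width, height)
    return [
        (c * TILE, r * TILE, (c + 1) * TILE, (r + 1) * TILE)
        for r in range(rows)
        for c in range(cols)
    ]


if __name__ == "__main__":
    for size in [(300, 200), (700, 336), (2000, 2000)]:
        cols, rows = choose_grid(*size)
        print(size, "->", cols, "x", rows, "tiles on canvas", canvas_size(*size))
```

Under these assumptions a small natural image maps to a single tile, while a large or elongated image receives more tiles and therefore more visual tokens. The boxes use the (left, upper, right, lower) order, so they could be passed to an image library's crop routine once the image has been placed on the canvas.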

Findings

01. Effective processing of images up to 1008×1008 resolution.

02. Improved performance on vision-language tasks across natural and text-rich scenes.

03. Mitigation of distortion effects from image resizing.

Abstract

Over the past few years, the advancement of Multimodal Large Language Models (MLLMs) has captured the wide interest of researchers, leading to numerous innovations to enhance MLLMs' comprehension. In this paper, we present AdaptVision, a multimodal large language model specifically designed to dynamically process input images at varying resolutions. We hypothesize that the requisite number of visual tokens for the model is contingent upon both the resolution and content of the input image. Generally, natural images with a lower information density can be effectively interpreted by the model using fewer visual tokens at reduced resolutions. In contrast, images containing textual content, such as documents with rich text, necessitate a higher number of visual tokens for accurate text interpretation due to their higher information density. Building on this insight, we devise a dynamic…

Code & Models

Repositories

harrytea/adaptvision (PyTorch, official)

Taxonomy

Topics: Human Pose and Action Recognition · Anomaly Detection Techniques and Applications · Multimodal Machine Learning Applications