GranViT: A Fine-Grained Vision Model With Autoregressive Perception For MLLMs

Guanghao Zheng; Bowen Shi; Mingxing Xu; Ruoyu Sun; Peisen Zhao; Zhibo Zhang; Wenrui Dai; Junni Zou; Hongkai Xiong; Xiaopeng Zhang; Qi Tian

arXiv:2510.21501 · cs.CV · October 27, 2025

TL;DR

GranViT introduces a fine-grained vision transformer trained on a large annotated dataset, enhancing regional perception and reasoning in multi-modal large language models, leading to state-of-the-art results in various vision-language tasks.

Contribution

The paper presents GranViT, a novel vision transformer with a large-scale fine-grained pretraining dataset and a new training framework that improves regional perception in MLLMs.

Findings

01 Outperforms existing vision encoders in fine-grained recognition.

02 Achieves state-of-the-art results in multimodal VQA.

03 Enhances OCR understanding and localization capabilities.

Abstract

Vision encoders are indispensable for enabling the impressive performance of Multi-modal Large Language Models (MLLMs) in vision-language tasks such as visual question answering and reasoning. However, existing vision encoders focus on global image representations and overlook fine-grained regional analysis. Their fine-grained perception is limited by the scarcity of fine-grained annotated data and the lack of a fine-grained pre-training paradigm. In this paper, we propose GranViT, a novel Vision Transformer that integrates fine-grained feature extraction with semantic alignment to Large Language Models (LLMs) via region-level autoregressive training. We first construct Gran-29M, a dataset comprising 2 million natural and OCR images paired with over 180 million high-quality region-level annotations, to enable large-scale fine-grained pre-training. Consequently, we develop a…

Taxonomy

Topics: Multimodal Machine Learning Applications · Topic Modeling · Domain Adaptation and Few-Shot Learning