Dynamic Granularity Matters: Rethinking Vision Transformers Beyond Fixed Patch Splitting

Qiyang Yu; Yu Fang; Tianrui Li; Xuemei Cao; Yan Chen; Jianghao Li; Fan Min

arXiv:2511.19021·cs.CV·November 25, 2025

Open Access

TL;DR

Grc-ViT introduces a dynamic, adaptive approach to vision transformers that adjusts visual granularity based on image complexity, improving fine-grained detail capture and efficiency.

Contribution

It proposes a novel coarse-to-fine framework with learnable parameters for adaptive granularity, surpassing fixed patch methods in accuracy and efficiency.

Findings

01. Improves fine-grained discrimination in vision transformers.

02. Achieves better accuracy-efficiency trade-off than fixed patch methods.

03. Demonstrates effectiveness across multiple image complexity scenarios.

Abstract

Vision Transformers (ViTs) have demonstrated strong capabilities in capturing global dependencies but often struggle to efficiently represent fine-grained local details. Existing multi-scale approaches alleviate this issue by integrating hierarchical or hybrid features; however, they rely on fixed patch sizes and introduce redundant computation. To address these limitations, we propose the Granularity-driven Vision Transformer (Grc-ViT), a dynamic coarse-to-fine framework that adaptively adjusts visual granularity based on image complexity. It comprises two key stages: (1) a Coarse Granularity Evaluation module, which assesses visual complexity using edge density, entropy, and frequency-domain cues to estimate suitable patch and window sizes; (2) a Fine-grained Refinement module, which refines attention computation according to the selected granularity, enabling efficient and precise feature…
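To make the coarse evaluation stage concrete, the following is a minimal sketch of how the three cues named in the abstract (edge density, entropy, and a frequency-domain cue) could be turned into a patch-size choice. It is not the authors' implementation: the abstract is truncated and does not specify the formulas, and the paper describes learnable parameters, whereas this sketch uses fixed ones. The function names, the equal weighting of the cues, the thresholds `(0.2, 0.5)`, the candidate sizes `(32, 16, 8)`, the gradient threshold, and the frequency cutoff are all assumptions chosen for illustration. Input is assumed to be a 2-D grayscale array with values in [0, 1].

```python
import numpy as np


def edge_density(img, thresh=0.1):
    """Fraction of pixels whose finite-difference gradient magnitude exceeds thresh."""
    gy, gx = np.gradient(img.astype(float))
    return float((np.hypot(gx, gy) > thresh).mean())


def normalized_entropy(img, bins=256):
    """Shannon entropy of the intensity histogram, scaled to [0, 1]."""
    hist, _ = np.histogram(img, bins=bins, range=(0.0, 1.0))
    p = hist[hist > 0] / hist.sum()
    return float(-(p * np.log2(p)).sum() / np.log2(bins))


def high_freq_ratio(img, cutoff=0.25):
    """Share of (mean-removed) spectral energy above `cutoff` cycles per pixel."""
    centered = img.astype(float) - img.mean()
    spec = np.abs(np.fft.fft2(centered)) ** 2
    fy = np.fft.fftfreq(img.shape[0])[:, None]
    fx = np.fft.fftfreq(img.shape[1])[None, :]
    total = spec.sum()
    if total < 1e-12:  # flat image: no energy besides the removed mean
        return 0.0
    return float(spec[np.hypot(fy, fx) > cutoff].sum() / total)


def complexity_score(img):
    """Equal-weight average of the three cues (the weighting is an assumption)."""
    return (edge_density(img) + normalized_entropy(img) + high_freq_ratio(img)) / 3.0


def choose_patch_size(img, sizes=(32, 16, 8), thresholds=(0.2, 0.5)):
    """Map the score to a patch size: more complex image, smaller patch.

    `sizes` and `thresholds` are hypothetical values, not taken from the paper.
    """
    idx = int(np.searchsorted(thresholds, complexity_score(img)))
    return sizes[idx]


if __name__ == "__main__":
    flat = np.full((64, 64), 0.5)                       # no detail at all
    noise = np.random.default_rng(0).random((64, 64))   # detail everywhere
    print(choose_patch_size(flat), choose_patch_size(noise))
```

Under these assumptions a uniform image scores zero on every cue and falls into the coarsest bucket, while dense texture pushes all three cues up and selects the finest one. In a learned variant, the cue weights and thresholds would presumably be trained rather than hand-set.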

Taxonomy

Topics: Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning · Visual perception and processing mechanisms