DART: Differentiable Dynamic Adaptive Region Tokenizer for Vision Foundation Models
Shicheng Yin, Kaixuan Yin, Yang Liu, Weixing Chen, Liang Lin

TL;DR
DART introduces a differentiable, content-aware region tokenizer that adaptively creates variable-sized patches, significantly improving efficiency and performance in vision models by focusing on information-rich regions.
Contribution
It proposes a novel differentiable adaptive region tokenizer that dynamically allocates tokens based on content, enhancing model efficiency and accuracy across vision tasks.
Findings
A DART-equipped DeiT-Small (22M parameters) matches the performance of DeiT-Base (86M parameters).
It does so at nearly double the inference speed, by capturing high-resolution detail only in key regions.
The approach benefits dense prediction and spatiotemporal video tasks.
Abstract
The content-agnostic, fixed-grid tokenizers used by standard large-scale vision models like Vision Transformer (ViT) and Vision Mamba (Vim) represent a fundamental performance bottleneck, creating a trade-off between capturing fine-grained detail and suffering from redundant computation. To resolve this dilemma, we introduce DART, a fully differentiable Dynamic Adaptive Region Tokenizer. DART employs learnable region scores and quantile-based partitioning to create content-aware patches of varying sizes, intelligently allocating a higher token density to information-rich regions. The impact of this approach is profound: it unlocks a more intelligent scaling paradigm, where a DART-equipped DeiT-Small (22M parameters) matches the performance of a DeiT-Base (86M) with nearly double the inference speed by efficiently capturing high-resolution details in key regions. Furthermore, the…
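The quantile-based partitioning described in the abstract can be illustrated with a minimal one-dimensional sketch. This is not the authors' implementation: the function name `quantile_boundaries`, the `eps` smoothing term, and the example scores are all assumptions made for illustration. The idea shown is only that, given per-cell region scores, placing boundaries at equal steps of the cumulative score gives narrow regions (more tokens) where scores are high, while the number of regions stays fixed.

```python
from bisect import bisect_right
from itertools import accumulate


def quantile_boundaries(scores, num_regions, eps=1e-6):
    """Place num_regions + 1 boundaries so each region holds equal score mass.

    Hypothetical helper for illustration; not taken from the DART code base.
    """
    # Keep every cell strictly positive so the CDF is strictly increasing.
    weights = [max(float(s), 0.0) + eps for s in scores]
    total = sum(weights)
    # cdf[i] is the normalized score mass to the left of cell edge i.
    cdf = [0.0] + [c / total for c in accumulate(weights)]

    boundaries = [0.0]
    for k in range(1, num_regions):
        q = k / num_regions
        # Find the cell whose CDF segment contains q, then invert the
        # piecewise-linear CDF by linear interpolation inside that cell.
        i = min(bisect_right(cdf, q) - 1, len(weights) - 1)
        frac = (q - cdf[i]) / (cdf[i + 1] - cdf[i])
        boundaries.append(i + frac)
    boundaries.append(float(len(weights)))
    return boundaries


# Assumed example scores: the two middle cells are "information-rich".
scores = [1, 1, 6, 6, 1, 1]
boundaries = quantile_boundaries(scores, num_regions=4)
widths = [b - a for a, b in zip(boundaries, boundaries[1:])]
print(boundaries)
print(widths)
```

In this sketch the two middle regions come out narrower than the two outer ones, so the high-score cells receive a higher token density under a fixed budget of four regions. Because each boundary is obtained by interpolating the cumulative scores, it varies continuously with the scores; in an automatic-differentiation framework the same computation would let gradients reach the scores, although this plain-Python version computes no gradients.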
Peer Reviews
Decision: ICLR 2026 Conference Desk Rejected Submission
1. The proposed quantile inversion provides continuous, differentiable boundaries. Additionally, DART-Flow lets tokens “flow” globally to salient zones while preserving a fixed token budget, so batches and token counts remain consistent. This provides computational benefits compared to fully adaptive methods, albeit with some loss in flexibility.
2. Similarly to recent work [6], DART can serve as a drop-in for uniform backbones without altering the backbone. Reminiscent of kernelled positional
1. A central concern of this reviewer is that the related work section is quite underdeveloped. There has been much work done in this field, and the omission of many concurrent works in adaptive tokenization [1,2,3,4,5,6] overstates the conceptual novelty of the approach.
2. The claimed “adaptive pre-tokenization” is functionally limited. The adaptivity of regions is limited, the token count is fixed and the global token budget predetermined. While this is beneficial for efficiency (which is the
- The paper is well written and contains many clear examples, results and ablations to support their design considerations.
- The experiments are conducted on a variety of scales, datasets, scenarios, and model types, suggesting the strong performance of their proposed approach.
- The proposed method is largely novel, adaptive, and scalable, and serves as a significant future direction when working with visual understanding and generation methods
- The authors suggest that their adaptive partitioning and scaling of sequence length can lead to smaller models matching the performance of their larger variants at cheaper resource allocations. Does this hold for denser tasks like object detection and semantic segmentation as well -- or is this limited to simpler classification problems as presented?
- Figure 6 suggests a good handling on increasing image resolution -- can this be extended further for tasks like superresolution with constraine
1. The figures are visually clear and aesthetically pleasing, and the narrative is easy to follow, with ideas presented in a straightforward manner.
2. The experimental details are comprehensive and well-documented.
3. The overall method is technically sound. The motivation is reasonable.
1. The paper lacks a formalized mathematical description of the proposed method, and it would benefit from a detailed architectural diagram of the entire module. Additionally, the implementation relies heavily on existing, well-established components, which could be seen as limiting its originality.
2. The number of baselines used to validate the effectiveness of the proposed module is insufficient, making it difficult to demonstrate its superiority conclusively.
3. In Table 4, the introductio
Taxonomy
Topics: Advanced Image and Video Retrieval Techniques · Advanced Vision and Imaging · Advanced Neural Network Applications
Methods: Absolute Position Encodings · Attention Dropout · Byte Pair Encoding · Label Smoothing · Softmax · Linear Layer · Feedforward Network · Dropout · Dense Connections · Transformer
