Differentiable Hierarchical Visual Tokenization
Marius Aasan, Martine Hjelkrem-Tan, Nico Catalano, Changkyu Choi, Adín Ramírez Rivera

TL;DR
This paper introduces a differentiable hierarchical visual tokenizer that adapts to image content at pixel-level granularity, replacing the fixed patch tokens of Vision Transformers with tokens that follow the spatial and semantic structure of the image.
Contribution
It presents an end-to-end differentiable tokenizer that remains backward-compatible with existing architectures, so pretrained models can be retrofitted, and that delivers competitive performance on both image-level classification and dense-prediction tasks.
Findings
Competitive performance in image classification
Effective dense-prediction capabilities
Supports raster-to-vector conversion out-of-the-box
Abstract
Vision Transformers rely on fixed patch tokens that ignore the spatial and semantic structure of images. In this work, we introduce an end-to-end differentiable tokenizer that adapts to image content with pixel-level granularity while remaining backward-compatible with existing architectures for retrofitting pretrained models. Our method uses hierarchical model selection with information criteria to provide competitive performance in both image-level classification and dense-prediction tasks, and even supports out-of-the-box raster-to-vector conversion.
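To make "hierarchical model selection with information criteria" concrete, here is a minimal sketch of the general idea, not the authors' algorithm. It assumes a toy setting that the abstract does not describe: a grayscale image, a hierarchy of square block partitions, a piecewise-constant model per block, and a Gaussian noise model with known standard deviation `sigma`. Under those assumptions, the Bayesian information criterion for a partition reduces to RSS / sigma^2 + k * ln(n), where k is the number of blocks and n the number of pixels. All function names are hypothetical.

```python
import math

def partition_rss(img, block):
    """Fit one mean per block x block tile; return (residual sum of squares, tile count)."""
    h, w = len(img), len(img[0])
    rss, n_tiles = 0.0, 0
    for top in range(0, h, block):
        for left in range(0, w, block):
            vals = [img[r][c]
                    for r in range(top, min(top + block, h))
                    for c in range(left, min(left + block, w))]
            mean = sum(vals) / len(vals)
            rss += sum((v - mean) ** 2 for v in vals)
            n_tiles += 1
    return rss, n_tiles

def bic_known_variance(rss, n_params, n_obs, sigma):
    """BIC (up to an additive constant) for Gaussian noise with known sigma (an assumption)."""
    return rss / sigma ** 2 + n_params * math.log(n_obs)

def select_block_size(img, block_sizes, sigma=0.1):
    """Score every level of the hierarchy and return the one with the lowest BIC."""
    n_obs = len(img) * len(img[0])
    scores = {}
    for b in block_sizes:
        rss, k = partition_rss(img, b)
        scores[b] = bic_known_variance(rss, k, n_obs, sigma)
    best = min(scores, key=scores.get)
    return best, scores

# Toy 8x8 image made of four constant 4x4 quadrants with values 0, 1, 2, 3.
image = [[float(2 * (r // 4) + (c // 4)) for c in range(8)] for r in range(8)]
best, scores = select_block_size(image, block_sizes=(1, 2, 4, 8))
```

On this toy image the criterion selects a block size of 4: finer partitions fit equally well but pay a larger complexity penalty, while the single 8x8 block fits poorly. A learned tokenizer would apply this kind of fit-versus-complexity trade-off to content-adaptive regions rather than fixed square blocks.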
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Advanced Image and Video Retrieval Techniques · Face recognition and analysis
