Differentiable Hierarchical Visual Tokenization

Marius Aasan; Martine Hjelkrem-Tan; Nico Catalano; Changkyu Choi; Adín Ramírez Rivera

arXiv:2511.02652 · cs.CV · November 5, 2025

TL;DR

This paper introduces a differentiable hierarchical visual tokenizer that adapts to image content at pixel-level granularity, enhancing Vision Transformers by capturing spatial and semantic structures for various vision tasks.

Contribution

The paper presents an end-to-end differentiable tokenizer that is backward-compatible with existing Vision Transformer architectures, so pretrained models can be retrofitted, and that is competitive on both image-level classification and dense-prediction tasks.

Findings

01 Competitive performance in image classification

02 Effective dense-prediction capabilities

03 Supports raster-to-vector conversion out-of-the-box

Abstract

Vision Transformers rely on fixed patch tokens that ignore the spatial and semantic structure of images. In this work, we introduce an end-to-end differentiable tokenizer that adapts to image content with pixel-level granularity while remaining backward-compatible with existing architectures for retrofitting pretrained models. Our method uses hierarchical model selection with information criteria to provide competitive performance in both image-level classification and dense-prediction tasks, and even supports out-of-the-box raster-to-vector conversion.
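As a rough illustration of what model selection with an information criterion over a hierarchy of partitions can look like in general, the sketch below scores piecewise-constant approximations of a small grayscale image at several block sizes and keeps the one with the lowest BIC-style score. It is not the paper's tokenizer: it is not differentiable, it uses fixed square blocks instead of content-adaptive regions, and the names `block_rss`, `bic_known_variance`, `select_level`, the `noise_var` parameter, and the toy image are all assumptions made for this example.

```python
import math


def block_rss(image, block):
    """Residual sum of squares after replacing each block x block tile by its mean."""
    h, w = len(image), len(image[0])
    rss = 0.0
    for top in range(0, h, block):
        for left in range(0, w, block):
            tile = [image[r][c]
                    for r in range(top, min(top + block, h))
                    for c in range(left, min(left + block, w))]
            mean = sum(tile) / len(tile)
            rss += sum((v - mean) ** 2 for v in tile)
    return rss


def bic_known_variance(rss, n_pixels, n_params, noise_var):
    """Gaussian BIC with an assumed, fixed noise variance (additive constants dropped)."""
    return rss / noise_var + n_params * math.log(n_pixels)


def select_level(image, block_sizes=(8, 4, 2, 1), noise_var=0.01):
    """Score every level of a fixed block hierarchy and return the lowest-scoring one."""
    h, w = len(image), len(image[0])
    n = h * w
    scores = {}
    for b in block_sizes:
        # One mean per tile, so the parameter count is the number of tiles.
        k = math.ceil(h / b) * math.ceil(w / b)
        scores[b] = bic_known_variance(block_rss(image, b), n, k, noise_var)
    best = min(scores, key=scores.get)
    return best, scores


# Toy 8x8 image: four constant 4x4 quadrants plus a small deterministic perturbation.
image = [[2 * (r // 4) + (c // 4) + 0.1 * ((3 * r + 5 * c) % 7 - 3) / 3
          for c in range(8)]
         for r in range(8)]

best, scores = select_level(image)
print("selected block size:", best)  # the 4x4 level wins on this toy image
for b, s in scores.items():
    print(f"block={b}: score={s:.1f}")
```

The penalty term `n_params * log(n_pixels)` is what stops the finest level from always winning: one token per pixel fits the image exactly but pays for 64 parameters, while the 4x4 level explains the quadrant structure with only four.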

Videos

Differentiable Hierarchical Visual Tokenization · SlidesLive

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Advanced Image and Video Retrieval Techniques · Face recognition and analysis