ViTok-v2: Scaling Native Resolution Auto-Encoders to 5 Billion Parameters
Philippe Hansen-Estruch, Jiahui Chen, Vivek Ramanujan, Orr Zohar, Yan Ping, Animesh Sinha, Markos Georgopoulos, Edgar Schoenfeld, Ji Hou, Felix Juefei-Xu, Sriram Vishwanath, Ali Thabet

TL;DR
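ViTok-v2 is a large-scale, native resolution image autoencoder that improves reconstruction quality and training stability by combining NaFlex-based native resolution support with a DINOv3 perceptual loss that replaces LPIPS and GAN objectives. It scales up to 5 billion parameters and is trained on about 2 billion images.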
Contribution
The paper introduces ViTok-v2, a ViT autoencoder with native resolution support and a new DINOv3 perceptual loss, which together enable stable training and scaling to 5B parameters for improved image reconstruction.
Findings
ViTok-v2 matches or exceeds state-of-the-art reconstruction at 256p.
It outperforms baselines at 512p and above.
Scaling both autoencoder and generator improves the reconstruction-generation trade-off.
Abstract
Vision Transformer (ViT) autoencoders have emerged as compelling tokenizers for images, offering improved reconstruction over convolutional tokenizers. ViTok (Hansen-Estruch et al., 2025) found that the compression ratio r mediates a reconstruction-generation trade-off where lower r means better reconstructions but harder generations, so improving tokenizer reconstruction is key to more Pareto-optimal tokenizers. However, existing ViT tokenizers cannot explore this landscape, as performance degrades outside training resolutions and reliance on adversarial losses prevents stable scaling. We introduce ViTok-v2, which addresses these limitations with native resolution support via NaFlex for generalization across resolutions and aspect ratios, and a novel DINOv3 perceptual loss that replaces both LPIPS and GAN objectives for stable training at any scale. ViTok-v2 is trained on about 2B…
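To make the compression ratio r concrete, here is a minimal sketch under stated assumptions. It takes r to be the number of input pixel values divided by the number of latent values, which is one common definition and is assumed here rather than quoted from the abstract. The patch size of 16 and latent width of 16 channels are illustrative placeholders, not the paper's configuration, and `token_grid` and `compression_ratio` are hypothetical helper names.

```python
import math

def token_grid(height, width, patch):
    """Patch tokens along each axis; partial patches are padded up (ceil)."""
    return math.ceil(height / patch), math.ceil(width / patch)

def compression_ratio(height, width, patch, latent_channels, image_channels=3):
    """Assumed definition: input values per latent value,
    (H * W * C_in) / (num_tokens * C_latent)."""
    rows, cols = token_grid(height, width, patch)
    return (height * width * image_channels) / (rows * cols * latent_channels)

# Illustrative settings only (not taken from the paper).
for h, w in [(256, 256), (512, 512), (480, 640)]:
    rows, cols = token_grid(h, w, patch=16)
    r = compression_ratio(h, w, patch=16, latent_channels=16)
    print(f"{h}x{w}: {rows * cols} tokens, r = {r:.1f}")
```

Under these assumptions the token count grows with image area and follows the aspect ratio, while r stays fixed by the patch size and latent width. Lowering r (smaller patches or wider latents) keeps more information per image, which is the direction the abstract associates with better reconstructions but harder generation.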
