FlexTok: Resampling Images into 1D Token Sequences of Flexible Length
Roman Bachmann, Jesse Allardice, David Mizrahi, Enrico Fini, Oğuzhan Fatih Kar, Elmira Amirloo, Alaaeldin El-Nouby, Amir Zamir, Afshin Dehghan

TL;DR
FlexTok introduces a flexible, variable-length 1D image tokenizer that adapts to image complexity, enabling efficient autoregressive image generation with high quality across different token counts.
Contribution
We propose FlexTok, a novel image tokenizer that produces variable-length 1D token sequences, allowing adaptive compression and improved generation quality.
Findings
Achieves FID < 2 on ImageNet using anywhere from 8 to 128 tokens
Outperforms TiTok and matches state-of-the-art methods while using fewer tokens
Enables coarse-to-fine image description in token space
Abstract
Image tokenization has enabled major advances in autoregressive image generation by providing compressed, discrete representations that are more efficient to process than raw pixels. While traditional approaches use 2D grid tokenization, recent methods like TiTok have shown that 1D tokenization can achieve high generation quality by eliminating grid redundancies. However, these methods typically use a fixed number of tokens and thus cannot adapt to an image's inherent complexity. We introduce FlexTok, a tokenizer that projects 2D images into variable-length, ordered 1D token sequences. For example, a 256x256 image can be resampled into anywhere from 1 to 256 discrete tokens, hierarchically and semantically compressing its information. By training a rectified flow model as the decoder and using nested dropout, FlexTok produces plausible reconstructions regardless of the chosen token…
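The nested dropout mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`nested_dropout_mask`, `apply_nested_dropout`), the uniform sampling of the keep-length, the zero fill value, and the toy tensor shapes are all assumptions chosen for illustration. The idea shown is only that, during training, each token sequence is truncated to a random prefix, so earlier tokens are pushed to carry the coarsest information.

```python
import numpy as np

def nested_dropout_mask(batch_size, max_tokens, rng):
    """Sample a keep-length per example and build a prefix mask.

    Assumption: keep-lengths are drawn uniformly from 1..max_tokens.
    The paper may use a different sampling schedule.
    """
    keep = rng.integers(1, max_tokens + 1, size=batch_size)
    # mask[b, i] is True when token i is kept, i.e. i < keep[b].
    mask = np.arange(max_tokens)[None, :] < keep[:, None]
    return keep, mask

def apply_nested_dropout(tokens, mask, fill_value=0.0):
    """Replace every token after the kept prefix with fill_value.

    tokens: array of shape (batch, max_tokens, dim)
    mask:   boolean array of shape (batch, max_tokens)
    """
    return np.where(mask[..., None], tokens, fill_value)

# Toy example: 4 sequences of 256 token embeddings with 8 dimensions each.
rng = np.random.default_rng(0)
tokens = rng.normal(size=(4, 256, 8))
keep, mask = nested_dropout_mask(4, 256, rng)
truncated = apply_nested_dropout(tokens, mask)
```

In this sketch a decoder trained on `truncated` would see a different prefix length for each example, which is one way to make reconstructions usable at any token count.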
Code & Models
- 🤗 EPFL-VILAB/flextok_vae_c4 · model · 17 downloads
- 🤗 EPFL-VILAB/flextok_vae_c8 · model · 4 downloads
- 🤗 EPFL-VILAB/flextok_vae_c16 · model · 10 downloads
- 🤗 EPFL-VILAB/flextok_d12_d12_in1k · model · 29 downloads
- 🤗 EPFL-VILAB/flextok_d18_d18_in1k · model · 5 downloads
- 🤗 EPFL-VILAB/flextok_d18_d28_in1k · model · 48 downloads
- 🤗 EPFL-VILAB/flextok_d18_d28_dfn · model · 5.3k downloads · ♡ 1
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Computer Graphics and Visualization Techniques · Medical Image Segmentation Techniques
Methods: Attention Is All You Need · Linear Layer · Byte Pair Encoding · Layer Normalization · Residual Connection · Dense Connections · Multi-Head Attention · Position-Wise Feed-Forward Layer · Adam · Softmax
