On the Adversarial Robustness of Discrete Image Tokenizers
Rishika Bhagwatkar, Irina Rish, Nicolas Flammarion, Francesco Croce

TL;DR
This paper investigates the vulnerability of discrete image tokenizers to adversarial attacks and proposes an unsupervised adversarial training method to enhance their robustness across various multimodal tasks.
Contribution
First, it formulates effective, application-agnostic adversarial attacks on discrete tokenizers; second, it introduces an unsupervised adversarial training approach to improve robustness.
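One way to write these two contributions as objectives is sketched below. The notation is assumed for illustration and is not taken from the paper: $f_\theta$ denotes the tokenizer's encoder before quantization, $f_{\theta_0}$ a frozen copy of the original encoder, $\epsilon$ an $\ell_\infty$ perturbation budget, and the squared $\ell_2$ feature distance stands in for whatever loss the authors actually use. The training objective follows the general style of unsupervised robust fine-tuning for CLIP encoders.

```latex
% Assumed attack objective: push the features of x + delta away from those of x.
\delta^\star \in \arg\max_{\|\delta\|_\infty \le \epsilon}
  \; \big\| f_\theta(x+\delta) - f_\theta(x) \big\|_2^2

% Assumed defense objective: keep adversarial features close to the original clean features.
\min_\theta \; \mathbb{E}_{x}\Big[ \max_{\|\delta\|_\infty \le \epsilon}
  \big\| f_\theta(x+\delta) - f_{\theta_0}(x) \big\|_2^2 \Big]
```

Neither objective uses labels or a downstream task head, which is what makes the formulation unsupervised and application-agnostic in the sense described above.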
Findings
Attacks effectively perturb tokenizer features across tasks.
Unsupervised adversarial training enhances robustness significantly.
Method generalizes well to unseen data and tasks.
Abstract
Discrete image tokenizers encode visual inputs as sequences of tokens from a finite vocabulary and are gaining popularity in multimodal systems, including encoder-only, encoder-decoder, and decoder-only models. However, unlike CLIP encoders, their vulnerability to adversarial attacks has not been explored. Ours being the first work studying this topic, we first formulate attacks that aim to perturb the features extracted by discrete tokenizers, and thus change the extracted tokens. These attacks are computationally efficient, application-agnostic, and effective across classification, multimodal retrieval, and captioning tasks. Second, to defend against this vulnerability, inspired by recent work on robust CLIP encoders, we fine-tune popular tokenizers with unsupervised adversarial training, keeping all other components frozen. While unsupervised and task-agnostic, our approach…
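The feature-space attack described in the abstract can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the encoder is a random linear map `W`, the codebook is random, the image is a handful of flat patches, and plain sign-gradient PGD replaces the APGD attack that the reviews mention. The names `encode`, `quantize`, `tokenize`, `feature_loss`, and `pgd_feature_attack` are hypothetical and do not come from the paper.

```python
import random

def encode(W, patch):
    """Toy linear encoder (assumption): feature = W @ patch."""
    return [sum(w * v for w, v in zip(row, patch)) for row in W]

def quantize(codebook, feat):
    """Return the index of the nearest codebook vector (squared L2)."""
    dists = [sum((f - c) ** 2 for f, c in zip(feat, code)) for code in codebook]
    return dists.index(min(dists))

def tokenize(W, codebook, patches):
    """Map each patch to one discrete token."""
    return [quantize(codebook, encode(W, p)) for p in patches]

def feature_loss(W, clean, adv):
    """Sum over patches of ||f(adv) - f(clean)||^2 (no labels needed)."""
    total = 0.0
    for pc, pa in zip(clean, adv):
        fc, fa = encode(W, pc), encode(W, pa)
        total += sum((a - c) ** 2 for a, c in zip(fa, fc))
    return total

def pgd_feature_attack(W, patches, eps=8 / 255, steps=20, seed=0):
    """Sign-gradient PGD that maximizes the feature distance under an L-inf budget."""
    rng = random.Random(seed)
    step = eps / 4
    # Random start: the gradient of the loss is exactly zero at delta = 0.
    adv = [[min(1.0, max(0.0, v + rng.uniform(-eps, eps))) for v in p]
           for p in patches]
    for _ in range(steps):
        for p, a in zip(patches, adv):
            diff = [ai - pi for ai, pi in zip(a, p)]
            r = encode(W, diff)  # residual in feature space, W @ (a - p)
            # Gradient of ||W (a - p)||^2 with respect to a is 2 * W^T r.
            grad = [2 * sum(W[j][i] * r[j] for j in range(len(W)))
                    for i in range(len(a))]
            for i, g in enumerate(grad):
                sign = (g > 0) - (g < 0)
                v = a[i] + step * sign
                v = min(p[i] + eps, max(p[i] - eps, v))  # project onto the eps-ball
                a[i] = min(1.0, max(0.0, v))             # keep a valid pixel value
    return adv

# Toy setup: 6 patches of 8 pixels, 4-dim features, 16-entry codebook.
rng = random.Random(1)
W = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(4)]
codebook = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(16)]
patches = [[rng.random() for _ in range(8)] for _ in range(6)]

adv = pgd_feature_attack(W, patches)
clean_tokens = tokenize(W, codebook, patches)
adv_tokens = tokenize(W, codebook, adv)
changed = sum(c != a for c, a in zip(clean_tokens, adv_tokens))
print(f"feature loss: {feature_loss(W, patches, adv):.4f}, "
      f"tokens changed: {changed}/{len(patches)}")
```

The attack only needs the encoder's features, not the downstream classifier, retriever, or captioner, which is why a single perturbation can be reused across tasks. Whether tokens actually flip in this toy setting depends on how close the clean features sit to codebook decision boundaries.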
Peer Reviews
Decision: Submitted to ICLR 2026
- Novel problem definition: First dedicated work on adversarial robustness for discrete image tokenizers, an important but overlooked component in multimodal systems.
- Well-motivated and efficient method: Attack design is simple, label-free, and computationally less expensive compared to end-to-end attacks.
- Strong empirical coverage: Extensive experiments across diverse downstream tasks and datasets (classification, retrieval, VQA, captioning) validate both attacks and defenses.
- The attack and defense methods employed are effective but relatively classical (APGD, standard adversarial training).
- The study currently focuses on a small set of tokenizer architectures (TiTok, UniTok). Exploring a wider variety of tokenizer designs, such as different quantization schemes (VQ vs FSQ), codebook sizes, number of tokens, or hybrid architectures, would help assess whether the proposed defense generalizes across structural variations and reveal design factors influencing robus
- **Problem Motivation and Scope:** The paper is well-motivated, addressing a critical yet underexplored vulnerability in multimodal foundation models. The focus on discrete image tokenizers, now ubiquitous in modern vision and vision-language pipelines, is both timely and highly relevant.
- **General, Task-Agnostic Defense:** The work introduces an unsupervised adversarial fine-tuning strategy that operates entirely at the tokenizer level and requires only unlabeled images. The resulti
1. **Minor Grammatical and Typographical Errors:**
   - “captioninig” → “captioning” (Section 4.2, “VQA and captioninig tasks”)
   - “severaly degraded” → “severely degraded” (Table 4 discussion, “the resulting clean performance is several[y] degraded”)
   - “unsupervsied” → “unsupervised” (Discussion section, “improve robustness against unsupervsied and end-to-end supervised attacks”)
   - “imputs” → “inputs” (Related work section, “extend masked modeling losses to visual imputs”)
2. *
- This is the first work to systematically study the adversarial robustness of discrete image tokenizers.
- The experimental results demonstrate the effectiveness of the proposed unsupervised adversarial fine-tuning.
- The defense method only requires fine-tuning the tokenizer's encoder, while keeping the much larger downstream components like LLMs frozen.
- As shown in Figure 1, the unsupervised attack performs comparably to, or even weaker than, standard end-to-end supervised attacks, especially at small $\epsilon$ values. The authors do not demonstrate that this attack uncovers new vulnerabilities that supervised attacks miss. The motivation is unclear: is the only reason that the attack is unsupervised? The authors claim the attack is 'unsupervised' and 'task-agnostic'. However, this property is not unique to tokenizer models. Unsupervised attacks on the embed
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Generative Adversarial Networks and Image Synthesis · Digital Media Forensic Detection
