Laundering AI Authority with Adversarial Examples
Jie Zhang, Pura Peetathawatchai, Florian Tramèr, Avital Shafran

TL;DR
This paper shows that subtly perturbed images (adversarial examples) can make vision-language models give confident, authoritative answers about the wrong input, a significant safety risk for the trustworthiness of AI systems.
Contribution
It demonstrates that standard adversarial attacks crafted against publicly available CLIP models transfer to production VLMs, exposing vulnerabilities that can be exploited to spread misinformation and evade moderation.
Findings
Attack success rates range from 22% to 100% across multiple models.
Standard, decades-old attack techniques suffice for effective manipulation.
Attacks transfer reliably from publicly available models to commercial systems.
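To make "standard attack technique" concrete, here is a minimal sketch of one well-known method, targeted projected gradient descent (PGD) under an L-infinity bound. It is not the authors' code: the random linear map `W` is a hypothetical stand-in for an image encoder such as CLIP, and the dimensions, `eps`, `step`, and `iters` are illustrative assumptions rather than values from the paper.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two 1-D vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def pgd_linf(W, x, target_emb, eps=0.03, step=0.005, iters=100):
    """Targeted L-infinity PGD against a toy linear encoder e = W @ x.

    Raises the cosine similarity between the perturbed input's embedding
    and target_emb while keeping every pixel within eps of the original.
    """
    t = target_emb / np.linalg.norm(target_emb)
    x_adv = x.copy()
    for _ in range(iters):
        e = W @ x_adv
        n = np.linalg.norm(e) + 1e-12
        # Gradient of cos(e, t) with respect to e, then chain rule through W.
        grad_e = t / n - (e @ t) * e / n**3
        grad_x = W.T @ grad_e
        # Signed ascent step, then project back into the eps-ball
        # around x and into the valid pixel range [0, 1].
        x_adv = x_adv + step * np.sign(grad_x)
        x_adv = np.clip(x_adv, x - eps, x + eps)
        x_adv = np.clip(x_adv, 0.0, 1.0)
    return x_adv

# Hypothetical toy setup: flattened 256-pixel "images", 32-dim embeddings.
rng = np.random.default_rng(0)
W = rng.normal(size=(32, 256))          # assumed stand-in encoder, not CLIP
x = rng.uniform(size=256)               # source image, pixels in [0, 1]
x_target = rng.uniform(size=256)        # image the attacker wants to mimic
target_emb = W @ x_target

x_adv = pgd_linf(W, x, target_emb)
sim_before = cosine(W @ x, target_emb)
sim_after = cosine(W @ x_adv, target_emb)
max_change = float(np.max(np.abs(x_adv - x)))
```

In this toy setting, `sim_after` exceeds `sim_before` while `max_change` stays within `eps`. A real encoder is nonlinear, so the gradient would come from automatic differentiation rather than the closed form used here.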
Abstract
Vision-language models (VLMs) are increasingly deployed as trusted authorities: fact-checking images on social media, comparing products, and moderating content. Users implicitly trust that these systems perceive the same visual content as they do. We show that adversarial examples break this assumption, enabling AI authority laundering: an attacker subtly perturbs an image so that the VLM produces confident and authoritative responses about the wrong input. Unlike jailbreaks or prompt injections, our attacks do not compromise model alignment; the attack operates entirely at the perceptual level. We demonstrate that standard attacks against publicly available CLIP models transfer reliably to production VLMs, including GPT-5.4, Claude Opus 4.6, Gemini 3, and Grok 4.2. Across four attack surfaces, we show that authority laundering can amplify misinformation, disparage…
