Laundering AI Authority with Adversarial Examples

Jie Zhang; Pura Peetathawatchai; Florian Tramèr; Avital Shafran

arXiv:2605.04261 · cs.CR · May 7, 2026

TL;DR

This paper shows that subtly perturbed images can make vision-language models give confident, authoritative, yet false responses, which undermines trust in AI systems deployed as authorities.

Contribution

The paper demonstrates that standard adversarial attacks crafted against publicly available CLIP models transfer to production VLMs, exposing vulnerabilities that can be exploited to spread misinformation and evade moderation.

Findings

01

Attacks succeed at rates of 22-100% across the models tested.

02

Standard, decades-old attack techniques suffice for effective manipulation.

03

Attacks transfer reliably from publicly available models to commercial systems.

Abstract

Vision-language models (VLMs) are increasingly deployed as trusted authorities: fact-checking images on social media, comparing products, and moderating content. Users implicitly trust that these systems perceive the same visual content as they do. We show that adversarial examples break this assumption, enabling AI authority laundering: an attacker subtly perturbs an image so that the VLM produces confident and authoritative responses about the wrong input. Unlike jailbreaks or prompt injections, our attacks do not compromise model alignment; the attack operates entirely at the perceptual level. We demonstrate that standard attacks against publicly available CLIP models transfer reliably to production VLMs, including GPT-5.4, Claude Opus 4.6, Gemini 3, and Grok 4.2. Across four attack surfaces, we show that authority laundering can amplify misinformation, disparage…
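
The abstract refers to "standard attacks" on image encoders without spelling them out in this excerpt. As a rough illustration of what such an attack can look like, here is a minimal sketch of a targeted L-infinity projected gradient attack on an embedding model. It is not the authors' implementation: a small random linear map `W` stands in for a CLIP image encoder, and the function name `pgd_targeted`, the budget `eps = 8/255`, the step size, and the iteration count are all assumptions chosen for illustration.

```python
import numpy as np


def cosine(a, b):
    """Cosine similarity between two 1-D vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


def pgd_targeted(x, W, target_emb, eps=8 / 255, step=1 / 255, iters=100):
    """Targeted L-infinity PGD against a toy linear encoder z = W @ x.

    Moves x within an eps-ball so that its embedding becomes more
    similar (in cosine terms) to target_emb. W is a hypothetical
    stand-in for a real image encoder.
    """
    t = target_emb / np.linalg.norm(target_emb)
    x_adv = x.copy()
    for _ in range(iters):
        z = W @ x_adv
        nz = np.linalg.norm(z)
        cos = z @ t / nz
        # Analytic gradient of cos(z, t) with respect to z (t is unit norm).
        grad_z = (t - cos * z / nz) / nz
        # Chain rule through the linear encoder.
        grad_x = W.T @ grad_z
        # Signed ascent step, then project back into the eps-ball around x.
        x_adv = x_adv + step * np.sign(grad_x)
        x_adv = np.clip(x_adv, x - eps, x + eps)
        # Keep values in the valid pixel range [0, 1].
        x_adv = np.clip(x_adv, 0.0, 1.0)
    return x_adv


# Toy data: a 64-"pixel" source image and a different target image.
rng = np.random.default_rng(0)
W = rng.normal(size=(16, 64))
x = rng.uniform(size=64)
x_target = rng.uniform(size=64)
target_emb = W @ x_target

x_adv = pgd_targeted(x, W, target_emb)
print("max pixel change:", np.max(np.abs(x_adv - x)))
print("similarity before:", cosine(W @ x, target_emb))
print("similarity after: ", cosine(W @ x_adv, target_emb))
```

With a real encoder the analytic gradient would be replaced by automatic differentiation, and whether a perturbation found this way transfers to a production VLM is an empirical question that the paper addresses; this sketch only shows the perturbation step itself.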
