FameBias: Embedding Manipulation Bias Attack in Text-to-Image Models
Jaechul Roh, Andrew Yuan, Jinsong Mao

TL;DR
FameBias is an attack that manipulates the input embeddings of text-to-image models, without retraining them, so that they generate images of specific public figures, raising concerns about bias and misuse.
Contribution
This paper introduces FameBias, a training-free technique that manipulates the input embeddings of text-to-image (T2I) models to bias generated images toward a chosen target.
Findings
High attack success rate in generating targeted images.
Preserves the semantic meaning of prompts.
Effective across multiple trigger-target pairs.
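The summary does not say how these findings are measured. A common convention in the biasing-attack literature, assumed here rather than taken from the paper, is to report attack success rate (ASR) as the fraction of trigger-prompt generations identified as the target, and to proxy semantic preservation with a CLIP-score-style cosine similarity between prompt and image embeddings. The sketch below shows only that arithmetic; the identity classifier and the embedding model that would produce its inputs are hypothetical and not shown.

```python
import numpy as np

def attack_success_rate(predicted_identities, target_identity):
    """Fraction of generated images whose predicted identity equals the target.

    `predicted_identities` would come from some identity classifier run on
    images generated from trigger prompts (hypothetical, not shown).
    """
    if len(predicted_identities) == 0:
        raise ValueError("need at least one prediction")
    hits = sum(1 for p in predicted_identities if p == target_identity)
    return hits / len(predicted_identities)

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def mean_semantic_preservation(prompt_embs, image_embs):
    """Average prompt-image cosine similarity (a CLIP-score-style proxy)."""
    if len(prompt_embs) != len(image_embs):
        raise ValueError("prompt and image embedding lists must align")
    scores = [cosine_similarity(p, i) for p, i in zip(prompt_embs, image_embs)]
    return float(np.mean(scores))

# Example: 3 of 4 generations are identified as the target.
print(attack_success_rate(["target", "other", "target", "target"], "target"))  # 0.75
```

A high ASR together with an unchanged semantic-preservation score is what "targeted but otherwise faithful" would look like under these assumed metrics.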
Abstract
Text-to-Image (T2I) diffusion models have rapidly advanced, enabling the generation of high-quality images that align closely with textual descriptions. However, this progress has also raised concerns about their misuse for propaganda and other malicious activities. Recent studies reveal that attackers can embed biases into these models through simple fine-tuning, causing them to generate targeted imagery when triggered by specific phrases. This underscores the potential for T2I models to act as tools for disseminating propaganda, producing images aligned with an attacker's objective for end-users. Building on this concept, we introduce FameBias, a T2I biasing attack that manipulates the embeddings of input prompts to generate images featuring specific public figures. Unlike prior methods, FameBias operates solely on the input embedding vectors without requiring additional model…
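The abstract is truncated before it describes the actual manipulation, so the following is only a toy illustration of the general idea it names: changing input embedding vectors at inference time while leaving model weights untouched. It assumes, hypothetically, that when a trigger token appears, the embeddings at the trigger positions are blended toward an embedding standing in for the target figure. `trigger_id`, `target_emb`, and `alpha` are illustrative names, not FameBias's procedure; in a real T2I pipeline the vectors would come from the model's text encoder, whereas here they are plain arrays.

```python
import numpy as np

def bias_prompt_embeddings(token_ids, token_embs, trigger_id, target_emb, alpha=0.5):
    """Hypothetical toy: blend embeddings at trigger positions toward a target.

    token_ids:  (seq_len,) integer token ids of the prompt
    token_embs: (seq_len, dim) input embedding vectors for those tokens
    trigger_id: token id treated as the trigger (illustrative)
    target_emb: (dim,) vector standing in for the target identity (illustrative)
    alpha:      blend strength in [0, 1]; 0 leaves the prompt unchanged
    Only the input vectors change; no model weights are modified.
    """
    token_ids = np.asarray(token_ids)
    embs = np.array(token_embs, dtype=float, copy=True)  # never mutate the caller's array
    mask = token_ids == trigger_id
    if not mask.any():
        return embs  # no trigger present: prompt passes through unchanged
    target = np.asarray(target_emb, dtype=float)
    # Linear interpolation at the trigger positions; target broadcasts over rows.
    embs[mask] = (1.0 - alpha) * embs[mask] + alpha * target
    return embs

ids = [1, 7, 3]
embs = np.eye(3)
out = bias_prompt_embeddings(ids, embs, trigger_id=7, target_emb=[0.0, 0.0, 1.0])
print(out[1])  # [0.  0.5 0.5]
```

Prompts without the trigger pass through untouched, which mirrors, in toy form, why an embedding-level attack can stay dormant on ordinary inputs.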
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Digital Media Forensic Detection
Methods: Diffusion · ALIGN
