BlackMirror: Black-Box Backdoor Detection for Text-to-Image Models via Instruction-Response Deviation

Feiran Li; Qianqian Xu; Shilong Bao; Zhiyong Yang; Xilin Zhao; Xiaochun Cao; Qingming Huang

arXiv:2603.05921·cs.CV·March 9, 2026

TL;DR

BlackMirror is a training-free, plug-and-play framework that detects backdoored text-to-image models by analyzing semantic deviations in generated images across varied prompts, effectively identifying diverse backdoor attacks.

Contribution

The paper introduces BlackMirror, a novel black-box detection framework that generalizes to diverse backdoor attacks by analyzing semantic pattern deviations without training.

Findings

01. BlackMirror achieves high detection accuracy across various backdoor attacks.

02. It effectively identifies backdoors even when generated images are visually diverse.

03. The framework is applicable in real-world Model-as-a-Service settings.

Abstract

This paper investigates the challenging task of detecting backdoored text-to-image models under black-box settings and introduces a novel detection framework, BlackMirror. Existing approaches typically rely on analyzing image-level similarity, under the assumption that backdoor-triggered generations exhibit strong consistency across samples. However, they struggle to generalize to recently emerging backdoor attacks, where backdoored generations can appear visually diverse. BlackMirror is motivated by an observation: across backdoor attacks, only partial semantic patterns within the generated image are steadily manipulated, while the rest of the content remains diverse or benign. Accordingly, BlackMirror consists of two components: MirrorMatch, which aligns visual patterns with the corresponding instructions to detect semantic deviations; and MirrorVerify, which evaluates the stability…
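
The observation in the abstract (a few semantic patterns are manipulated steadily while the rest of the image varies) can be illustrated with a toy sketch. This is not the authors' implementation: the abstract is truncated and does not specify how MirrorMatch or MirrorVerify are built. The function names, the 0.8 threshold, and the idea of representing each image as a pre-extracted set of concept labels (standing in for whatever a vision-language model would report) are all assumptions made for illustration.

```python
from collections import Counter


def deviating_patterns(instruction_concepts, detected_patterns):
    """Return patterns found in the image that the instruction never asked for.

    Both arguments are sets of concept labels. In a real pipeline they would
    come from parsing the prompt and from an image-understanding model; here
    they are supplied by hand (assumption).
    """
    return set(detected_patterns) - set(instruction_concepts)


def stable_deviations(samples, min_rate=0.8):
    """Find deviations that recur across many varied prompts.

    samples: list of (instruction_concepts, detected_patterns) pairs, one per
             generated image.
    min_rate: hypothetical threshold; a deviation is kept only if it shows up
              in at least this fraction of the samples.
    Returns a dict mapping each recurring deviation to its occurrence rate.
    """
    counts = Counter()
    for concepts, detected in samples:
        counts.update(deviating_patterns(concepts, detected))
    n = len(samples)
    return {p: c / n for p, c in counts.items() if c / n >= min_rate}


# Hand-made example: every image contains an unrequested "logo", while other
# unrequested content ("lamp", "bird") appears only once each.
samples = [
    ({"dog", "park"}, {"dog", "park", "logo"}),
    ({"cat", "sofa"}, {"cat", "sofa", "logo", "lamp"}),
    ({"car", "street"}, {"car", "street", "logo"}),
    ({"tree", "river"}, {"tree", "logo", "bird"}),
]
suspicious = stable_deviations(samples)
print(suspicious)
```

In this toy setting only the pattern that deviates from the instruction in every sample survives the threshold. One-off deviations are treated as ordinary generation noise, which is the distinction the abstract draws between steadily manipulated patterns and diverse or benign content.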

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Advanced Malware Detection Techniques · Generative Adversarial Networks and Image Synthesis