Not What You Asked For: Typographic Attacks in Household Robot Manipulation

Ali Iranmanesh; Peng Liu

arXiv:2605.18593·cs.CR·May 19, 2026

TL;DR

This paper demonstrates that typographic attacks on vision-language models like CLIP can cause household robots to physically manipulate the wrong objects, revealing a significant safety vulnerability.

Contribution

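The paper evaluates typographic attacks across the full Sense-Plan-Act pipeline of household robot manipulation, using the HomeRobot benchmark in a Habitat-based simulation, and shows that perception-level errors carry through to the robot's physical actions.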

Findings

01. Attack success rate of 67.8% under realistic conditions

02. Typographic attacks lead to physical grasping of incorrect objects

03. Reveals safety risks in current robot manipulation pipelines
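
As a rough illustration of how a figure like the 67.8% above might be computed, the sketch below assumes the common definition of attack success rate: successful attacks divided by evaluated episodes. The abstract reports a pool of 59 attributable episodes; the count of 40 successes is an inference from that pool size and the rounded percentage, not a number stated in this summary, and the function name is hypothetical.

```python
def attack_success_rate(outcomes):
    """Fraction of episodes in which the attack succeeded.

    `outcomes` is a list of booleans, one per evaluated episode.
    This assumes the usual definition (successes / episodes); the
    paper's exact attribution rules are not given in this summary.
    """
    if not outcomes:
        raise ValueError("need at least one episode")
    return sum(outcomes) / len(outcomes)


# Assumed split: 40 successes out of 59 attributable episodes.
# Only the pool size (59) and the rate (67.8%) appear in the summary.
outcomes = [True] * 40 + [False] * 19
print(f"{attack_success_rate(outcomes):.1%}")
```

With a pool this small, one episode moves the rate by about 1.7 percentage points, which is worth keeping in mind when comparing conditions.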

Abstract

Open-vocabulary embodied AI agents increasingly rely on vision-language models such as CLIP for object perception and task grounding. However, the shared embedding space that enables this flexibility introduces a structural vulnerability to typographic attacks, where printed text in a physical scene semantically overrides visual judgment. While prior work has quantified this threat in static 2D benchmarks and 3D navigation tasks, its impact on the full Sense-Plan-Act pipeline of household robot manipulation remains unexplored. This work evaluates typographic attacks in a Habitat-based simulation using the HomeRobot benchmark. We introduce a decoupled perception architecture that exposes a frozen CLIP encoder to adversarial stickers while maintaining geometric grounding via DETIC. In a controlled evaluation pool of 59 attributable episodes, the attack achieves an overall Attack Success…
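
The mechanism described in the abstract can be sketched with a toy model. The code below does not use CLIP or the paper's pipeline: the 3-dimensional vectors, the labels, and the `add_sticker` helper are all made up for illustration, and the sticker is modeled under the simplifying assumption that printed text pulls an image embedding toward the embedding of the printed word. It only shows why nearest-label matching in a shared embedding space can be steered by text in the scene.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norms


# Hypothetical text embeddings for two candidate object labels.
TEXT_EMBEDDINGS = {
    "mug": [1.0, 0.0, 0.2],
    "knife": [0.0, 1.0, 0.2],
}


def classify(image_embedding):
    """Zero-shot style matching: pick the label whose text embedding is closest."""
    return max(
        TEXT_EMBEDDINGS,
        key=lambda label: cosine(image_embedding, TEXT_EMBEDDINGS[label]),
    )


def add_sticker(image_embedding, sticker_text, strength):
    """Toy attack model (an assumption, not the paper's method): blend the
    image embedding toward the embedding of the word printed on the sticker."""
    target = TEXT_EMBEDDINGS[sticker_text]
    return [(1 - strength) * x + strength * t for x, t in zip(image_embedding, target)]


clean_mug = [0.9, 0.1, 0.3]  # made-up embedding of a mug image
attacked_mug = add_sticker(clean_mug, "knife", strength=0.7)

print(classify(clean_mug))     # mug
print(classify(attacked_mug))  # knife
```

In a manipulation pipeline, a flipped label of this kind would feed the planner a wrong target, which is how a perception error can become a wrong grasp.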
