Not What You Asked For: Typographic Attacks in Household Robot Manipulation
Ali Iranmanesh, Peng Liu

TL;DR
This paper demonstrates that typographic attacks on vision-language models like CLIP can cause household robots to physically manipulate the wrong objects, revealing a significant safety vulnerability.
Contribution
It evaluates typographic attacks across the full pipeline of household robot manipulation, in a Habitat-based simulation using the HomeRobot benchmark, and shows their impact on robot safety.
Findings
Attack success rate of 67.8% under realistic conditions
Typographic attacks lead the robot to physically grasp incorrect objects
Reveals safety risks in current robot manipulation pipelines
Abstract
Open-vocabulary embodied AI agents increasingly rely on vision-language models such as CLIP for object perception and task grounding. However, the shared embedding space that enables this flexibility introduces a structural vulnerability to typographic attacks, where printed text in a physical scene semantically overrides visual judgment. While prior work has quantified this threat in static 2D benchmarks and 3D navigation tasks, its impact on the full Sense-Plan-Act pipeline of household robot manipulation remains unexplored. This work evaluates typographic attacks in a Habitat-based simulation using the HomeRobot benchmark. We introduce a decoupled perception architecture that exposes a frozen CLIP encoder to adversarial stickers while maintaining geometric grounding via DETIC. In a controlled evaluation pool of 59 attributable episodes, the attack achieves an overall Attack Success…
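
The mechanism the abstract describes, printed text overriding visual judgment in a shared embedding space, can be illustrated with a toy sketch. This is not the paper's code and does not call CLIP: the 3-dimensional vectors, the `add_sticker` helper, and the `strength` value are hypothetical stand-ins chosen only to show how nearest-label matching by cosine similarity can flip once text-like content is mixed into an image embedding.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def classify(image_emb, text_embs):
    """Return the label whose text embedding is most similar to the image embedding."""
    return max(text_embs, key=lambda label: cosine(image_emb, text_embs[label]))

def add_sticker(image_emb, label_emb, strength):
    """Hypothetical model of a sticker: shift the image embedding toward a text embedding."""
    return [x + strength * y for x, y in zip(image_emb, label_emb)]

# Toy embeddings (assumed values, not produced by any real encoder).
text_embs = {"cup": [1.0, 0.0, 0.0], "bottle": [0.0, 1.0, 0.0]}
clean_cup = [0.9, 0.1, 0.0]

# A cup carrying a sticker that reads "bottle".
attacked_cup = add_sticker(clean_cup, text_embs["bottle"], strength=1.5)

print(classify(clean_cup, text_embs))
print(classify(attacked_cup, text_embs))
```

In this toy setup the clean embedding matches "cup" and the stickered one matches "bottle". In a manipulation pipeline, a downstream grasp planner that trusts this label would then act on the wrong object.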
