Attacking Visual Language Grounding with Adversarial Examples: A Case Study on Neural Image Captioning
Hongge Chen, Huan Zhang, Pin-Yu Chen, Jinfeng Yi, Cho-Jui Hsieh

TL;DR
This paper introduces Show-and-Fool, an algorithm for generating adversarial examples that can mislead neural image captioning systems, revealing their vulnerabilities and providing insights into the robustness of visual language grounding.
Contribution
The paper presents a novel adversarial attack method for neural image captioning, demonstrating high transferability and exposing robustness issues in current models.
Findings
Adversarial examples can successfully mislead captioning systems
Generated adversarial images are highly transferable across models
The approach reveals significant robustness vulnerabilities
Abstract
Visual language grounding is widely studied in modern neural image captioning systems, which typically adopts an encoder-decoder framework consisting of two principal components: a convolutional neural network (CNN) for image feature extraction and a recurrent neural network (RNN) for language caption generation. To study the robustness of language grounding to adversarial perturbations in machine vision and perception, we propose Show-and-Fool, a novel algorithm for crafting adversarial examples in neural image captioning. The proposed algorithm provides two evaluation approaches, which check whether neural image captioning systems can be mislead to output some randomly chosen captions or keywords. Our extensive experiments show that our algorithm can successfully craft visually-similar adversarial examples with randomly targeted captions or keywords, and the adversarial examples can…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsMultimodal Machine Learning Applications · Human Pose and Action Recognition · Domain Adaptation and Few-Shot Learning
