Attacking Visual Language Grounding with Adversarial Examples: A Case Study on Neural Image Captioning

Hongge Chen; Huan Zhang; Pin-Yu Chen; Jinfeng Yi; Cho-Jui Hsieh

arXiv:1712.02051 · cs.CV · May 23, 2018 · 20 cites

TL;DR

This paper introduces Show-and-Fool, an algorithm for generating adversarial examples that can mislead neural image captioning systems, revealing their vulnerabilities and providing insights into the robustness of visual language grounding.

Contribution

The paper presents a novel adversarial attack method for neural image captioning, demonstrating high transferability and exposing robustness issues in current models.

Findings

01

Adversarial examples can successfully mislead captioning systems

02

Generated adversarial images are highly transferable across models

03

The approach reveals significant robustness vulnerabilities

Abstract

Visual language grounding is widely studied in modern neural image captioning systems, which typically adopt an encoder-decoder framework consisting of two principal components: a convolutional neural network (CNN) for image feature extraction and a recurrent neural network (RNN) for language caption generation. To study the robustness of language grounding to adversarial perturbations in machine vision and perception, we propose Show-and-Fool, a novel algorithm for crafting adversarial examples in neural image captioning. The proposed algorithm provides two evaluation approaches, which check whether neural image captioning systems can be misled to output some randomly chosen captions or keywords. Our extensive experiments show that our algorithm can successfully craft visually similar adversarial examples with randomly targeted captions or keywords, and the adversarial examples can…
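
The targeted-caption idea in the abstract can be illustrated with a deliberately tiny sketch. Everything here is an assumption made for illustration and is not the paper's method: the "captioner" is a random linear map `W` rather than a CNN+RNN, the objective is a generic cross-entropy toward the target tokens plus an L2 penalty weighted by a hypothetical constant `c`, and the names `toy_caption` and `targeted_attack` are invented. It only shows the general shape of a targeted attack: optimize a perturbation `delta` so the model's output becomes a chosen token sequence.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # stabilise the exponentials
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def toy_caption(x, W):
    """Greedy 'caption' of a toy linear model: best token id at each of T steps.

    W has shape (T, V, d): T caption positions, V vocabulary entries, d pixels.
    This is a hypothetical stand-in for a real CNN+RNN captioner.
    """
    return (W @ x).argmax(axis=-1)

def targeted_attack(x, W, target, c=0.01, lr=0.005, steps=2000):
    """Gradient descent on  sum_t CE(softmax(W_t (x + delta)), target_t) + c * ||delta||^2.

    The loss form, c, lr and steps are illustrative assumptions.
    """
    V = W.shape[1]
    onehot = np.eye(V)[target]               # (T, V) one-hot target tokens
    delta = np.zeros_like(x)
    for _ in range(steps):
        probs = softmax(W @ (x + delta))     # (T, V) token probabilities
        # Gradient of the summed cross-entropy with respect to the input.
        grad_ce = np.einsum("tv,tvd->d", probs - onehot, W)
        delta -= lr * (grad_ce + 2.0 * c * delta)
    return delta

rng = np.random.default_rng(0)
T, V, d = 3, 5, 64
W = rng.normal(size=(T, V, d))
x = rng.uniform(0.0, 1.0, size=d)            # toy "image" with pixels in [0, 1]
target = (toy_caption(x, W) + 1) % V         # a caption that differs at every step
delta = targeted_attack(x, W, target)

print("original caption ids:   ", toy_caption(x, W))
print("target caption ids:     ", target)
print("adversarial caption ids:", toy_caption(x + delta, W))
print("L2 norm of perturbation:", np.linalg.norm(delta))
```

The penalty weight `c` trades attack strength against perturbation size in this toy setting; a real attack would also need to keep `x + delta` inside the valid pixel range, which this sketch omits.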

Taxonomy

Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Domain Adaptation and Few-Shot Learning