Learning Visually-Grounded Semantics from Contrastive Adversarial Samples
Haoyue Shi, Jiayuan Mao, Tete Xiao, Yuning Jiang, Jian Sun

TL;DR
This paper improves visual-semantic embeddings by augmenting the MS-COCO captioning data with textual contrastive adversarial samples. The augmentation improves grounding of textual semantics to visual concepts and makes the model more robust to adversarial attacks.
Contribution
It introduces a novel data augmentation method using linguistically-informed contrastive adversarial samples to improve visual-semantic grounding models.
Findings
Significant performance improvement on downstream tasks
Enhanced robustness against adversarial attacks
Better grounding of textual semantics to visual concepts
Abstract
We study the problem of grounding distributional representations of texts on the visual domain, namely visual-semantic embeddings (VSE for short). Beginning with an insightful adversarial attack on VSE embeddings, we show the limitations of current frameworks and image-text datasets (e.g., MS-COCO) both quantitatively and qualitatively. The large gap between the number of possible constitutions of real-world semantics and the size of parallel data, to a large extent, restricts the model's ability to establish the link between textual semantics and visual concepts. We alleviate this problem by augmenting the MS-COCO image captioning datasets with textual contrastive adversarial samples. These samples are synthesized using linguistic rules and the WordNet knowledge base. The construction procedure is both syntax- and semantics-aware. The samples force the model to ground learned embeddings to concrete…
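As a rough illustration of the augmentation idea in the abstract, the sketch below builds contrastive captions by swapping a single noun for a word that names a different concept. It is a minimal sketch, not the authors' procedure: the table `CONTRASTIVE_NOUNS` and the function `contrastive_captions` are hypothetical names, the hand-written table stands in for the WordNet lookups the abstract mentions, and whitespace tokenization replaces the syntax-aware rules.

```python
# Hypothetical stand-in for WordNet lookups (an assumption, not from the paper):
# each noun maps to words of a similar category that name a different concept.
CONTRASTIVE_NOUNS = {
    "dog": ["cat", "horse"],
    "cat": ["dog", "bird"],
    "man": ["woman"],
}


def contrastive_captions(caption):
    """Return captions that differ from `caption` in exactly one noun."""
    tokens = caption.split()  # naive whitespace tokenization (assumption)
    samples = []
    for i, token in enumerate(tokens):
        # Words missing from the table yield no substitutes.
        for substitute in CONTRASTIVE_NOUNS.get(token.lower(), []):
            samples.append(" ".join(tokens[:i] + [substitute] + tokens[i + 1:]))
    return samples


if __name__ == "__main__":
    for sample in contrastive_captions("a dog chases a cat"):
        print(sample)
```

Each generated caption no longer matches the original image, so it can serve as an extra negative example when training an image-text matching model.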
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Topic Modeling
