Cross-Modal Attribute Insertions for Assessing the Robustness of Vision-and-Language Learning

Shivaen Ramshetty; Gaurav Verma; Srijan Kumar

arXiv:2306.11065·cs.CL·June 21, 2023·1 cites

TL;DR

This paper introduces cross-modal attribute insertions as a realistic perturbation method to evaluate the robustness of vision-and-language models, revealing significant performance drops and emphasizing the importance of multimodal data augmentation.

Contribution

It proposes a novel, controllable, and task-agnostic method for inserting visual attributes into text to assess model robustness in multimodal tasks.

Findings

01

The performance of state-of-the-art models drops by 15-20% when the input text is augmented with cross-modal attribute insertions.

02

Crowd-sourced annotations indicate that cross-modal insertions produce higher-quality augmentations than text-only methods.

03

The approach is modular, controllable, and applicable across different tasks.

Abstract

The robustness of multimodal deep learning models to realistic changes in the input text is critical for their applicability to important tasks such as text-to-image retrieval and cross-modal entailment. To measure robustness, several existing approaches edit the text data, but do so without leveraging the cross-modal information present in multimodal data. Information from the visual modality, such as color, size, and shape, provides additional attributes that users can include in their inputs. Thus, we propose cross-modal attribute insertions as a realistic perturbation strategy for vision-and-language data that inserts visual attributes of the objects in the image into the corresponding text (e.g., "girl on a chair" to "little girl on a wooden chair"). Our proposed approach for cross-modal attribute insertions is modular, controllable, and task-agnostic. We find that augmenting input…
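
The insertion idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation (see the repository listed under Code & Models for that). It assumes the object-to-attribute mapping has already been extracted from the image by some detector, and it matches caption words to object names by exact string comparison. The function name `insert_attributes`, the `object_attributes` dictionary, and the `max_insertions` parameter are all hypothetical names chosen for illustration.

```python
import re


def insert_attributes(caption, object_attributes, max_insertions=None):
    """Insert one visual attribute before the first mention of each object.

    caption: the original text, e.g. "girl on a chair".
    object_attributes: hypothetical mapping from object name to one attribute,
        assumed to come from an image-side detector (hard-coded here).
    max_insertions: optional cap on the number of insertions, a simple stand-in
        for the controllability the abstract mentions.
    """
    output_words = []
    used_objects = set()
    for word in caption.split():
        # Normalise the token so "Chair," still matches the key "chair".
        key = re.sub(r"\W+", "", word).lower()
        under_cap = max_insertions is None or len(used_objects) < max_insertions
        if key in object_attributes and key not in used_objects and under_cap:
            attribute = object_attributes[key]
            previous = output_words[-1].lower() if output_words else ""
            # Skip the insertion if the attribute already precedes the object.
            if previous != attribute.lower():
                output_words.append(attribute)
                used_objects.add(key)
        output_words.append(word)
    # Note: this sketch does not adjust articles ("a" vs "an") after insertion.
    return " ".join(output_words)


# Assumed detector output for the abstract's example image.
detected = {"girl": "little", "chair": "wooden"}
augmented = insert_attributes("girl on a chair", detected)
print(augmented)  # little girl on a wooden chair
```

Setting `max_insertions=1` in this sketch limits the perturbation to the first matched object, which is one simple way to vary the strength of the augmentation.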

Code & Models

Repositories

claws-lab/multimodal-robustness-xmai
PyTorch · Official

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Topic Modeling