Cross-Modal Attribute Insertions for Assessing the Robustness of Vision-and-Language Learning
Shivaen Ramshetty, Gaurav Verma, Srijan Kumar

TL;DR
This paper introduces cross-modal attribute insertions as a realistic perturbation method to evaluate the robustness of vision-and-language models, revealing significant performance drops and emphasizing the importance of multimodal data augmentation.
Contribution
It proposes a novel, controllable, and task-agnostic method for inserting visual attributes into text to assess model robustness in multimodal tasks.
Findings
State-of-the-art models' performance drops by 15-20% when the input text is augmented with cross-modal attribute insertions.
Crowd-sourced annotations indicate that the resulting augmentations are of higher quality than those produced by text-only methods.
The approach is modular, controllable, and applicable across different tasks.
Abstract
The robustness of multimodal deep learning models to realistic changes in the input text is critical for their applicability to important tasks such as text-to-image retrieval and cross-modal entailment. To measure robustness, several existing approaches edit the text data, but do so without leveraging the cross-modal information present in multimodal data. Information from the visual modality, such as color, size, and shape, provides additional attributes that users can include in their inputs. Thus, we propose cross-modal attribute insertions as a realistic perturbation strategy for vision-and-language data that inserts visual attributes of the objects in the image into the corresponding text (e.g., "girl on a chair" to "little girl on a wooden chair"). Our proposed approach for cross-modal attribute insertions is modular, controllable, and task-agnostic. We find that augmenting input…
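The insertion step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `insert_attributes`, the `max_insertions` parameter, and the `object_attributes` mapping are all hypothetical, and the mapping is assumed to come from some upstream object and attribute detector run on the image, which is not shown here.

```python
import re


def insert_attributes(caption, object_attributes, max_insertions=None):
    """Insert a visual attribute before the first mention of each object.

    `object_attributes` maps an object word to one attribute (hypothetical
    input, assumed to be produced by an image-side detector).
    """
    result = caption
    inserted = 0
    for obj, attribute in object_attributes.items():
        if max_insertions is not None and inserted >= max_insertions:
            break
        # Match the object as a whole word, case-insensitively.
        match = re.search(r"\b" + re.escape(obj) + r"\b", result, flags=re.IGNORECASE)
        if match is None:
            continue  # object is not mentioned in the text
        # Skip if the attribute already directly precedes the object.
        preceding = result[:match.start()].lower().split()
        if preceding and preceding[-1] == attribute.lower():
            continue
        result = result[:match.start()] + attribute + " " + result[match.start():]
        inserted += 1
    return result


# Hypothetical detector output for the example in the abstract.
detected = {"girl": "little", "chair": "wooden"}
print(insert_attributes("girl on a chair", detected))
# little girl on a wooden chair
```

The `max_insertions` cap is one simple way to make the perturbation controllable. The sketch does not adjust articles ("a" versus "an") or choose among several candidate attributes, both of which a realistic pipeline would need to handle.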
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Topic Modeling
