Teaching Structured Vision&Language Concepts to Vision&Language Models
Sivan Doveh, Assaf Arbelle, Sivan Harary, Rameswar Panda, Roei Herzig, Eli Schwartz, Donghyun Kim, Raja Giryes, Rogerio Feris, Shimon Ullman, Leonid Karlinsky

TL;DR
This paper enhances vision-language models' understanding of complex concepts like object attributes and relations by leveraging language structure, leading to significant improvements without additional data collection.
Contribution
It introduces a data-driven method that uses language structure understanding to improve VL models' grasp of Structured Vision&Language Concepts (SVLCs) without extra datasets.
Findings
Up to 15% improvement in SVLC understanding.
Minimal impact on zero-shot performance.
Effective use of existing datasets for training.
Abstract
Vision and Language (VL) models have demonstrated remarkable zero-shot performance in a variety of tasks. However, some aspects of complex language understanding still remain a challenge. We introduce the collective notion of Structured Vision&Language Concepts (SVLC) which includes object attributes, relations, and states which are present in the text and visible in the image. Recent studies have shown that even the best VL models struggle with SVLC. A possible way of fixing this issue is by collecting dedicated datasets for teaching each SVLC type, yet this might be expensive and time-consuming. Instead, we propose a more elegant data-driven approach for enhancing VL models' understanding of SVLCs that makes more effective use of existing VL pre-training datasets and does not require any additional data. While automatic understanding of image structure still remains largely unsolved,…
Taxonomy
Topics: Multimodal Machine Learning Applications · Topic Modeling · Domain Adaptation and Few-Shot Learning
