Weak Supervision helps Emergence of Word-Object Alignment and improves Vision-Language Tasks
Corentin Kervadec (LIRIS), Grigory Antipov, Moez Baccouche, Christian Wolf (LIRIS)

TL;DR
This paper demonstrates that incorporating weak supervision for object-word alignment into vision-language models enhances their ability to learn fine-grained inter-modality relationships, leading to state-of-the-art results on VQA and image comparison tasks.
Contribution
The authors introduce an object-word alignment loss that improves inter-modality reasoning in vision-language models, surpassing previous state-of-the-art performances without additional fine-tuning.
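As a rough illustration of what an object-word alignment loss can look like, the sketch below scores every word against every detected object, turns each word's scores into a distribution over objects, and penalizes disagreement with a weak alignment target. This is a minimal sketch under stated assumptions, not the authors' implementation: the function names (softmax, alignment_loss), the use of cross-entropy, and the format of the target are all hypothetical, and this summary does not specify how the paper builds its targets or defines its loss.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def alignment_loss(similarity, target):
    """Mean cross-entropy between predicted and weak word-object alignments.

    similarity[i][j]: model score between word i and object j (assumed input).
    target[i][j]: weak alignment distribution for word i over objects;
                  each row is assumed to sum to 1 (hypothetical format).
    """
    total = 0.0
    for scores, tgt in zip(similarity, target):
        probs = softmax(scores)
        # Only target entries with non-zero mass contribute to the loss.
        total += -sum(t * math.log(p) for t, p in zip(tgt, probs) if t > 0)
    return total / len(similarity)


# Two words, two objects: word 0 scores highest on object 0, word 1 on object 1.
similarity = [[2.0, 0.0], [0.0, 2.0]]
matching = [[1.0, 0.0], [0.0, 1.0]]
swapped = [[0.0, 1.0], [1.0, 0.0]]

loss_matching = alignment_loss(similarity, matching)
loss_swapped = alignment_loss(similarity, swapped)
```

In this toy case the loss is lower when the weak target agrees with the model's scores than when it is swapped. In a setup like this, such a term would typically be added to the task loss during training so that alignment is supervised directly instead of being left to emerge implicitly.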
Findings
Improved performance on VQA and NLVR2 datasets.
State-of-the-art results achieved without fine-tuning.
Enhanced attention alignment visualizations.
Abstract
The large adoption of the self-attention (i.e. transformer model) and BERT-like training principles has recently resulted in a number of high performing models on a large panoply of vision-and-language problems (such as Visual Question Answering (VQA), image retrieval, etc.). In this paper we claim that these State-Of-The-Art (SOTA) approaches perform reasonably well in structuring information inside a single modality but, despite their impressive performances, they tend to struggle to identify fine-grained inter-modality relationships. Indeed, such relations are frequently assumed to be implicitly learned during training from application-specific losses, mostly cross-entropy for classification. While most recent works provide inductive bias for inter-modality relationships via cross attention modules, in this work, we demonstrate (1) that the latter assumption does not hold, i.e.…
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Image and Video Retrieval Techniques
Methods: Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Residual Connection · Byte Pair Encoding · Dense Connections · Label Smoothing · Adam · Softmax
