Discovering Meaningful Units with Visually Grounded Semantics from Image Captions
Melika Behjati, James Henderson

TL;DR
This paper introduces a model that groups caption tokens to better align language with visual objects, enhancing fine-grained understanding in vision-language models and discovering meaningful, groundable phrases.
Contribution
It proposes a novel token grouping approach that improves fine-grained vision-language understanding and aligns language groups with visual objects.
Findings
Token grouping improves fine-grained alignment
Discovered groups match groundable phrases
Enhanced understanding of scene details
Abstract
Fine-grained knowledge is crucial for vision-language models to obtain a better understanding of the real world. While there has been work trying to acquire this kind of knowledge in the space of vision and language, it has mostly focused on aligning the image patches with the tokens on the language side. However, image patches do not have any meaning to the human eye, and individual tokens do not necessarily carry groundable information in the image. It is groups of tokens which describe different aspects of the scene. In this work, we propose a model which groups the caption tokens as part of its architecture in order to capture a fine-grained representation of the language. We expect our representations to be at the level of objects present in the image, and therefore align our representations with the output of an image encoder trained to discover objects. We show that by learning…
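The grouping-and-alignment idea described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' architecture: the soft assignment through hypothetical learned `queries`, the weighted-average group representations, and the cosine-similarity matching against object vectors are all assumptions made for illustration, and the random arrays stand in for real text and image encoder outputs.

```python
import numpy as np

def softmax(x, axis):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def group_tokens(tokens, queries):
    """Softly assign caption tokens to groups (illustrative assumption).

    tokens:  (T, D) token embeddings from some text encoder.
    queries: (K, D) hypothetical learned group queries.
    Returns the (T, K) assignment matrix and (K, D) group representations.
    """
    scores = tokens @ queries.T / np.sqrt(tokens.shape[1])
    assign = softmax(scores, axis=1)  # each token spreads its mass over the K groups
    # Normalise per group so each group vector is a weighted average of tokens.
    weights = assign / (assign.sum(axis=0, keepdims=True) + 1e-8)
    groups = weights.T @ tokens
    return assign, groups

def cosine_similarity(a, b):
    # Pairwise cosine similarity between the rows of a and the rows of b.
    a = a / (np.linalg.norm(a, axis=1, keepdims=True) + 1e-8)
    b = b / (np.linalg.norm(b, axis=1, keepdims=True) + 1e-8)
    return a @ b.T

rng = np.random.default_rng(0)
tokens = rng.normal(size=(6, 8))   # stand-in for 6 caption token embeddings
queries = rng.normal(size=(3, 8))  # stand-in for 3 group queries
objects = rng.normal(size=(4, 8))  # stand-in for 4 object vectors from an image encoder

assign, groups = group_tokens(tokens, queries)
sim = cosine_similarity(groups, objects)  # (3, 4) group-to-object similarity
best_object = sim.argmax(axis=1)          # most similar object for each token group
```

In a trained model the similarity matrix would feed an alignment loss rather than a simple argmax; the argmax here only shows how each token group could be paired with one object representation.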
Taxonomy
Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Natural Language Processing Techniques
