Discovering Meaningful Units with Visually Grounded Semantics from Image Captions

Melika Behjati; James Henderson

arXiv:2511.11262·cs.CV·November 17, 2025

TL;DR

This paper introduces a model that groups caption tokens to better align language with visual objects, enhancing fine-grained understanding in vision-language models and discovering meaningful, groundable phrases.

Contribution

The paper proposes a token grouping approach that improves fine-grained vision-language understanding and aligns the resulting language groups with visual objects.

Findings

01 Token grouping improves fine-grained alignment

02 Discovered groups match groundable phrases

03 Enhanced understanding of scene details

Abstract

Fine-grained knowledge is crucial for vision-language models to obtain a better understanding of the real world. While there has been work trying to acquire this kind of knowledge in the space of vision and language, it has mostly focused on aligning the image patches with the tokens on the language side. However, image patches do not have any meaning to the human eye, and individual tokens do not necessarily carry groundable information in the image. It is groups of tokens which describe different aspects of the scene. In this work, we propose a model which groups the caption tokens as part of its architecture in order to capture a fine-grained representation of the language. We expect our representations to be at the level of objects present in the image, and therefore align our representations with the output of an image encoder trained to discover objects. We show that by learning…
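
The abstract describes two ideas: grouping caption tokens into larger units, and aligning those units with object-level image representations. The toy sketch below illustrates that general idea only and is not the authors' architecture, which the abstract does not specify. It assumes soft assignment of tokens to groups through dot products with hypothetical group query vectors, and scores alignment as the best cosine match between each group and a set of object vectors. All names (`group_tokens`, `alignment_score`, `queries`, `objects`) and the random inputs are illustrative assumptions.

```python
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax(xs):
    # Subtract the max for numerical stability.
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    total = sum(es)
    return [e / total for e in es]


def group_tokens(tokens, queries):
    """Softly assign each token vector to one of K groups, then pool.

    tokens:  list of T vectors (stand-ins for caption token embeddings).
    queries: list of K vectors (hypothetical group queries, an assumption).
    Returns (groups, assign): K pooled vectors and a T x K matrix
    whose rows each sum to 1.
    """
    dim = len(tokens[0])
    scale = math.sqrt(dim)
    assign = [softmax([dot(t, q) / scale for q in queries]) for t in tokens]
    groups = []
    for k in range(len(queries)):
        # Total assignment mass of group k, used to form a weighted mean.
        mass = sum(row[k] for row in assign) + 1e-8
        pooled = [
            sum(row[k] * t[d] for row, t in zip(assign, tokens)) / mass
            for d in range(dim)
        ]
        groups.append(pooled)
    return groups, assign


def cosine(a, b):
    na = math.sqrt(dot(a, a)) + 1e-8
    nb = math.sqrt(dot(b, b)) + 1e-8
    return dot(a, b) / (na * nb)


def alignment_score(groups, objects):
    """Average, over groups, of the best cosine match among object vectors."""
    return sum(max(cosine(g, o) for o in objects) for g in groups) / len(groups)


# Random stand-ins for real embeddings (illustration only).
rng = random.Random(0)


def rand_vecs(n, dim):
    return [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n)]


tokens = rand_vecs(6, 16)    # 6 caption tokens
queries = rand_vecs(3, 16)   # 3 hypothetical group queries
objects = rand_vecs(4, 16)   # 4 object vectors from an image encoder

groups, assign = group_tokens(tokens, queries)
score = alignment_score(groups, objects)
print(len(groups), score)
```

In a trained model the queries and embeddings would be learned, and a score of this kind would presumably feed a training objective; here everything is random, so the printed score carries no meaning beyond showing the data flow.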

Taxonomy

Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Natural Language Processing Techniques