See What I Mean: Aligning Vision and Language Representations for Video Fine-grained Object Understanding

Boyuan Sun; Bowen Yin; Yuanming Li; Xihan Wei; Qibin Hou

arXiv:2605.18018 · cs.CV · May 19, 2026


TL;DR

SWIM is a training strategy that aligns vision and language representations so that a model can perform fine-grained object understanding from text prompts alone, with no explicit visual prompts (such as masks or points) needed at inference.

Contribution

The paper introduces SWIM, a method that uses mask supervision only during training to guide cross-modal attention and improve vision-language alignment in multimodal large language models (MLLMs).

Findings

01. SWIM significantly improves text-visual alignment.

02. SWIM outperforms visual-prompt methods on benchmarks.

03. The NL-Refer dataset pairs each object mask with a precise natural language object reference.

Abstract

We present SWIM (See What I Mean), a novel training strategy that aligns vision and language representations to enable fine-grained object understanding solely from textual prompts. Unlike existing approaches that require explicit visual prompts, such as masks or points, SWIM leverages mask supervision only during training to guide cross-modal attention, allowing the model to automatically attend to the user-specified object at inference. Our cross-attention analysis of pretrained multimodal large language models (MLLMs) reveals a systematic discrepancy: attribute words produce sharp, localized activations in the visual modality, whereas object nouns yield diffuse and scattered patterns due to semantic reference bias and distributed high-level representations. To address this misalignment, we construct NL-Refer, an enriched dataset, in which each object mask is paired with a precise…
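To make the idea of mask-supervised attention concrete, here is a minimal sketch in plain Python. It is not the paper's loss: the abstract does not give the objective, so the function `mask_attention_loss`, its negative-log form, and the toy patch layout are all assumptions chosen for illustration. The sketch takes one text token's attention logits over visual patches and penalizes attention mass that falls outside a binary object mask, so diffuse attention scores worse than attention focused on the object.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def mask_attention_loss(attn_logits, mask, eps=1e-8):
    """Hypothetical loss: negative log of the attention mass inside the mask.

    attn_logits: one text token's raw attention scores, one per visual patch.
    mask: 0/1 flags per patch, 1 where the patch belongs to the target object.
    """
    if len(attn_logits) != len(mask):
        raise ValueError("attn_logits and mask must have the same length")
    attn = softmax(attn_logits)
    inside = sum(a for a, m in zip(attn, mask) if m)
    return -math.log(inside + eps)


# Toy example (assumed layout): 6 patches, the object covers patches 2 and 3.
mask = [0, 0, 1, 1, 0, 0]
diffuse_logits = [0.0, 0.0, 0.0, 0.0, 0.0, 0.0]   # attention spread evenly
focused_logits = [0.0, 0.0, 4.0, 4.0, 0.0, 0.0]   # attention on the object

diffuse_loss = mask_attention_loss(diffuse_logits, mask)
focused_loss = mask_attention_loss(focused_logits, mask)
print(f"diffuse: {diffuse_loss:.4f}  focused: {focused_loss:.4f}")
```

In this toy setup the focused pattern yields the lower loss, which mirrors the stated goal of steering object-noun attention toward the masked region during training while leaving inference free of visual prompts.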


Code & Models

Repositories

HumanMLLM/SWIM
github

Models

BBBBCHAN/SWIM-7B
model · 39 downloads · 2 likes

Datasets

BBBBCHAN/NL-Refer
dataset · 260 downloads
