See What I Mean: Aligning Vision and Language Representations for Video Fine-grained Object Understanding
Boyuan Sun, Bowen Yin, Yuanming Li, Xihan Wei, Qibin Hou

TL;DR
SWIM is a training strategy that aligns vision and language representations to improve fine-grained object understanding from text prompts without requiring explicit visual prompts during inference.
Contribution
The paper introduces SWIM, a novel method that uses mask supervision during training to enhance cross-modal attention and alignment in multimodal models.
Findings
SWIM significantly improves text-visual alignment.
It outperforms visual-prompt methods on benchmarks.
It is trained with NL-Refer, an enriched dataset that pairs each object mask with a precise natural-language object reference.
Abstract
We present SWIM (See What I Mean), a novel training strategy that aligns vision and language representations to enable fine-grained object understanding solely from textual prompts. Unlike existing approaches that require explicit visual prompts, such as masks or points, SWIM leverages mask supervision only during training to guide cross-modal attention, allowing the model to automatically attend to the user-specified object at inference. Our cross-attention analysis of pretrained multimodal large language models (MLLMs) reveals a systematic discrepancy: attribute words produce sharp, localized activations in the visual modality, whereas object nouns yield diffuse and scattered patterns due to semantic reference bias and distributed high-level representations. To address this misalignment, we construct NL-Refer, an enriched dataset, in which each object mask is paired with a precise…
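To make "mask supervision guiding cross-modal attention" concrete, here is a minimal sketch of one way such a training signal could be written. It is not the authors' loss: the abstract does not specify the objective. The function name `attention_mask_alignment_loss`, the cross-entropy form, and the patch-grid shapes are all assumptions chosen for illustration. The idea is simply to penalize a text token whose attention over visual patches falls outside the object mask.

```python
import numpy as np

def attention_mask_alignment_loss(attn, mask, eps=1e-8):
    """Hypothetical alignment loss (an assumption, not the paper's objective).

    attn: (H, W) non-negative attention weights from one text token
          to the visual patches.
    mask: (H, W) binary object mask at patch resolution.
    Returns the cross-entropy between the mask-derived target
    distribution and the normalized attention map.
    """
    attn = np.asarray(attn, dtype=np.float64)
    mask = np.asarray(mask, dtype=np.float64)
    p = attn / (attn.sum() + eps)   # attention as a distribution over patches
    q = mask / (mask.sum() + eps)   # uniform target over the masked patches
    return float(-(q * np.log(p + eps)).sum())

# Toy 4x4 patch grid with the object occupying the central 2x2 block.
mask = np.zeros((4, 4))
mask[1:3, 1:3] = 1.0

focused = mask + 0.01        # attention concentrated on the object
diffuse = np.ones((4, 4))    # attention spread evenly over all patches

loss_focused = attention_mask_alignment_loss(focused, mask)
loss_diffuse = attention_mask_alignment_loss(diffuse, mask)
print(loss_focused, loss_diffuse)
```

In this toy case the diffuse map, which resembles the scattered pattern the abstract reports for object nouns, receives the larger loss, so minimizing it would push attention toward the masked region. The mask is used only to compute the loss, which matches the abstract's point that no mask is needed at inference.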
