AlignCAT: Visual-Linguistic Alignment of Category and Attribute for Weakly Supervised Visual Grounding

Yidan Wang; Chenyi Zhuang; Wutao Liu; Pan Gao; Nicu Sebe

arXiv:2508.03201 · cs.CV · October 28, 2025

TL;DR

AlignCAT introduces a novel weakly supervised visual grounding framework that improves cross-modal reasoning by combining coarse- and fine-grained semantic alignment, leading to better object localization in images based on text descriptions.

Contribution

The paper proposes AlignCAT, a new query-based framework with dual alignment modules that enhance visual-linguistic matching for weakly supervised visual grounding.

Findings

01. Outperforms existing methods on RefCOCO, RefCOCO+, and RefCOCOg benchmarks.

02. Effectively distinguishes subtle semantic differences in text descriptions.

03. Improves contrastive learning efficiency through progressive filtering.

Abstract

Weakly supervised visual grounding (VG) aims to locate objects in images based on text descriptions. Despite significant progress, existing methods lack strong cross-modal reasoning to distinguish subtle semantic differences in text expressions due to category-based and attribute-based ambiguity. To address these challenges, we introduce AlignCAT, a novel query-based semantic matching framework for weakly supervised VG. To enhance visual-linguistic alignment, we propose a coarse-grained alignment module that utilizes category information and global context, effectively mitigating interference from category-inconsistent objects. Subsequently, a fine-grained alignment module leverages descriptive information and captures word-level text features to achieve attribute consistency. By exploiting linguistic cues to their fullest extent, our proposed AlignCAT progressively filters out…
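The coarse-then-fine idea in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the function names (`coarse_filter`, `fine_score`, `ground`), the hand-made 3-dimensional feature vectors, the cosine-similarity scoring, and the `keep` parameter are all assumptions chosen for illustration. The sketch only shows the general pattern of first discarding category-inconsistent candidates and then ranking the survivors by word-level attribute agreement.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def coarse_filter(regions, category_vec, keep):
    """Coarse stage (assumed form): keep the `keep` regions closest to the category feature."""
    ranked = sorted(range(len(regions)),
                    key=lambda i: cosine(regions[i], category_vec),
                    reverse=True)
    return ranked[:keep]

def fine_score(region, word_vecs):
    """Fine stage (assumed form): mean similarity between a region and each attribute word."""
    return sum(cosine(region, w) for w in word_vecs) / len(word_vecs)

def ground(regions, category_vec, word_vecs, keep=2):
    """Return the index of the surviving region with the best attribute score."""
    kept = coarse_filter(regions, category_vec, keep)
    return max(kept, key=lambda i: fine_score(regions[i], word_vecs))

# Hypothetical features with axes (dog-ness, brown-ness, cat-ness).
REGIONS = [
    [1.0, 0.1, 0.0],  # 0: dog, barely brown
    [1.0, 1.0, 0.0],  # 1: brown dog
    [0.0, 1.0, 1.0],  # 2: brown cat
    [0.0, 0.0, 1.0],  # 3: cat
]
CATEGORY = [1.0, 0.0, 0.0]   # category cue for the query "brown dog": dog
WORDS = [[0.0, 1.0, 0.0]]    # attribute cue: brown

print(ground(REGIONS, CATEGORY, WORDS))
```

In this toy setup the brown cat (region 2) matches the attribute "brown" as well as the brown dog does, so attribute scoring alone could not separate them. Filtering by category first removes it from consideration, which is the kind of category-based ambiguity the abstract describes.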

Taxonomy

Topics: Multimodal Machine Learning Applications · Image Retrieval and Classification Techniques