AlignCAT: Visual-Linguistic Alignment of Category and Attribute for Weakly Supervised Visual Grounding
Yidan Wang, Chenyi Zhuang, Wutao Liu, Pan Gao, Nicu Sebe

TL;DR
AlignCAT introduces a novel weakly supervised visual grounding framework that improves cross-modal reasoning by combining coarse- and fine-grained semantic alignment, leading to better object localization in images based on text descriptions.
Contribution
The paper proposes AlignCAT, a new query-based framework with dual alignment modules that enhance visual-linguistic matching for weakly supervised visual grounding.
Findings
Outperforms existing methods on RefCOCO, RefCOCO+, and RefCOCOg benchmarks.
Effectively distinguishes subtle semantic differences in text descriptions.
Improves contrastive learning efficiency through progressive filtering.
Abstract
Weakly supervised visual grounding (VG) aims to locate objects in images based on text descriptions. Despite significant progress, existing methods lack strong cross-modal reasoning to distinguish subtle semantic differences in text expressions due to category-based and attribute-based ambiguity. To address these challenges, we introduce AlignCAT, a novel query-based semantic matching framework for weakly supervised VG. To enhance visual-linguistic alignment, we propose a coarse-grained alignment module that utilizes category information and global context, effectively mitigating interference from category-inconsistent objects. Subsequently, a fine-grained alignment module leverages descriptive information and captures word-level text features to achieve attribute consistency. By exploiting linguistic cues to their fullest extent, our proposed AlignCAT progressively filters out…
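The coarse-then-fine idea described in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function name `coarse_to_fine_select`, the top-k cutoff `keep`, the use of cosine similarity, and averaging over word-level similarities are all assumptions made for illustration, since the abstract does not specify these details. The sketch first ranks candidate object embeddings against a sentence-level text embedding (a stand-in for category and global context), discards the weakest candidates, and then re-scores the survivors against word-level embeddings (a stand-in for attribute cues).

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors to (approximately) unit length along `axis`."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def coarse_to_fine_select(queries, sentence_emb, word_embs, keep=2):
    """Hypothetical two-stage matching, assumed for illustration only.

    queries:      (N, D) candidate object embeddings
    sentence_emb: (D,)   sentence-level text embedding (coarse cue)
    word_embs:    (W, D) word-level text embeddings (fine cue)
    keep:         number of candidates that survive the coarse stage
    """
    q = l2_normalize(queries)
    s = l2_normalize(sentence_emb)
    w = l2_normalize(word_embs)

    # Coarse stage: cosine similarity to the sentence-level embedding,
    # then keep only the `keep` highest-scoring candidates.
    coarse = q @ s                      # shape (N,)
    kept = np.argsort(-coarse)[:keep]   # indices of surviving candidates

    # Fine stage: average cosine similarity to the word-level embeddings,
    # computed only for the surviving candidates.
    fine = (q[kept] @ w.T).mean(axis=1)  # shape (keep,)

    best = int(kept[np.argmax(fine)])
    return best, kept, fine


# Toy data: candidate 0 matches the sentence best, but candidate 1
# matches the word-level (attribute) cues better.
queries = np.array([
    [1.0, 0.0, 0.0, 0.0],
    [0.8, 0.6, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
])
sentence_emb = np.array([1.0, 0.0, 0.0, 0.0])
word_embs = np.array([
    [0.0, 1.0, 0.0, 0.0],
    [0.6, 0.8, 0.0, 0.0],
])

best, kept, fine = coarse_to_fine_select(queries, sentence_emb, word_embs, keep=2)
print(best, kept.tolist())
```

In this toy setup the coarse stage keeps candidates 0 and 1, and the fine stage prefers candidate 1 even though candidate 0 has the higher sentence-level score. That mirrors the abstract's motivation of filtering out category-inconsistent objects first and then resolving attribute-level ambiguity, under the assumptions stated above.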
Taxonomy
Topics: Multimodal Machine Learning Applications · Image Retrieval and Classification Techniques
