SLAN: Self-Locator Aided Network for Cross-Modal Understanding
Jiang-Tian Zhai, Qi Zhang, Tong Wu, Xing-Yu Chen, Jiang-Jiang Liu, Bo Ren, Ming-Ming Cheng

TL;DR
SLAN is a novel network that enhances cross-modal understanding by localizing key image regions based on text, improving performance on multiple tasks without requiring additional annotated data.
Contribution
SLAN introduces a region filter and adaptor that localize and update image regions conditioned on text, enabling better semantic alignment without extra gold data.
Findings
Achieves state-of-the-art results on five cross-modal tasks
Demonstrates strong zero-shot transferability
Outperforms previous methods in image-text retrieval
Abstract
Learning the fine-grained interplay between vision and language allows a more accurate understanding of vision-language tasks. However, it remains challenging to extract key image regions according to the texts for semantic alignment. Most existing works are either limited by text-agnostic and redundant regions obtained with frozen detectors, or fail to scale further because of their heavy reliance on scarce grounding (gold) data to pre-train detectors. To solve these problems, we propose the Self-Locator Aided Network (SLAN) for cross-modal understanding tasks without any extra gold data. SLAN consists of a region filter and a region adaptor to localize regions of interest conditioned on different texts. By aggregating cross-modal information, the region filter selects key regions and the region adaptor updates their coordinates with text guidance. With detailed region-word alignments,…
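The two-stage idea in the abstract (select key regions given the text, then adjust their coordinates with text guidance) can be illustrated with a toy sketch. This is not the authors' implementation: the dot-product scoring, the `tanh` gate, the `step` parameter, the `(cx, cy, w, h)` box format, and the function names `region_filter` and `region_adaptor` are all assumptions made for illustration. The actual model would use learned cross-modal layers.

```python
import math


def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def region_filter(region_feats, text_feat, k):
    """Score each region against the text feature and keep the top-k indices.

    Assumption: similarity is a raw dot product; a real model would use
    learned cross-modal attention instead.
    """
    scores = [dot(r, text_feat) for r in region_feats]
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:k], scores


def region_adaptor(box, region_feat, text_feat, step=0.05):
    """Rescale a normalized (cx, cy, w, h) box using a text-conditioned gate.

    Hypothetical update rule: the gate is tanh of the region-text similarity,
    and the width and height change by at most `step` (relative), clamped
    to [0, 1]. A real adaptor would regress offsets with learned layers.
    """
    gate = math.tanh(dot(region_feat, text_feat))
    cx, cy, w, h = box
    w_new = min(1.0, max(0.0, w * (1.0 + step * gate)))
    h_new = min(1.0, max(0.0, h * (1.0 + step * gate)))
    return (cx, cy, w_new, h_new)


if __name__ == "__main__":
    # Toy inputs: three candidate regions with 2-d features and one text feature.
    feats = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
    boxes = [(0.5, 0.5, 0.2, 0.2), (0.3, 0.7, 0.1, 0.1), (0.6, 0.4, 0.3, 0.3)]
    text = [1.0, 0.0]

    kept, _ = region_filter(feats, text, k=2)
    updated = [region_adaptor(boxes[i], feats[i], text) for i in kept]
    print(kept, updated)
```

The filter step discards regions that score poorly against the text before any coordinate update runs, which mirrors the abstract's point about avoiding text-agnostic, redundant regions.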
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Image and Video Retrieval Techniques
