TL;DR
This paper introduces TRM-ML, a multi-label image recognition method that improves text-vision matching by combining category-aware region matching, multimodal contrastive learning, and label estimation to handle missing labels.
Contribution
The paper proposes a new approach that enhances cross-modal matching and label estimation in multi-label recognition with missing labels, outperforming existing methods.
Findings
Outperforms state-of-the-art methods on multiple benchmarks.
Effectively handles missing labels through category prototypes.
Improves text-vision semantic alignment with region-based matching.
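To make the prototype finding concrete, here is a minimal sketch of how category prototypes could be used to fill in missing labels. This is not the paper's implementation. The function names, the momentum update, and the similarity threshold are all hypothetical choices made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def update_prototype(prototype, feature, momentum=0.9):
    """Assumed update rule: exponential moving average toward a known-positive feature."""
    return [momentum * p + (1.0 - momentum) * f for p, f in zip(prototype, feature)]


def estimate_missing_labels(labels, category_feats, prototypes, threshold=0.8):
    """Fill in missing labels for one image.

    labels:         one entry per category: 1 (positive), 0 (negative), None (missing)
    category_feats: one feature vector per category for this image
    prototypes:     one prototype vector per category
    threshold:      hypothetical similarity cutoff for a pseudo-positive

    Known labels are kept. A missing label becomes 1 when the image's feature
    for that category is close to the category prototype; otherwise it stays None.
    """
    estimated = []
    for label, feat, proto in zip(labels, category_feats, prototypes):
        if label is not None:
            estimated.append(label)
        elif cosine(feat, proto) >= threshold:
            estimated.append(1)
        else:
            estimated.append(None)
    return estimated


if __name__ == "__main__":
    labels = [1, None, None]
    feats = [[1.0, 0.0], [1.0, 0.1], [0.0, 1.0]]
    protos = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]
    print(estimate_missing_labels(labels, feats, protos))
```

Leaving low-similarity entries as None rather than forcing them to 0 is a deliberately conservative choice in this sketch: an unlabeled category is not treated as a negative without evidence.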
Abstract
Recently, large-scale visual language pre-trained (VLP) models have demonstrated impressive performance across various downstream tasks. Motivated by these advancements, pioneering efforts have emerged in multi-label image recognition with missing labels, leveraging VLP prompt-tuning technology. However, they usually cannot match text and vision features well, due to complicated semantics gaps and missing labels in a multi-label image. To tackle this challenge, we propose Text-Region Matching for optimizing Multi-Label prompt tuning, namely TRM-ML, a novel method for enhancing meaningful cross-modal matching. Compared to existing methods, we advocate exploring the information of category-aware regions rather than the entire image or pixels, which contributes to bridging the semantic gap between textual and visual representations in…
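The idea of matching text against category-aware regions, rather than the whole image or individual pixels, can be sketched as follows. This is an illustrative approximation and not the authors' code. It assumes patch features and per-category text embeddings are already available (for example, from a VLP model). The attention-style pooling and the `temperature` parameter are assumptions introduced here.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(dot(v, v))
    return list(v) if norm == 0.0 else [x / norm for x in v]


def softmax(xs, temperature=1.0):
    """Numerically stable softmax over a list of scores."""
    scaled = [x / temperature for x in xs]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def category_region_feature(text_emb, patch_feats, temperature=0.1):
    """Pool patches into one region feature for a category.

    Each patch is weighted by its softmax-normalized similarity to the
    category's text embedding, so patches relevant to the category dominate.
    """
    text = normalize(text_emb)
    patches = [normalize(p) for p in patch_feats]
    weights = softmax([dot(text, p) for p in patches], temperature)
    dim = len(text)
    return [sum(w * p[d] for w, p in zip(weights, patches)) for d in range(dim)]


def text_region_scores(text_embs, patch_feats, temperature=0.1):
    """Cosine similarity between each category's text embedding and its region feature."""
    scores = []
    for text_emb in text_embs:
        region = category_region_feature(text_emb, patch_feats, temperature)
        scores.append(dot(normalize(text_emb), normalize(region)))
    return scores


if __name__ == "__main__":
    # Two toy patches and three toy category embeddings (2-D for readability).
    patches = [[1.0, 0.0], [0.0, 1.0]]
    texts = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
    print(text_region_scores(texts, patches))
```

In this toy setup, a category whose text embedding aligns with some patch receives a high score even though other patches are unrelated. A single global image feature would dilute that signal.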
Taxonomy
Methods: Contrastive Learning
