PatchCT: Aligning Patch Set and Label Set with Conditional Transport for Multi-Label Image Classification
Miaoge Li, Dongsheng Wang, Xinyang Liu, Zequn Zeng, Ruiying Lu, Bo Chen, Mingyuan Zhou

TL;DR
PatchCT introduces a novel approach for multi-label image classification by using conditional transport to align image patch embeddings with label embeddings, improving interpretability and performance without complex attention modules.
Contribution
The paper proposes a new method applying conditional transport to align patch and label sets, eliminating the need for complex attention-based alignment modules.
Findings
Outperforms previous methods on three public benchmarks
Provides interpretable visualization of learned prototypes
Efficiently models patch-label interactions via bidirectional CT
Abstract
Multi-label image classification is a prediction task that aims to identify more than one label from a given image. This paper considers the semantic consistency of the latent space between the visual patch and linguistic label domains and introduces the conditional transport (CT) theory to bridge the acknowledged gap. While recent cross-modal attention-based studies have attempted to align these two representations and achieved impressive performance, they required carefully designed alignment modules and extra complex operations in the attention computation. We find that by formulating multi-label classification as a CT problem, we can exploit the interactions between the image and label efficiently by minimizing the bidirectional CT cost. Specifically, after feeding the images and textual labels into the modality-specific encoders, we view each image as a mixture of patch…
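As a rough illustration of the bidirectional CT cost mentioned in the abstract, the NumPy sketch below computes a forward (patch-to-label) and a backward (label-to-patch) transport cost between two sets of embeddings. It is not the authors' implementation. The function name `bidirectional_ct_cost`, the cosine-based cost, the uniform weights on patches and labels, and the softmax form of the conditional transport plans are all assumptions made for illustration.

```python
import numpy as np


def _softmax(z, axis):
    # Numerically stable softmax along the given axis.
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def bidirectional_ct_cost(patches, labels):
    """Hypothetical sketch of a bidirectional conditional transport cost.

    patches: (N, d) array of patch embeddings (assumed uniform weights 1/N)
    labels:  (M, d) array of label embeddings (assumed uniform weights 1/M)
    Returns (total, forward, backward).
    """
    # Assumption: cosine similarity measures affinity; cost = 1 - similarity.
    patches = patches / np.linalg.norm(patches, axis=1, keepdims=True)
    labels = labels / np.linalg.norm(labels, axis=1, keepdims=True)
    sim = patches @ labels.T          # shape (N, M)
    cost = 1.0 - sim                  # non-negative, shape (N, M)

    n, m = sim.shape
    p = np.full(n, 1.0 / n)           # patch weights
    q = np.full(m, 1.0 / m)           # label weights

    # Forward plan: each patch spreads its mass over labels (rows sum to 1).
    fwd_plan = _softmax(sim, axis=1)
    # Backward plan: each label spreads its mass over patches (columns sum to 1).
    bwd_plan = _softmax(sim, axis=0)

    forward = float(np.sum(p[:, None] * fwd_plan * cost))
    backward = float(np.sum(q[None, :] * bwd_plan * cost))
    return forward + backward, forward, backward


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    patch_emb = rng.normal(size=(6, 4))   # 6 toy patches, 4-dim embeddings
    label_emb = rng.normal(size=(3, 4))   # 3 toy labels
    total, fwd, bwd = bidirectional_ct_cost(patch_emb, label_emb)
    print(f"forward={fwd:.4f} backward={bwd:.4f} total={total:.4f}")
```

In this sketch, minimizing the total pulls each patch toward the labels it is most similar to and each label toward its most similar patches, which is the intuition behind aligning the two sets without a dedicated attention module.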
Taxonomy
Topics: Text and Document Classification Technologies · Domain Adaptation and Few-Shot Learning · Multimodal Machine Learning Applications
Methods: ALIGN
