Disentangled Noisy Correspondence Learning
Zhuohang Dang, Minnan Luo, Jihong Wang, Chengyou Jia, Haochen Han, Herun Wan, Guang Dai, Xiaojun Chang, Jingdong Wang

TL;DR
DisNCL is an information-theoretic framework for cross-modal retrieval that disentangles modality-invariant information from modality-exclusive information when training on noisy correspondences, leading to better alignment and retrieval accuracy.
Contribution
DisNCL introduces a new approach using information bottlenecks for robust feature disentanglement and soft matching for noisy multi-modal data, advancing cross-modal retrieval methods.
Findings
Achieves 2% average recall improvement over baselines.
Learns meaningful modality-invariant and modality-exclusive subspaces.
Demonstrates robustness to noisy correspondences through experiments.
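The "average recall" figure above is not defined on this page. As a rough illustration only, the sketch below computes Recall@K the way it is commonly reported in cross-modal retrieval and averages it over both retrieval directions. The function names, the choice of K values, and the assumption that query i matches gallery item i are all hypothetical and may differ from the paper's actual evaluation protocol.

```python
from typing import List, Sequence

Matrix = Sequence[Sequence[float]]


def recall_at_k(sim: Matrix, k: int) -> float:
    """Fraction of queries whose true match is ranked within the top k.

    Assumption: sim[i][j] is the similarity of query i to gallery item j,
    and the ground-truth match of query i is gallery item i (the diagonal).
    """
    hits = 0
    for i, row in enumerate(sim):
        # Rank of the true item = number of gallery items scoring strictly higher.
        rank = sum(1 for score in row if score > row[i])
        hits += rank < k
    return hits / len(sim)


def transpose(sim: Matrix) -> List[List[float]]:
    """Swap query and gallery roles (e.g. image-to-text becomes text-to-image)."""
    return [list(col) for col in zip(*sim)]


def average_recall(sim: Matrix, ks: Sequence[int] = (1, 5, 10)) -> float:
    """Mean of Recall@K over both retrieval directions and all K in ks."""
    scores = [recall_at_k(m, k) for m in (sim, transpose(sim)) for k in ks]
    return sum(scores) / len(scores)


# Toy 3x3 similarity matrix: query 1 is confused by gallery item 0.
toy_sim = [
    [0.9, 0.1, 0.2],
    [0.8, 0.3, 0.1],
    [0.2, 0.1, 0.7],
]
toy_r1 = recall_at_k(toy_sim, 1)
toy_avg = average_recall(toy_sim, ks=(1, 2))
```

In the toy matrix, one of the three row-direction queries ranks a wrong item first, so Recall@1 in that direction falls below 1 while the transposed direction is unaffected. Under this kind of metric, a mismatched training pair would hurt a model only indirectly, through the similarity scores it learns to produce.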
Abstract
Cross-modal retrieval is crucial in understanding latent correspondences across modalities. However, existing methods implicitly assume well-matched training data, which is impractical as real-world data inevitably involves imperfect alignments, i.e., noisy correspondences. Although some works explore similarity-based strategies to address such noise, they suffer from sub-optimal similarity predictions influenced by modality-exclusive information (MEI), e.g., background noise in images and abstract definitions in texts. This issue arises as MEI is not shared across modalities, thus aligning it in training can markedly mislead similarity predictions. Moreover, although intuitive, directly applying previous cross-modal disentanglement methods suffers from limited noise tolerance and disentanglement efficacy. Inspired by the robustness of information bottlenecks against noise, we introduce…
Taxonomy
Topics: Face and Expression Recognition · Speech Recognition and Synthesis · Speech and Audio Processing
Methods: Multi-partition Embedding Interaction
