Disentangled Noisy Correspondence Learning

Zhuohang Dang; Minnan Luo; Jihong Wang; Chengyou Jia; Haochen Han; Herun Wan; Guang Dai; Xiaojun Chang; Jingdong Wang

arXiv:2408.05503·cs.CV·August 13, 2024

Open Access

TL;DR

DisNCL is a novel information-theoretic framework that improves cross-modal retrieval by effectively disentangling meaningful content from noise in noisy correspondence data, leading to better alignment and retrieval accuracy.

Contribution

DisNCL introduces a new approach using information bottlenecks for robust feature disentanglement and soft matching for noisy multi-modal data, advancing cross-modal retrieval methods.
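For orientation, the classical information bottleneck objective can be written as below. This is the generic textbook form, not DisNCL's own loss, which this summary does not state. The symbols are assumptions made for illustration: X is the input, Y the target, Z the learned representation, and β a trade-off weight between compressing X and keeping what is predictive of Y.

```latex
\min_{p(z \mid x)} \; I(X; Z) \;-\; \beta \, I(Z; Y)
```

Intuitively, the first term penalizes information that Z retains about the raw input, while the second rewards information that Z carries about the target; this compression pressure is the usual argument for why bottleneck representations tend to discard nuisance detail.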

Findings

01. Achieves a 2% average recall improvement over baselines.

02. Learns meaningful modality-invariant and modality-exclusive subspaces.

03. Demonstrates robustness to noisy correspondences through experiments.
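The recall figure in the findings refers to retrieval recall. As a rough illustration of how such a metric is commonly computed, the sketch below evaluates Recall@K from a similarity matrix and averages it over both retrieval directions. It assumes a one-to-one pairing in which query i matches gallery item i; the function names `recall_at_k` and `average_recall`, the choice of K values, and the averaging scheme are hypothetical and may differ from the paper's evaluation protocol.

```python
import numpy as np

def recall_at_k(sim, k):
    """Percentage of queries whose true match is ranked in the top k.

    Assumption: sim[i, j] is the similarity between query i and gallery
    item j, and the ground-truth match of query i is gallery item i.
    """
    # Rank gallery items for each query from most to least similar.
    order = np.argsort(-sim, axis=1)
    # A query is a hit if its own index appears among its top-k items.
    targets = np.arange(sim.shape[0])[:, None]
    hits = (order[:, :k] == targets).any(axis=1)
    return 100.0 * hits.mean()

def average_recall(sim, ks=(1, 5, 10)):
    """Mean of Recall@K over both directions (rows->columns, columns->rows)."""
    scores = [recall_at_k(sim, k) for k in ks]
    scores += [recall_at_k(sim.T, k) for k in ks]
    return float(np.mean(scores))

if __name__ == "__main__":
    # Toy 3x3 similarity matrix, made up for illustration only.
    toy = np.array([[0.9, 0.1, 0.0],
                    [0.2, 0.8, 0.3],
                    [0.7, 0.1, 0.6]])
    print(recall_at_k(toy, 1), average_recall(toy, ks=(1, 2)))
```

Standard image-text benchmarks often pair each image with several captions, in which case the hit test would need a set of valid indices per query rather than a single diagonal entry.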

Abstract

Cross-modal retrieval is crucial in understanding latent correspondences across modalities. However, existing methods implicitly assume well-matched training data, which is impractical as real-world data inevitably involves imperfect alignments, i.e., noisy correspondences. Although some works explore similarity-based strategies to address such noise, they suffer from sub-optimal similarity predictions influenced by modality-exclusive information (MEI), e.g., background noise in images and abstract definitions in texts. This issue arises as MEI is not shared across modalities, thus aligning it in training can markedly mislead similarity predictions. Moreover, although intuitive, directly applying previous cross-modal disentanglement methods suffers from limited noise tolerance and disentanglement efficacy. Inspired by the robustness of information bottlenecks against noise, we introduce…

Taxonomy

Topics: Face and Expression Recognition · Speech Recognition and Synthesis · Speech and Audio Processing

Methods: Multi-partition Embedding Interaction