Coarse-to-fine Alignment Makes Better Speech-image Retrieval
Lifeng Zhou, Yuke Li

TL;DR
This paper introduces a unified coarse-to-fine framework for speech-image retrieval. It jointly trains speech-image contrastive (SIC) and speech-image matching (SIM) tasks, using an embedding queue for negative sampling and hard-negative mining, plus momentum distillation, to improve accuracy and generalization.
Contribution
The paper presents a novel joint training framework with an embedding queue and momentum distillation for enhanced speech-image retrieval performance.
Findings
Outperforms the state-of-the-art method by more than 4% in R@1 on benchmark datasets.
Demonstrates strong zero-shot generalization capabilities.
Effectively mines hard negatives to improve alignment quality.
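To make the hard-negative finding concrete, here is a minimal sketch of similarity-weighted negative sampling. It assumes that negatives for the matching task are drawn with probability proportional to a softmax over contrastive similarities, with the true pair masked out. The function names and the toy similarity values are hypothetical, and the paper's exact sampling rule is not given in this summary.

```python
import math
import random


def hard_negative_weights(similarities, positive_index):
    """Softmax over similarities with the positive entry masked to zero.

    Assumption: a higher contrastive similarity means a harder negative,
    so it should be sampled more often.
    """
    negatives = [s for i, s in enumerate(similarities) if i != positive_index]
    shift = max(negatives)  # subtract the max for numerical stability
    exps = [
        0.0 if i == positive_index else math.exp(s - shift)
        for i, s in enumerate(similarities)
    ]
    total = sum(exps)
    return [e / total for e in exps]


def sample_hard_negative(similarities, positive_index, rng):
    """Draw one negative index; the positive has zero weight and is never drawn."""
    weights = hard_negative_weights(similarities, positive_index)
    return rng.choices(range(len(similarities)), weights=weights, k=1)[0]


# Toy example: one speech query scored against four images; index 0 is the true pair.
sims = [0.9, 0.8, 0.1, -0.3]
rng = random.Random(0)
negative_index = sample_hard_negative(sims, positive_index=0, rng=rng)
```

Sampling rather than always taking the single most similar negative keeps some diversity in the matching task's training pairs, which is one common reason to prefer this design.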
Abstract
In this paper, we propose a novel framework for speech-image retrieval. We utilize speech-image contrastive (SIC) learning tasks to align speech and image representations at a coarse level and speech-image matching (SIM) learning tasks to further refine the fine-grained cross-modal alignment. SIC and SIM learning tasks are jointly trained in a unified manner. To optimize the learning process, we utilize an embedding queue that facilitates efficient sampling of high-quality and diverse negative representations during SIC learning. Additionally, it enhances the learning of SIM tasks by effectively mining hard negatives based on contrastive similarities calculated in SIC tasks. To further optimize learning under noisy supervision, we incorporate momentum distillation into the training process. Experimental results show that our framework outperforms the state-of-the-art method by more than…
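The coarse-level pieces named in the abstract can be sketched in a few lines of plain Python. This is an illustrative sketch under stated assumptions, not the authors' implementation. It assumes an InfoNCE-style SIC loss for one speech query against its paired image plus queued negatives, a fixed-size FIFO embedding queue, an exponential-moving-average (EMA) momentum encoder, and distillation targets formed by mixing the one-hot label with the momentum model's soft predictions. The names sic_loss and ema_update and the values temperature=0.07, alpha=0.4, and m=0.995 are all hypothetical choices for illustration.

```python
import math
from collections import deque


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def softmax(logits):
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sic_loss(speech_emb, image_emb, queue, temperature=0.07,
             soft_targets=None, alpha=0.4):
    """Cross-entropy of one speech query over [paired image] + queued negatives.

    If soft_targets (e.g. from a momentum model) are given, the one-hot label
    is mixed with them; alpha is an assumed mixing weight.
    """
    candidates = [image_emb] + list(queue)
    probs = softmax([dot(speech_emb, c) / temperature for c in candidates])
    one_hot = [1.0] + [0.0] * len(queue)
    if soft_targets is None:
        targets = one_hot
    else:
        targets = [(1 - alpha) * h + alpha * s
                   for h, s in zip(one_hot, soft_targets)]
    return -sum(t * math.log(p) for t, p in zip(targets, probs) if t > 0)


def ema_update(momentum_params, online_params, m=0.995):
    """Move momentum-encoder parameters slowly toward the online encoder."""
    return [m * pm + (1 - m) * po
            for pm, po in zip(momentum_params, online_params)]


# Fixed-size queue of past image embeddings used as negatives (oldest dropped first).
queue = deque(maxlen=2)
for emb in ([0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]):
    queue.append(emb)

aligned = sic_loss([1.0, 0.0], [1.0, 0.0], queue)     # speech matches its image
misaligned = sic_loss([1.0, 0.0], [0.0, 1.0], queue)  # speech does not match

# Soft targets: the momentum model's distribution over the same candidates
# (here computed from toy momentum logits).
soft = softmax([2.0, 0.5, 0.1])
distilled = sic_loss([1.0, 0.0], [1.0, 0.0], queue, soft_targets=soft)
```

In this toy setup the aligned pair yields a lower loss than the misaligned one. Mixing in soft targets raises the loss for the aligned pair slightly because some target mass moves to the queued candidates, which is the mechanism that softens noisy one-hot supervision.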
Taxonomy
Topics: Speech Recognition and Synthesis · Music and Audio Processing · Advanced Data Compression Techniques
Methods: ALIGN
