Coarse-to-fine Alignment Makes Better Speech-image Retrieval

Lifeng Zhou; Yuke Li

arXiv:2408.13119 · cs.CL · September 12, 2024

Open Access

TL;DR

This paper introduces a unified coarse-to-fine framework for speech-image retrieval. It jointly trains speech-image contrastive (SIC) and speech-image matching (SIM) tasks, using an embedding queue for negative sampling and momentum distillation to improve accuracy and generalization.

Contribution

The paper presents a novel joint training framework with an embedding queue and momentum distillation for enhanced speech-image retrieval performance.
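To make the momentum distillation idea concrete, here is a minimal NumPy sketch. It assumes the formulation popularized by ALBEF, in which a slowly updated "momentum" copy of the model supplies soft targets that are blended with the one-hot labels. This page does not give the paper's exact loss, so the function names, `momentum=0.995`, and `alpha=0.4` are illustrative assumptions, not the authors' settings.

```python
import numpy as np

def ema_update(teacher_params, student_params, momentum=0.995):
    """Move each teacher parameter a small step toward the student (exponential moving average)."""
    return {
        name: momentum * teacher_params[name] + (1.0 - momentum) * student_params[name]
        for name in teacher_params
    }

def log_softmax(logits, axis=-1):
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def distilled_contrastive_loss(student_logits, teacher_logits, alpha=0.4):
    """Cross-entropy against a blend of one-hot labels and teacher pseudo-targets.

    Assumes the i-th speech clip matches the i-th image, so the hard label
    for row i is column i. alpha=0 recovers the plain contrastive loss.
    """
    batch = student_logits.shape[0]
    one_hot = np.zeros_like(student_logits)
    one_hot[np.arange(batch), np.arange(batch)] = 1.0
    teacher_probs = np.exp(log_softmax(teacher_logits))
    targets = alpha * teacher_probs + (1.0 - alpha) * one_hot
    return -(targets * log_softmax(student_logits)).sum(axis=1).mean()

# Toy usage with random similarity logits (hypothetical data).
rng = np.random.default_rng(0)
student_logits = rng.normal(size=(4, 4))
teacher_logits = rng.normal(size=(4, 4))
loss = distilled_contrastive_loss(student_logits, teacher_logits)
teacher = ema_update({"w": np.ones(2)}, {"w": np.zeros(2)}, momentum=0.9)
```

The soft targets let the model receive credit for plausible matches that the noisy pairing labels mark as negatives, which is the motivation usually given for this kind of distillation.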

Findings

01. Outperforms the state-of-the-art method by more than 4% in R@1 on benchmark datasets.

02. Demonstrates strong zero-shot generalization capabilities.

03. Mines hard negatives effectively to improve alignment quality.

Abstract

In this paper, we propose a novel framework for speech-image retrieval. We utilize speech-image contrastive (SIC) learning tasks to align speech and image representations at a coarse level and speech-image matching (SIM) learning tasks to further refine the fine-grained cross-modal alignment. SIC and SIM learning tasks are jointly trained in a unified manner. To optimize the learning process, we utilize an embedding queue that facilitates efficient sampling of high-quality and diverse negative representations during SIC learning. Additionally, it enhances the learning of SIM tasks by effectively mining hard negatives based on contrastive similarities calculated in SIC tasks. To further optimize learning under noisy supervision, we incorporate momentum distillation into the training process. Experimental results show that our framework outperforms the state-of-the-art method by more than…
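The coarse-level step described in the abstract can be sketched roughly as follows. This is a minimal NumPy illustration, not the authors' implementation: it assumes an InfoNCE-style speech-to-image loss, a FIFO queue of past image embeddings used as extra negatives, and hard negatives for the matching task drawn in proportion to contrastive similarity. All function names, the temperature of 0.07, and the queue size are hypothetical.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def log_softmax(logits, axis=-1):
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def sic_loss_with_queue(speech, image, image_queue, temperature=0.07):
    """Speech-to-image contrastive loss; candidates are in-batch images plus queued images."""
    speech = l2_normalize(speech)
    candidates = l2_normalize(np.concatenate([image, image_queue], axis=0))
    logits = speech @ candidates.T / temperature      # shape (B, B + K)
    idx = np.arange(speech.shape[0])                  # positive for row i is column i
    loss = -log_softmax(logits)[idx, idx].mean()
    return loss, logits

def sample_hard_negatives(logits, rng):
    """Pick one non-matching candidate per speech clip, favoring high similarity."""
    masked = logits.copy()
    idx = np.arange(masked.shape[0])
    masked[idx, idx] = -np.inf                        # never sample the true match
    probs = np.exp(log_softmax(masked))
    return np.array([rng.choice(masked.shape[1], p=probs[i]) for i in idx])

def update_queue(queue, new_embeddings):
    """FIFO update: drop the oldest entries, append the newest; size stays fixed."""
    new_embeddings = l2_normalize(new_embeddings)
    return np.concatenate([queue[new_embeddings.shape[0]:], new_embeddings], axis=0)

# Toy usage with random embeddings (hypothetical data).
rng = np.random.default_rng(0)
speech = rng.normal(size=(4, 8))
image = rng.normal(size=(4, 8))
queue = l2_normalize(rng.normal(size=(16, 8)))
loss, logits = sic_loss_with_queue(speech, image, queue)
hard_idx = sample_hard_negatives(logits, rng)
queue = update_queue(queue, image)
```

In a full system, the sampled indices would select image embeddings to pair with each speech clip as negative inputs to the matching (SIM) classifier, and a symmetric image-to-speech loss with its own speech queue would typically be added.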

Taxonomy

Topics: Speech Recognition and Synthesis · Music and Audio Processing · Advanced Data Compression Techniques

Methods: ALIGN