Two-Stage Triplet Loss Training with Curriculum Augmentation for Audio-Visual Retrieval
Donghuo Zeng, Kazushi Ikeda

TL;DR
This paper introduces a two-stage curriculum learning approach for audio-visual retrieval that progressively trains with semi-hard and hard triplets, significantly improving retrieval performance.
Contribution
It proposes a novel two-stage training paradigm with curriculum augmentation to better distinguish semi-hard and hard triplets in cross-modal retrieval models.
Findings
Achieves approximately 9.8% improvement in MAP over state-of-the-art methods.
Identifies and exploits hard negatives through embedding augmentation.
Demonstrates significant performance gains on the AVE dataset.
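
For readers unfamiliar with the MAP metric cited above, the following is a minimal sketch of how mean average precision is commonly computed for retrieval. It is not taken from the paper: the function names and the binary relevance encoding (1 if a retrieved item shares the query's class, 0 otherwise) are assumptions made for illustration.

```python
import numpy as np

def average_precision(relevance):
    """Average precision for one query.

    `relevance` is a ranked list of 0/1 flags (assumed encoding):
    1 means the item at that rank is relevant to the query.
    """
    relevance = np.asarray(relevance, dtype=float)
    total_relevant = relevance.sum()
    if total_relevant == 0:
        return 0.0  # no relevant items retrieved for this query
    hits = np.cumsum(relevance)                  # relevant items seen up to rank k
    ranks = np.arange(1, len(relevance) + 1)     # ranks 1..N
    precision_at_k = hits / ranks
    # Average the precision values at the ranks where a relevant item appears.
    return float((precision_at_k * relevance).sum() / total_relevant)

def mean_average_precision(relevance_lists):
    """Mean of the per-query average precision values."""
    return float(np.mean([average_precision(r) for r in relevance_lists]))

# Two hypothetical queries with their ranked relevance flags.
print(mean_average_precision([[1, 0, 1], [0, 1]]))
```

In audio-visual retrieval the metric is typically reported in both directions (audio query to visual items, and the reverse); the sketch applies unchanged to either direction.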
Abstract
The cross-modal retrieval model leverages the potential of triplet loss optimization to learn robust embedding spaces. However, existing methods often train these models in a single pass, overlooking the distinction between semi-hard and hard triplets in the optimization process. This oversight leads to suboptimal model performance. In this paper, we introduce a novel approach rooted in curriculum learning to address this problem. We propose a two-stage training paradigm that guides the model's learning process from semi-hard to hard triplets. In the first stage, the model is trained with a set of semi-hard triplets, starting from a low-loss base. Subsequently, in the second stage, we augment the embeddings using an interpolation technique. This process identifies potential hard negatives, alleviating issues arising from high-loss…
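
The distinction the abstract draws between semi-hard and hard triplets can be illustrated with a short sketch. It assumes the commonly used definitions (hard: the negative is closer to the anchor than the positive; semi-hard: the negative is farther than the positive but still inside the margin), which may differ from the paper's exact criteria. The interpolation step is likewise only a hypothetical stand-in for the paper's augmentation, whose details are not given in the text above; all function names and the margin value are assumptions.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Standard margin-based triplet loss with Euclidean distances."""
    d_ap = np.linalg.norm(anchor - positive)
    d_an = np.linalg.norm(anchor - negative)
    return max(d_ap - d_an + margin, 0.0)

def triplet_kind(anchor, positive, negative, margin=0.2):
    """Classify a triplet as 'hard', 'semi-hard', or 'easy' (assumed definitions)."""
    d_ap = np.linalg.norm(anchor - positive)
    d_an = np.linalg.norm(anchor - negative)
    if d_an < d_ap:
        return "hard"        # negative is closer than the positive
    if d_an < d_ap + margin:
        return "semi-hard"   # negative is farther, but within the margin
    return "easy"            # margin already satisfied, zero loss

def interpolate(source, target, lam):
    """Linear interpolation: lam=1 returns `source`, lam=0 returns `target`.

    Hypothetical augmentation: pulling a negative toward the anchor
    yields a synthetic negative that is harder than the original.
    """
    return lam * source + (1.0 - lam) * target

# Toy 2-D embeddings (e.g. an audio anchor with visual positive and negative).
anchor = np.array([0.0, 0.0])
positive = np.array([1.0, 0.0])
negative = np.array([1.1, 0.0])

print(triplet_kind(anchor, positive, negative))      # stage-one style triplet
harder_negative = interpolate(negative, anchor, 0.5)
print(triplet_kind(anchor, positive, harder_negative))  # after augmentation
```

Under these assumptions, a curriculum would train on triplets of the first kind before introducing synthetic negatives of the second kind, so that high-loss triplets are only seen once the embedding space is already reasonably structured.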
Taxonomy
Topics: Music and Audio Processing · Speech and Audio Processing · Advanced Image and Video Retrieval Techniques
Methods: Sparse Evolutionary Training
