Two-Stage Triplet Loss Training with Curriculum Augmentation for Audio-Visual Retrieval

Donghuo Zeng; Kazushi Ikeda

arXiv:2310.13451 · cs.SD · October 23, 2023 · 1 citation

Open Access

TL;DR

This paper introduces a two-stage curriculum learning approach for audio-visual retrieval that progressively trains with semi-hard and hard triplets, significantly improving retrieval performance.

Contribution

It proposes a novel two-stage training paradigm with curriculum augmentation to better distinguish semi-hard and hard triplets in cross-modal retrieval models.
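For reference, the semi-hard/hard distinction is usually stated as follows in the triplet-mining literature. This is a sketch of the common convention, not a definition taken from this page: the symbols $a$, $p$, $n$ (anchor, positive, negative embeddings), the distance $d$, and the margin $m$ are assumed notation.

```latex
% Margin-based triplet loss for one triplet (a, p, n), margin m > 0
\mathcal{L}(a, p, n) = \max\bigl(d(a, p) - d(a, n) + m,\; 0\bigr)

% Common triplet categories
\text{easy:}      \quad d(a, n) \ge d(a, p) + m        \quad (\mathcal{L} = 0)
\text{semi-hard:} \quad d(a, p) < d(a, n) < d(a, p) + m \quad (0 < \mathcal{L} < m)
\text{hard:}      \quad d(a, n) \le d(a, p)            \quad (\mathcal{L} \ge m)
```

Under this convention, semi-hard triplets give small, stable losses, while hard triplets give the largest ones, which is consistent with the summary's description of moving from a low-loss starting point to harder examples.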

Findings

01. Achieved approximately 9.8% improvement in MAP over state-of-the-art methods.

02. Effectively identifies and utilizes hard negatives through embedding augmentation.

03. Demonstrated significant performance gains on the AVE dataset.

Abstract

Cross-modal retrieval models leverage triplet loss optimization to learn robust embedding spaces. However, existing methods often train these models in a single pass, overlooking the distinction between semi-hard and hard triplets during optimization, which leads to suboptimal model performance. In this paper, we introduce a novel approach rooted in curriculum learning to address this problem. We propose a two-stage training paradigm that guides the model's learning process from semi-hard to hard triplets. In the first stage, the model is trained with a set of semi-hard triplets, starting from a low-loss base. Subsequently, in the second stage, we augment the embeddings using an interpolation technique. This process identifies potential hard negatives, alleviating issues arising from high-loss…
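The two ingredients named in the abstract, sorting triplets by difficulty and interpolating embeddings, can be illustrated with a small NumPy sketch. This is not the authors' implementation. The abstract is truncated and does not give the interpolation formula, so the `interpolate` function below (a linear mix of a negative and a positive embedding with weight `lam`) is a hypothetical stand-in, and the helper names, the Euclidean distance, and the toy 2-D embeddings are all assumptions made for illustration.

```python
import numpy as np


def euclidean(a, b):
    """Euclidean distance between matching rows of a and b."""
    return np.linalg.norm(np.asarray(a, float) - np.asarray(b, float), axis=-1)


def triplet_loss(d_ap, d_an, margin):
    """Standard margin-based triplet loss, one value per triplet."""
    return np.maximum(np.asarray(d_ap) - np.asarray(d_an) + margin, 0.0)


def triplet_kind(d_ap, d_an, margin):
    """Label each triplet as 'easy', 'semi-hard', or 'hard' (common convention)."""
    d_ap, d_an = np.asarray(d_ap, float), np.asarray(d_an, float)
    kinds = np.full(d_ap.shape, "easy", dtype=object)
    kinds[(d_an > d_ap) & (d_an < d_ap + margin)] = "semi-hard"
    kinds[d_an <= d_ap] = "hard"
    return kinds


def interpolate(negative, positive, lam):
    """Hypothetical augmentation: pull a negative toward the positive.

    lam = 1 keeps the original negative; smaller lam yields a harder one.
    """
    negative = np.asarray(negative, float)
    positive = np.asarray(positive, float)
    return lam * negative + (1.0 - lam) * positive


# Toy 2-D embeddings standing in for audio/visual features (assumed values).
anchor = np.array([[0.0, 0.0]])
positive = np.array([[1.0, 0.0]])
negative = np.array([[4.0, 0.0]])
margin = 1.0

d_ap = euclidean(anchor, positive)
d_an = euclidean(anchor, negative)
print(triplet_kind(d_ap, d_an, margin), triplet_loss(d_ap, d_an, margin))

# Synthesize a harder negative by interpolation and re-check the triplet.
harder = interpolate(negative, positive, lam=0.2)
d_an_aug = euclidean(anchor, harder)
print(triplet_kind(d_ap, d_an_aug, margin), triplet_loss(d_ap, d_an_aug, margin))
```

With these toy values the original triplet is easy (zero loss), while the interpolated negative turns it into a semi-hard triplet with a non-zero loss. A curriculum in the spirit of the abstract would train on semi-hard triplets first and introduce such synthesized harder negatives only in a second stage.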

Taxonomy

Topics: Music and Audio Processing · Speech and Audio Processing · Advanced Image and Video Retrieval Techniques

Methods: Sparse Evolutionary Training