Complete Cross-triplet Loss in Label Space for Audio-visual Cross-modal Retrieval

Donghuo Zeng, Yanan Wang, Jianming Wu, and Kazushi Ikeda

arXiv:2211.03434·cs.MM·November 8, 2022

TL;DR

This paper introduces a cross-modal retrieval model that applies a complete cross-triplet loss in label space to better align audio and visual data, reducing the impact of hard negatives and improving retrieval accuracy.

Contribution

The proposed model directly predicts labels and employs complete cross-triplet loss in label space, enhancing cross-modal retrieval performance over existing methods.

Findings

01

Achieved approximately 2.1% improvement in average MAP over TNN-CCCA.

02

Effectively reduces interference from hard negative samples.

03

Demonstrated superior performance on two audio-visual datasets.

Abstract

The heterogeneity gap is the main challenge in cross-modal retrieval, because cross-modal data (e.g., audio-visual) have different distributions and representations that cannot be directly compared. To bridge the gap between the audio and visual modalities, we learn a common subspace for them by exploiting the intrinsic correlation in the natural synchronization of audio-visual data, with the aid of annotated labels. TNN-CCCA is the best audio-visual cross-modal retrieval (AV-CMR) model so far, but its training is sensitive to hard negative samples, because it learns the common subspace by applying a triplet loss to predict the relative distance between inputs. In this paper, to reduce the interference of hard negative samples in representation learning, we propose a new AV-CMR model that optimizes semantic features by directly predicting labels and then measuring the intrinsic correlation between…
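The abstract is truncated here, so the exact form of the complete cross-triplet loss is not available on this page. As a rough illustration of the general idea only (a triplet loss computed on predicted label vectors rather than on raw embeddings, with the anchor drawn from one modality and the positive and negative from the other), a minimal pure-Python sketch might look like the following. The function names, the Euclidean distance, the margin of 0.2, and the choice of exactly two cross-modal directions are all assumptions made for this example; they are not taken from the paper.

```python
import math


def softmax(logits):
    """Turn raw class scores into a probability vector (a point in label space)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def euclidean(u, v):
    """Euclidean distance between two label-space vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Standard hinge-style triplet loss: pull the positive closer than the negative."""
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)


def cross_triplet_loss(audio_probs, visual_probs, labels, margin=0.2):
    """Average triplet loss over cross-modal triplets (hypothetical helper).

    Assumption: anchors come from one modality, positives and negatives from
    the other, in both directions. The paper's "complete" variant may
    enumerate further combinations that are not reproduced here.
    """
    losses = []
    directions = ((audio_probs, visual_probs), (visual_probs, audio_probs))
    for anchors, others in directions:
        for i, anchor in enumerate(anchors):
            for j, positive in enumerate(others):
                if labels[j] != labels[i]:
                    continue  # positive must share the anchor's class
                for k, negative in enumerate(others):
                    if labels[k] == labels[i]:
                        continue  # negative must come from a different class
                    losses.append(triplet_loss(anchor, positive, negative, margin))
    return sum(losses) / len(losses) if losses else 0.0


# Toy usage: two samples, two classes, one label vector per modality.
labels = [0, 1]
audio = [softmax([4.0, 0.0]), softmax([0.0, 4.0])]
visual = [softmax([3.0, 0.0]), softmax([0.0, 3.0])]
print(cross_triplet_loss(audio, visual, labels))
```

When both modalities predict confident, correct labels, same-class pairs sit close together in label space and different-class pairs sit far apart, so the hinge term is inactive; when predictions are uninformative, every triplet contributes the margin.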

Taxonomy

Topics: Advanced Image and Video Retrieval Techniques · Video Analysis and Summarization · Multimodal Machine Learning Applications

Methods: Triplet Loss