Complete Cross-triplet Loss in Label Space for Audio-visual Cross-modal Retrieval
Donghuo Zeng, Yanan Wang, Jianming Wu, and Kazushi Ikeda

TL;DR
This paper introduces a novel cross-modal retrieval model that uses complete cross-triplet loss in label space to better align audio-visual data, reducing the impact of hard negatives and improving retrieval accuracy.
Contribution
The proposed model directly predicts labels and employs complete cross-triplet loss in label space, enhancing cross-modal retrieval performance over existing methods.
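To make the idea concrete, here is a minimal sketch of a triplet loss computed in label space, that is, on predicted label distributions rather than on raw embeddings. It is an illustration under stated assumptions, not the authors' implementation: the softmax outputs, the Euclidean distance, the margin value of 0.2, and the function names (`softmax`, `triplet_loss`, `cross_triplet_loss`) are all hypothetical choices. It covers only the two cross-modal directions (audio anchor with visual positives and negatives, and the reverse) and does not attempt to reproduce the paper's "complete" formulation.

```python
import math

def softmax(logits):
    """Turn raw class scores into a predicted label distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def euclidean(p, q):
    """Euclidean distance between two label distributions (assumed metric)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Standard hinge-style triplet loss; margin=0.2 is an assumed value."""
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)

def cross_triplet_loss(audio_logits, visual_logits, labels, margin=0.2):
    """Average triplet loss over cross-modal triplets in label space.

    The anchor comes from one modality; the positive (same label) and the
    negative (different label) come from the other modality. Both
    directions are included. Item i of each modality has class labels[i].
    """
    audio = [softmax(l) for l in audio_logits]
    visual = [softmax(l) for l in visual_logits]
    total, count = 0.0, 0
    for anchors, others in ((audio, visual), (visual, audio)):
        for i, a in enumerate(anchors):
            for j, p in enumerate(others):
                if labels[j] != labels[i]:
                    continue  # positive must share the anchor's label
                for k, n in enumerate(others):
                    if labels[k] == labels[i]:
                        continue  # negative must have a different label
                    total += triplet_loss(a, p, n, margin)
                    count += 1
    return total / count if count else 0.0

# Two classes, one audio/visual pair per class.
aligned = cross_triplet_loss([[10, 0], [0, 10]], [[10, 0], [0, 10]], [0, 1])
swapped = cross_triplet_loss([[10, 0], [0, 10]], [[0, 10], [10, 0]], [0, 1])
```

When both modalities predict the correct label with high confidence, the cross-modal positives sit close to the anchor in label space and the loss vanishes (`aligned`); when the visual predictions are swapped, every triplet violates the margin and the loss is large (`swapped`).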
Findings
Achieved approximately 2.1% improvement in average MAP over TNN-CCCA.
Effectively reduces interference from hard negative samples.
Demonstrated superior performance on two audio-visual datasets.
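For readers unfamiliar with the metric, the following is a minimal sketch of how mean average precision (MAP) is commonly computed for retrieval. The function names and the toy relevance lists are hypothetical, and reading "average MAP" as the mean of the audio-to-visual and visual-to-audio MAP scores is an assumption, not something this summary states.

```python
def average_precision(ranked_relevance):
    """AP for one query.

    ranked_relevance is a list of 0/1 flags, in ranked order, marking
    whether each retrieved item shares the query's label.
    """
    hits = 0
    precision_sum = 0.0
    for rank, rel in enumerate(ranked_relevance, start=1):
        if rel:
            hits += 1
            precision_sum += hits / rank  # precision at this relevant rank
    return precision_sum / hits if hits else 0.0

def mean_average_precision(all_ranked_relevance):
    """MAP: the mean of per-query AP values."""
    aps = [average_precision(r) for r in all_ranked_relevance]
    return sum(aps) / len(aps) if aps else 0.0

# Toy example (made-up relevance flags, not results from the paper).
audio_to_visual = mean_average_precision([[1, 0, 1, 0], [0, 1]])
visual_to_audio = mean_average_precision([[1, 1, 0], [0, 0, 1]])
# Assumed reading of "average MAP": mean over both retrieval directions.
average_map = (audio_to_visual + visual_to_audio) / 2
```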
Abstract
The heterogeneity gap is the main challenge in cross-modal retrieval, because cross-modal data (e.g., audio-visual) have different distributions and representations that cannot be directly compared. To bridge the gap between the audio and visual modalities, we learn a common subspace for them by exploiting the intrinsic correlation in the natural synchronization of audio-visual data with the aid of annotated labels. TNN-CCCA is the best audio-visual cross-modal retrieval (AV-CMR) model so far, but its training is sensitive to hard negative samples because it learns the common subspace by applying triplet loss to predict the relative distance between inputs. In this paper, to reduce the interference of hard negative samples in representation learning, we propose a new AV-CMR model that optimizes semantic features by directly predicting labels and then measuring the intrinsic correlation between…
Taxonomy
Topics: Advanced Image and Video Retrieval Techniques · Video Analysis and Summarization · Multimodal Machine Learning Applications
Methods: Triplet Loss
