Dissecting Deep Metric Learning Losses for Image-Text Retrieval

Hong Xuan; Xi Chen

arXiv:2210.13188 · cs.CV · October 25, 2022 · 1 citation

TL;DR

This paper introduces GOAL (Gradient-based Objective AnaLysis), a framework for analyzing and designing gradient-based objectives in deep metric learning for image-text retrieval. The resulting objectives improve retrieval performance and reach state-of-the-art results on COCO and Flickr30K.

Contribution

The paper proposes a novel gradient analysis framework and new gradient-based objectives that enhance deep metric learning for image-text retrieval.

Findings

01. Consistently improved retrieval performance across various models.

02. Achieved state-of-the-art results on the COCO and Flickr30K datasets.

03. Demonstrated the generalizability of GOAL to different loss functions.

Abstract

Visual-Semantic Embedding (VSE) is a prevalent approach to image-text retrieval that learns a joint embedding space between the image and language modalities in which semantic similarities are preserved. The triplet loss with hard-negative mining has become the de facto objective for most VSE methods. Inspired by recent progress in deep metric learning (DML) in the image domain, which has given rise to new loss functions that outperform the triplet loss, in this paper we revisit the problem of finding better objectives for VSE in image-text matching. Despite some attempts at designing losses based on gradient movement, most DML losses are defined empirically in the embedding space. Instead of directly applying these loss functions, which may lead to sub-optimal gradient updates in model parameters, we present a novel Gradient-based Objective AnaLysis framework, or GOAL,…
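For readers unfamiliar with the baseline the abstract refers to, the following is a minimal sketch of a triplet loss with hardest-negative mining over a batch of matched image-text pairs. It illustrates only that baseline, not the GOAL objectives themselves. The function names, the use of cosine similarity, the margin of 0.2, and the averaging over the batch are all assumptions made for illustration and are not taken from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def hardest_negative_triplet_loss(image_embs, text_embs, margin=0.2):
    """Bidirectional hinge loss using only the hardest negative in the batch.

    Assumption: image_embs[i] and text_embs[i] form the matching pair, and
    every other item in the batch is treated as a negative. Needs >= 2 pairs.
    """
    n = len(image_embs)
    # sim[i][j] is the similarity between image i and caption j.
    sim = [[cosine(image_embs[i], text_embs[j]) for j in range(n)]
           for i in range(n)]
    total = 0.0
    for i in range(n):
        pos = sim[i][i]
        # Hardest negative caption for image i (image-to-text direction).
        hard_text = max(sim[i][j] for j in range(n) if j != i)
        # Hardest negative image for caption i (text-to-image direction).
        hard_image = max(sim[j][i] for j in range(n) if j != i)
        total += max(0.0, margin + hard_text - pos)
        total += max(0.0, margin + hard_image - pos)
    # Averaging over the batch is an illustrative choice; summing is also common.
    return total / n


# Toy batch: two orthogonal image embeddings with aligned or swapped captions.
images = [[1.0, 0.0], [0.0, 1.0]]
aligned_texts = [[1.0, 0.0], [0.0, 1.0]]
swapped_texts = [[0.0, 1.0], [1.0, 0.0]]

loss_aligned = hardest_negative_triplet_loss(images, aligned_texts)
loss_swapped = hardest_negative_triplet_loss(images, swapped_texts)
print(loss_aligned, loss_swapped)
```

In this toy batch the aligned pairs already beat every negative by more than the margin, so they incur no loss, while the swapped captions are penalized in both retrieval directions. Because only the single hardest negative contributes per direction, the gradient for each anchor comes from one negative at a time, which is the kind of behavior a gradient-level analysis would examine.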

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Image and Video Retrieval Techniques

Methods: Triplet Loss