VSE++: Improving Visual-Semantic Embeddings with Hard Negatives

Fartash Faghri; David J. Fleet; Jamie Ryan Kiros; Sanja Fidler

arXiv:1707.05612 · cs.LG · July 31, 2018 · 580 cites

TL;DR

VSE++ makes a simple change, inspired by hard negative mining, to the loss functions commonly used for visual-semantic embeddings. Combined with fine-tuning and augmented data, it significantly improves cross-modal retrieval performance.

Contribution

The paper contributes a modified loss function for multi-modal embeddings that emphasizes hard negatives, together with a training recipe (fine-tuning and augmented data) that improves visual-semantic embedding quality for cross-modal retrieval.

Findings

01. Outperforms state-of-the-art methods on MS-COCO by 8.8% in caption retrieval (at R@1)

02. Outperforms state-of-the-art methods on MS-COCO by 11.3% in image retrieval (at R@1)

03. Demonstrates effectiveness on MS-COCO and Flickr30K through ablation studies and comparisons with existing methods
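The abstract reports these retrieval gains at R@1, that is, recall at rank 1. As a rough illustration of the metric only, the sketch below computes Recall@K from a query-by-candidate similarity matrix. The function name `recall_at_k` is hypothetical, and the one-to-one pairing between query i and candidate i is a simplifying assumption, not the paper's evaluation protocol.

```python
def recall_at_k(scores, k=1):
    """Fraction of queries whose matching item ranks in the top k.

    scores[i][j] is the similarity of query i to candidate j.
    Assumption for this sketch: candidate i is the single correct
    match for query i.
    """
    hits = 0
    for i, row in enumerate(scores):
        # Rank of the correct match = number of candidates that score
        # strictly higher than it (0 means it is ranked first).
        rank = sum(1 for s in row if s > row[i])
        if rank < k:
            hits += 1
    return hits / len(scores)


# Toy similarity matrix: 3 queries (rows) against 3 candidates (columns).
toy_scores = [
    [0.9, 0.1, 0.2],
    [0.8, 0.3, 0.1],
    [0.2, 0.1, 0.7],
]
r_at_1 = recall_at_k(toy_scores, k=1)
r_at_2 = recall_at_k(toy_scores, k=2)
```

In this toy matrix, queries 0 and 2 rank their match first, while query 1 has one higher-scoring distractor, so it counts as a hit only once k reaches 2.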

Abstract

We present a new technique for learning visual-semantic embeddings for cross-modal retrieval. Inspired by hard negative mining, the use of hard negatives in structured prediction, and ranking loss functions, we introduce a simple change to common loss functions used for multi-modal embeddings. That, combined with fine-tuning and use of augmented data, yields significant gains in retrieval performance. We showcase our approach, VSE++, on MS-COCO and Flickr30K datasets, using ablation studies and comparisons with existing methods. On MS-COCO our approach outperforms state-of-the-art methods by 8.8% in caption retrieval and 11.3% in image retrieval (at R@1).
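The "simple change" can be illustrated with a hinge-based ranking loss over a mini-batch. The sketch below contrasts summing the hinge over all in-batch negatives with keeping only the hardest negative per query. It is a minimal pure-Python illustration under stated assumptions, not the authors' implementation: the function names, the cosine similarity, and the margin value of 0.2 are all choices made for this example.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def hinge_ranking_loss(image_embs, caption_embs, margin=0.2, hardest_only=True):
    """Bidirectional hinge ranking loss over one mini-batch.

    Assumptions for this sketch: image_embs[i] and caption_embs[i] form a
    matching pair, every other pairing in the batch is a negative, and the
    batch holds at least two pairs.

    hardest_only=False sums the hinge over all in-batch negatives.
    hardest_only=True keeps only the largest violation per query.
    """
    n = len(image_embs)
    # scores[i][j] = similarity between image i and caption j.
    scores = [[cosine(img, cap) for cap in caption_embs] for img in image_embs]
    aggregate = max if hardest_only else sum
    total = 0.0
    for i in range(n):
        pos = scores[i][i]
        # Image i as the query: every non-matching caption is a negative.
        caption_costs = [
            max(0.0, margin + scores[i][j] - pos) for j in range(n) if j != i
        ]
        # Caption i as the query: every non-matching image is a negative.
        image_costs = [
            max(0.0, margin + scores[j][i] - pos) for j in range(n) if j != i
        ]
        total += aggregate(caption_costs) + aggregate(image_costs)
    return total


# Toy batch of three image/caption pairs in a 2-D embedding space.
images = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
captions = [[1.0, 0.1], [0.1, 1.0], [1.0, 0.9]]
loss_sum = hinge_ranking_loss(images, captions, hardest_only=False)
loss_max = hinge_ranking_loss(images, captions, hardest_only=True)
```

Because every hinge term is non-negative, the hardest-negative loss never exceeds the summed loss on the same batch. It concentrates the penalty on the single most confusable negative for each query rather than spreading it across many easy ones.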

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Topic Modeling