VSE++: Improving Visual-Semantic Embeddings with Hard Negatives
Fartash Faghri, David J. Fleet, Jamie Ryan Kiros, Sanja Fidler

TL;DR
VSE++ makes a simple change, inspired by hard negative mining, to the loss functions commonly used for visual-semantic embeddings. Combined with fine-tuning and augmented data, it significantly improves cross-modal retrieval performance.
Contribution
The paper proposes a modification to the common loss functions for multi-modal embeddings, together with a training strategy (fine-tuning and augmented data), that improves visual-semantic embedding quality for cross-modal retrieval.
Findings
Outperforms state-of-the-art methods on MS-COCO by 8.8% in caption retrieval (at R@1)
Improves image retrieval on MS-COCO by 11.3% (at R@1)
Validates the approach on MS-COCO and Flickr30K with ablation studies and comparisons with existing methods
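The retrieval gains above are reported as recall at rank 1 (R@1), as the abstract states. A minimal sketch of how recall at K can be computed from a similarity matrix follows. It assumes, for simplicity, exactly one correct item per query, stored at the same index as the query; the function name `recall_at_k` and the toy scores are illustrative and not taken from the paper.

```python
def recall_at_k(scores, k=1):
    """Fraction of queries whose correct item appears in the top-k results.

    scores[q][j] is the similarity between query q and candidate j.
    Assumption for this sketch: the correct candidate for query q is j == q.
    """
    hits = 0
    for q, row in enumerate(scores):
        # Candidate indices sorted from most to least similar.
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if q in ranked[:k]:
            hits += 1
    return hits / len(scores)


# Toy similarity matrix: 3 queries, 3 candidates (values are made up).
toy_scores = [
    [0.9, 0.1, 0.2],  # query 0: correct item ranked first
    [0.8, 0.3, 0.1],  # query 1: correct item ranked second
    [0.2, 0.1, 0.7],  # query 2: correct item ranked first
]
r1 = recall_at_k(toy_scores, k=1)
r2 = recall_at_k(toy_scores, k=2)
```

In real caption retrieval a query image may have several correct captions, so an actual evaluation would count a hit if any of them lands in the top K.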
Abstract
We present a new technique for learning visual-semantic embeddings for cross-modal retrieval. Inspired by hard negative mining, the use of hard negatives in structured prediction, and ranking loss functions, we introduce a simple change to common loss functions used for multi-modal embeddings. That, combined with fine-tuning and use of augmented data, yields significant gains in retrieval performance. We showcase our approach, VSE++, on MS-COCO and Flickr30K datasets, using ablation studies and comparisons with existing methods. On MS-COCO our approach outperforms state-of-the-art methods by 8.8% in caption retrieval and 11.3% in image retrieval (at R@1).
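The abstract does not spell out the loss itself. The sketch below shows one way a hard-negative change to a ranking loss can look. It assumes a margin-based hinge loss over a mini-batch of matching image-caption pairs and contrasts summing over all negatives with keeping only the hardest negative per pair. The function name `ranking_losses`, the margin value 0.2, and the toy scores are assumptions for illustration, not details taken from this page.

```python
def hinge(x):
    """Standard hinge: max(0, x)."""
    return max(0.0, x)


def ranking_losses(scores, margin=0.2):
    """Return (sum_of_hinges, max_of_hinges) for a batch similarity matrix.

    scores[i][j] is the similarity between image i and caption j.
    Assumption for this sketch: matching pairs sit on the diagonal.
    """
    n = len(scores)
    sum_loss = 0.0
    max_loss = 0.0
    for k in range(n):
        pos = scores[k][k]
        # Margin violations from non-matching captions for image k.
        cap_viol = [hinge(margin + scores[k][j] - pos) for j in range(n) if j != k]
        # Margin violations from non-matching images for caption k.
        img_viol = [hinge(margin + scores[i][k] - pos) for i in range(n) if i != k]
        # Conventional variant: every violating negative contributes.
        sum_loss += sum(cap_viol) + sum(img_viol)
        # Hard-negative variant: only the worst violator in each direction counts.
        max_loss += max(cap_viol, default=0.0) + max(img_viol, default=0.0)
    return sum_loss, max_loss


# Toy batch of 3 image-caption pairs (values are made up).
batch_scores = [
    [0.9, 0.8, 0.75],
    [0.2, 0.7, 0.6],
    [0.3, 0.1, 0.5],
]
sum_of_hinges, max_of_hinges = ranking_losses(batch_scores)
```

Because each hinge term is non-negative, the hardest-negative variant never exceeds the summed variant. The two differ only when a pair has more than one violating negative, which is where a focus on hard negatives changes the training signal.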
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Topic Modeling
