Improving Visual-Semantic Embedding with Adaptive Pooling and Optimization Objective

Zijian Zhang; Chang Shu; Ya Xiao; Yuan Shen; Di Zhu; Jing Xiao; Youxin Chen; Jey Han Lau; Qian Zhang; Zheng Lu

arXiv:2210.02206 · cs.MM · October 6, 2022

TL;DR

This paper introduces adaptive pooling and negative sample selection strategies to improve visual-semantic embedding models, leading to better retrieval performance on standard datasets.

Contribution

It proposes simple yet effective pooling and optimization strategies that outperform complex methods and enhance convergence in VSE models.

Findings

01 Outperforms state-of-the-art on Flickr30K and MS-COCO datasets.

02 Improves Recall@K metrics by at least 1.0%.

03 Simple pooling methods are as effective as sophisticated ones.
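
To make the third finding concrete, here is a minimal sketch of what "combining simple pooling methods" can look like: mean pooling and max pooling are blended with softmax-normalized weights. This is an illustration under stated assumptions, not the paper's actual module. The function names and the `logits` mixing parameters are hypothetical, and in a trained model the weights would be learned rather than fixed by hand.

```python
import math

def mean_pool(features):
    """Average each dimension over all feature vectors."""
    n = len(features)
    return [sum(col) / n for col in zip(*features)]

def max_pool(features):
    """Take the per-dimension maximum over all feature vectors."""
    return [max(col) for col in zip(*features)]

def softmax(logits):
    """Turn arbitrary real-valued logits into weights that sum to 1."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def combined_pool(features, logits):
    """Convex combination of mean and max pooling.

    `logits` is a hypothetical pair of mixing parameters (an assumption
    of this sketch); a real model would learn them during training.
    """
    w_mean, w_max = softmax(logits)
    pooled_mean = mean_pool(features)
    pooled_max = max_pool(features)
    return [w_mean * a + w_max * b for a, b in zip(pooled_mean, pooled_max)]

# Three region (or token) features of dimension 2.
regions = [[1.0, 4.0], [3.0, 0.0], [2.0, 2.0]]

# Equal logits give equal weights: halfway between mean [2, 2] and max [3, 4].
pooled = combined_pool(regions, logits=[0.0, 0.0])
print(pooled)
```

Whatever the number of input vectors, the output has a fixed length, which is what lets image regions and sentence tokens be compared in one embedding space.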

Abstract

Visual-Semantic Embedding (VSE) aims to learn an embedding space where related visual and semantic instances are close to each other. Recent VSE models tend to design complex structures to pool visual and semantic features into fixed-length vectors and use hard triplet loss for optimization. However, we find that: (1) combining simple pooling methods is no worse than these sophisticated methods; and (2) only considering the most difficult-to-distinguish negative sample leads to slow convergence and poor Recall@K improvement. To this end, we propose an adaptive pooling strategy that allows the model to learn how to aggregate features through a combination of simple pooling methods. We also introduce a strategy to dynamically select a group of negative samples to make the optimization converge faster and perform better. Experimental results on Flickr30K and MS-COCO demonstrate that a…
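
The contrast between optimizing against only the hardest negative and optimizing against a group of negatives can be sketched as follows. This is a simplified illustration, not the paper's objective: it covers a single retrieval direction, the margin value and function names are hypothetical, and a fixed top-k choice stands in for the dynamic selection strategy, whose actual rule is not given in the text above.

```python
def hinge(margin, pos_score, neg_score):
    """Triplet hinge term: positive only when the negative is within the margin."""
    return max(0.0, margin - pos_score + neg_score)

def hardest_negative_loss(pos_score, neg_scores, margin=0.2):
    """Hard triplet loss: only the highest-scoring negative contributes."""
    return hinge(margin, pos_score, max(neg_scores))

def top_k_negative_loss(pos_score, neg_scores, k, margin=0.2):
    """Average hinge over the k highest-scoring negatives.

    A fixed k is an assumption of this sketch; it stands in for a
    dynamically chosen group of negatives. Requires k >= 1 and a
    non-empty list of negatives.
    """
    hardest = sorted(neg_scores, reverse=True)[:k]
    return sum(hinge(margin, pos_score, n) for n in hardest) / len(hardest)

# Similarity of one query to its matching item and to four non-matching items.
pos_score = 0.7
neg_scores = [0.6, 0.65, 0.2, 0.55]

hard = hardest_negative_loss(pos_score, neg_scores)
group = top_k_negative_loss(pos_score, neg_scores, k=3)
print(hard, group)
```

With the hardest-negative loss, only one negative per query receives a gradient at each step; the group variant spreads the training signal over several negatives that violate the margin.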

Code & Models

Repositories

96-zachary/vse_2ad (PyTorch, official)

Taxonomy

Topics: Advanced Image and Video Retrieval Techniques · Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning