Learning the Best Pooling Strategy for Visual Semantic Embedding

Jiacheng Chen; Hexiang Hu; Hao Wu; Yuning Jiang; Changhu Wang

arXiv:2011.04305 · cs.CV · July 7, 2021 · 23 cites

TL;DR

This paper introduces a Generalized Pooling Operator (GPO) that automatically learns the optimal pooling strategy for visual semantic embedding tasks, significantly improving performance across image and video retrieval benchmarks.

Contribution

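The paper proposes GPO, a learnable pooling operator that adapts to different features and feature extractors. It improves VSE models without manual selection of a pooling function and achieves state-of-the-art results on image-text and video-text retrieval benchmarks.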

Findings

01 GPO outperforms fixed pooling functions across various feature extractors.

02 VSE∞ with GPO surpasses previous methods on image-text retrieval benchmarks.

03 Variants of VSE∞ achieve new state-of-the-art results on video-text retrieval datasets.
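The retrieval benchmarks mentioned above rank candidates by their similarity to a query in the shared embedding space. The following is a minimal sketch of that ranking step, assuming cosine similarity as the score and non-zero embedding vectors; the function names `cosine` and `rank_candidates` are hypothetical and are not taken from the paper's code.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_candidates(query: Sequence[float],
                    candidates: Sequence[Sequence[float]]) -> List[int]:
    """Return candidate indices ordered from most to least similar to the query.

    Assumption: query and candidates are already embedded in the same space,
    e.g. an image embedding as the query and caption embeddings as candidates.
    """
    scores = [cosine(query, c) for c in candidates]
    return sorted(range(len(candidates)), key=lambda i: scores[i], reverse=True)


# Toy 2-D embeddings, invented purely for illustration.
image_embedding = [1.0, 0.0]
caption_embeddings = [[0.0, 1.0], [1.0, 0.1], [-1.0, 0.0]]
ranking = rank_candidates(image_embedding, caption_embeddings)
```

In this toy example the second caption points in nearly the same direction as the image embedding, so it is ranked first.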

Abstract

Visual Semantic Embedding (VSE) is a dominant approach for vision-language retrieval, which aims at learning a deep embedding space such that visual data are embedded close to their semantic text labels or descriptions. Recent VSE models use complex methods to better contextualize and aggregate multi-modal features into holistic embeddings. However, we discover that surprisingly simple (but carefully selected) global pooling functions (e.g., max pooling) outperform those complex models across different feature extractors. Although such pooling is simple and effective, seeking the best pooling function for each data modality and feature extractor is costly and tedious, especially when the size of the features varies (e.g., text, video). Therefore, we propose a Generalized Pooling Operator (GPO), which learns to automatically adapt itself to the best pooling strategy for different features,…
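One way to see how a single operator can cover several fixed pooling functions is to sort the values of each feature dimension and take a weighted sum of the sorted values. With the right weights this reduces to max pooling or average pooling. The sketch below illustrates that idea with hand-set weights. It is an illustration under stated assumptions, not the official implementation in woodfrog/vse_infty, and the name `generalized_pool` is hypothetical. In a learned variant, the weights would come from a trainable module rather than being fixed by hand.

```python
from typing import List, Sequence


def generalized_pool(features: Sequence[Sequence[float]],
                     weights: Sequence[float]) -> List[float]:
    """Pool N feature vectors (each of dimension D) into one D-dim vector.

    For each dimension, the N values are sorted in descending order and
    combined with the given weights (one weight per sorted position).
    """
    n = len(features)
    if n == 0 or len(weights) != n:
        raise ValueError("need one weight per feature vector")
    dim = len(features[0])
    pooled = []
    for d in range(dim):
        column = sorted((row[d] for row in features), reverse=True)
        pooled.append(sum(w * v for w, v in zip(weights, column)))
    return pooled


# Three 2-D features, invented purely for illustration.
features = [[1.0, 4.0], [3.0, 2.0], [2.0, 6.0]]

# All weight on the largest value in each dimension: max pooling.
max_pooled = generalized_pool(features, [1.0, 0.0, 0.0])

# Equal weight on every position: average pooling.
n = len(features)
mean_pooled = generalized_pool(features, [1.0 / n] * n)
```

Because the weights are indexed by sorted position rather than by input position, the same scheme applies to feature sets of different sizes as long as a weight vector of matching length is supplied.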

Code & Models

Repositories

woodfrog/vse_infty (PyTorch, official)

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning