Finding beans in burgers: Deep semantic-visual embedding with localization

Martin Engilberge; Louis Chevallier; Patrick Pérez; Matthieu Cord

arXiv:1804.01720·cs.CV·April 9, 2018

TL;DR

This paper introduces a novel multi-modal embedding architecture that combines space-aware pooling with joint training, achieving state-of-the-art results in image-caption retrieval and phrase grounding.

Contribution

A new deep semantic-visual embedding model with space-aware pooling and joint training that improves cross-modal retrieval and localization tasks.

Findings

01

Achieves state-of-the-art performance on cross-modal retrieval.

02

Provides accurate localization of concepts within images.

03

Demonstrates versatility in multiple vision-language tasks.
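To make the localization finding concrete, here is a minimal sketch of one generic way to localize a concept from a shared embedding space: score every spatial cell of a feature map against a text embedding and take the peak. This is an illustration under stated assumptions, not the authors' implementation. The function names, the toy 2x2 feature map, and the use of a plain dot product are all hypothetical choices made for this example.

```python
def localization_heatmap(feature_map, text_vec):
    """Score each spatial cell by its dot product with the text embedding.

    feature_map: H x W grid (list of rows) of D-dimensional vectors,
                 assumed to already live in the shared embedding space.
    text_vec:    D-dimensional embedding of the query phrase.
    """
    return [
        [sum(f * t for f, t in zip(cell, text_vec)) for cell in row]
        for row in feature_map
    ]


def peak_location(heatmap):
    """Return the (row, col) index of the highest-scoring cell."""
    cells = (
        (r, c)
        for r in range(len(heatmap))
        for c in range(len(heatmap[0]))
    )
    return max(cells, key=lambda rc: heatmap[rc[0]][rc[1]])


# Toy data (hypothetical): a 2x2 feature map with 2-dimensional features.
feature_map = [
    [[0.1, 0.0], [0.2, 0.1]],
    [[0.0, 0.9], [0.3, 0.2]],
]
text_vec = [0.0, 1.0]  # stands in for the embedding of a phrase

heatmap = localization_heatmap(feature_map, text_vec)
print(peak_location(heatmap))
```

In practice the heatmap would be upsampled to the image resolution and thresholded to obtain a region; those steps are omitted here.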

Abstract

Several works have proposed to learn a two-path neural network that maps images and texts, respectively, to the same shared Euclidean space, where geometry captures useful semantic relationships. Such a multi-modal embedding can be trained and used for various tasks, notably image captioning. In the present work, we introduce a new architecture of this type, with a visual path that leverages recent space-aware pooling mechanisms. Combined with a textual path that is jointly trained from scratch, our semantic-visual embedding offers a versatile model. Once trained under the supervision of captioned images, it yields new state-of-the-art performance on cross-modal retrieval. It also allows the localization of new concepts from the embedding space into any input image, delivering state-of-the-art results on the visual grounding of phrases.
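The cross-modal retrieval described in the abstract can be sketched as nearest-neighbor search in the shared space. The example below is a minimal sketch that assumes image and caption embeddings have already been produced by the two paths. The helper names, the toy 3-dimensional vectors, the use of cosine similarity, and the convention that image i matches caption i are assumptions made for illustration and are not taken from the paper.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def cosine_similarity(a, b):
    """Cosine similarity between two vectors of equal length."""
    a, b = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a, b))


def rank_captions(image_vec, caption_vecs):
    """Return caption indices sorted from most to least similar to the image."""
    scores = [cosine_similarity(image_vec, c) for c in caption_vecs]
    return sorted(range(len(caption_vecs)), key=lambda i: scores[i], reverse=True)


def recall_at_k(image_vecs, caption_vecs, k):
    """Fraction of images whose matching caption (same index) is in the top k."""
    hits = sum(
        1
        for i, img in enumerate(image_vecs)
        if i in rank_captions(img, caption_vecs)[:k]
    )
    return hits / len(image_vecs)


# Toy embeddings (hypothetical): image i is paired with caption i.
image_vecs = [[1.0, 0.1, 0.0], [0.0, 1.0, 0.2], [0.1, 0.0, 1.0]]
caption_vecs = [[0.9, 0.0, 0.1], [0.1, 0.8, 0.0], [0.0, 0.2, 0.7]]

print(rank_captions(image_vecs[0], caption_vecs))
print(recall_at_k(image_vecs, caption_vecs, k=1))
```

Retrieval in the other direction (caption to image) follows by swapping the roles of the two lists.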

Code & Models

Repositories

technicolor-research/dsve-loc (PyTorch, official)
