Cross-Modal Retrieval and Synthesis (X-MRS): Closing the Modality Gap in Shared Representation Learning

Ricardo Guerrero; Hai Xuan Pham; Vladimir Pavlovic

arXiv:2012.01345 · cs.CV · October 1, 2021 · 1 citation

TL;DR

This paper introduces a transformer-based multilingual recipe encoder combined with image embeddings to improve cross-modal food retrieval and synthesis, effectively capturing joint semantics and outperforming state-of-the-art methods on Recipe1M.

Contribution

The paper presents a novel multilingual transformer-based recipe encoder, regularized with imperfect translations, that enhances shared representation learning for food data.

Findings

01. Significantly outperforms SOTA on retrieval tasks

02. Enables effective food image synthesis conditioned on recipe embeddings

03. Captures joint semantics of text and images in food data

Abstract

Computational food analysis (CFA) naturally requires multi-modal evidence of a particular food, e.g., images, recipe text, etc. A key to making CFA possible is multi-modal shared representation learning, which aims to create a joint representation of the multiple views (text and image) of the data. In this work we propose a method for food domain cross-modal shared representation learning that preserves the vast semantic richness present in the food data. Our proposed method employs an effective transformer-based multilingual recipe encoder coupled with a traditional image embedding architecture. Here, we propose the use of imperfect multilingual translations to effectively regularize the model while at the same time adding functionality across multiple languages and alphabets. Experimental analysis on the public Recipe1M dataset shows that the representation learned via the proposed…
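
To make the idea of retrieval in a shared representation concrete, the following is a minimal sketch and not the authors' implementation. It assumes that image and recipe embeddings already live in one joint space and that matching pairs share the same index; the cosine-similarity ranking, the helper names (`cosine`, `rank_of_match`, `recall_at_k`), and the toy vectors are all illustrative assumptions, not taken from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_of_match(query, candidates, true_index):
    """1-based rank of candidates[true_index] when sorted by similarity to query."""
    sims = [cosine(query, c) for c in candidates]
    order = sorted(range(len(candidates)), key=lambda i: sims[i], reverse=True)
    return order.index(true_index) + 1


def recall_at_k(queries, candidates, k):
    """Fraction of queries whose paired candidate (same index) ranks in the top k."""
    hits = sum(rank_of_match(q, candidates, i) <= k for i, q in enumerate(queries))
    return hits / len(queries)


# Hypothetical toy embeddings; real ones would come from trained encoders.
image_embeddings = [[0.9, 0.1, 0.0], [0.1, 0.8, 0.2], [0.0, 0.2, 0.9]]
recipe_embeddings = [[1.0, 0.0, 0.1], [0.6, 0.7, 0.1], [0.1, 0.1, 1.0]]

# Image-to-recipe retrieval: each image queries the pool of recipes.
print(recall_at_k(image_embeddings, recipe_embeddings, k=1))
```

Swapping the two argument lists gives recipe-to-image retrieval, which is the sense in which a shared space serves both directions.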

Taxonomy

Topics: Image Retrieval and Classification Techniques · Advanced Image and Video Retrieval Techniques · Genomics and Phylogenetic Studies