Cross-Modal Retrieval and Synthesis (X-MRS): Closing the Modality Gap in Shared Representation Learning
Ricardo Guerrero, Hai Xuan Pham, Vladimir Pavlovic

TL;DR
This paper introduces a transformer-based multilingual recipe encoder combined with image embeddings to improve cross-modal food retrieval and synthesis, effectively capturing joint semantics and outperforming state-of-the-art methods on Recipe1M.
Contribution
The paper presents a multilingual transformer-based recipe encoder, regularized with imperfect translations, that improves shared representation learning for food data.
Findings
Significantly outperforms state-of-the-art methods on Recipe1M retrieval tasks
Enables effective food image synthesis conditioned on recipe embeddings
Captures joint semantics of text and images in food data
Abstract
Computational food analysis (CFA) naturally requires multi-modal evidence of a particular food, e.g., images, recipe text, etc. A key to making CFA possible is multi-modal shared representation learning, which aims to create a joint representation of the multiple views (text and image) of the data. In this work we propose a method for food domain cross-modal shared representation learning that preserves the vast semantic richness present in the food data. Our proposed method employs an effective transformer-based multilingual recipe encoder coupled with a traditional image embedding architecture. Here, we propose the use of imperfect multilingual translations to effectively regularize the model while at the same time adding functionality across multiple languages and alphabets. Experimental analysis on the public Recipe1M dataset shows that the representation learned via the proposed…
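To make the idea of a shared representation concrete, the following is a minimal sketch of cross-modal retrieval in a joint embedding space. It is not the authors' implementation: the toy vectors stand in for the outputs of a recipe encoder and an image encoder, and the helper names (`l2_normalize`, `cosine_similarity`, `retrieve`, `recall_at_k`) are hypothetical. Recall@K is used here only as a common retrieval metric and is an assumption about how such a system might be evaluated.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length so dot products equal cosine similarity."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


def cosine_similarity(a, b):
    """Cosine similarity between two embeddings of equal dimension."""
    a, b = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a, b))


def retrieve(query, gallery):
    """Return gallery indices sorted from most to least similar to the query."""
    scores = [cosine_similarity(query, g) for g in gallery]
    return sorted(range(len(gallery)), key=lambda i: scores[i], reverse=True)


def recall_at_k(queries, gallery, k):
    """Fraction of queries whose paired item (same index) ranks in the top k."""
    hits = sum(
        1 for i, q in enumerate(queries) if i in retrieve(q, gallery)[:k]
    )
    return hits / len(queries)


# Toy embeddings (made up for illustration): row i of each list is assumed
# to describe the same dish, once from its recipe text and once from its image.
recipe_embeddings = [[1.0, 0.1, 0.0], [0.0, 1.0, 0.2], [0.1, 0.0, 1.0]]
image_embeddings = [[0.9, 0.2, 0.1], [0.1, 0.8, 0.3], [0.2, 0.1, 0.9]]

# Recipe-to-image retrieval: rank all images for the first recipe.
ranking = retrieve(recipe_embeddings[0], image_embeddings)
score = recall_at_k(recipe_embeddings, image_embeddings, k=1)
```

In this toy setup each recipe vector lies closest to its paired image vector, so the paired image ranks first. With a learned model, the quality of the shared space determines how often that holds.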
Taxonomy
Topics: Image Retrieval and Classification Techniques · Advanced Image and Video Retrieval Techniques · Genomics and Phylogenetic Studies
