Cross-Modal Retrieval in the Cooking Context: Learning Semantic Text-Image Embeddings
Micael Carvalho, Rémi Cadène, David Picard, Laure Soulier, Nicolas Thome, Matthieu Cord

TL;DR
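This paper introduces a cross-modal retrieval model that aligns visual and textual cooking data (pictures of dishes and their recipes) in a shared representation space. The learning scheme scales to large retrieval problems and is validated against previous state-of-the-art models on Recipe1M, a dataset of nearly 1 million picture-recipe pairs.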
Contribution
The paper presents a learning scheme for cross-modal retrieval that handles large-scale cooking data and advances the state of the art in semantic text-image embedding.
Findings
Outperforms previous state-of-the-art models on the Recipe1M dataset
Effective in large-scale retrieval tasks
Qualitative results demonstrate practical cooking use cases
Abstract
Designing powerful tools that support cooking activities has rapidly gained popularity due to the massive amounts of available data, as well as recent advances in machine learning that are capable of analyzing them. In this paper, we propose a cross-modal retrieval model aligning visual and textual data (like pictures of dishes and their recipes) in a shared representation space. We describe an effective learning scheme, capable of tackling large-scale problems, and validate it on the Recipe1M dataset containing nearly 1 million picture-recipe pairs. We show the effectiveness of our approach regarding previous state-of-the-art models and present qualitative results over computational cooking use cases.
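To make the idea of retrieval in a shared representation space concrete, here is a minimal sketch, not the authors' implementation. It assumes that some image encoder and some recipe encoder have already mapped their inputs to vectors of the same dimension, and it ranks recipes by cosine similarity to an image query. The names `cosine` and `retrieve` and all embedding values are hypothetical and chosen only for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def retrieve(query, candidates, k=1):
    """Return the ids of the k candidates most similar to the query.

    `candidates` maps an id to its embedding in the shared space.
    """
    ranked = sorted(
        candidates,
        key=lambda cid: cosine(query, candidates[cid]),
        reverse=True,
    )
    return ranked[:k]


# Made-up 3-dimensional embeddings (assumption: real embeddings would come
# from trained encoders and have far more dimensions).
recipe_embeddings = {
    "pizza_recipe": [0.9, 0.1, 0.0],
    "salad_recipe": [0.1, 0.9, 0.2],
    "soup_recipe": [0.0, 0.2, 0.9],
}
image_embedding = [0.8, 0.2, 0.1]  # hypothetical embedding of a dish photo

# Image-to-recipe retrieval: rank recipes against the image query.
top = retrieve(image_embedding, recipe_embeddings, k=2)
print(top)
```

Because both modalities live in the same space, the same `retrieve` function would also serve recipe-to-image retrieval by passing a recipe embedding as the query and image embeddings as the candidates.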
Taxonomy
Topics: Multimodal Machine Learning Applications · Topic Modeling · Image Retrieval and Classification Techniques
