SIMMER: Cross-Modal Food Image–Recipe Retrieval via MLLM-Based Embedding
Keisuke Gomi, Keiji Yanai

TL;DR
SIMMER leverages Multimodal Large Language Models to unify food image and recipe text embedding, achieving state-of-the-art cross-modal retrieval performance on Recipe1M.
Contribution
SIMMER introduces a single-encoder approach that uses MLLM-based embeddings with tailored prompts and data augmentation, replacing complex dual-encoder architectures.
Findings
Achieves an R@1 of 87.5% in the 1k image-to-recipe retrieval setting.
Outperforms previous methods by a significant margin on the Recipe1M dataset.
Demonstrates robustness to incomplete recipe inputs.
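For readers unfamiliar with the R@1 metric quoted above, the following is a minimal sketch of how Recall@K is commonly computed for image-to-recipe retrieval. It assumes row i of the image embeddings and row i of the recipe embeddings form a matching pair and that similarity is cosine similarity; the function name `recall_at_k` and the toy data are illustrative and not taken from the paper.

```python
import numpy as np

def recall_at_k(image_emb: np.ndarray, recipe_emb: np.ndarray, k: int = 1) -> float:
    """Fraction of image queries whose paired recipe ranks in the top k.

    Assumes image_emb[i] and recipe_emb[i] are a ground-truth pair.
    """
    # L2-normalize so the dot product equals cosine similarity.
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    recipe_emb = recipe_emb / np.linalg.norm(recipe_emb, axis=1, keepdims=True)

    # sims[i, j] = similarity between image i and recipe j.
    sims = image_emb @ recipe_emb.T

    # Rank of the true recipe = number of recipes scoring strictly higher.
    true_sims = np.diag(sims)
    ranks = (sims > true_sims[:, None]).sum(axis=1)

    return float((ranks < k).mean())

# Toy example: 4 pairs, with the recipes for items 0 and 1 swapped.
images = np.eye(4)
recipes = np.eye(4)[[1, 0, 2, 3]]
r_at_1 = recall_at_k(images, recipes, k=1)
r_at_2 = recall_at_k(images, recipes, k=2)
```

In the toy example, two of the four queries retrieve their own recipe first, so R@1 is 0.5, while every true recipe appears within the top two, so R@2 is 1.0.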
Abstract
Cross-modal retrieval between food images and recipe texts is an important task with applications in nutritional management, dietary logging, and cooking assistance. Existing methods predominantly rely on dual-encoder architectures with separate image and text encoders, requiring complex alignment strategies and task-specific network designs to bridge the semantic gap between modalities. In this work, we propose SIMMER (Single Integrated Multimodal Model for Embedding Recipes), which applies Multimodal Large Language Model (MLLM)-based embedding models, specifically VLM2Vec, to this task, replacing the conventional dual-encoder paradigm with a single unified encoder that processes both food images and recipe texts. We design prompt templates tailored to the structured nature of recipes, which consist of a title, ingredients, and cooking instructions, enabling effective embedding…
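As an illustration of what a prompt template over a structured recipe might look like, here is a minimal sketch. The instruction wording, the field labels, and the function name `build_recipe_prompt` are all hypothetical; the abstract does not give the actual template, so this should not be read as the authors' prompt. The sketch simply skips any component that is missing, which is one plausible way to handle incomplete recipe inputs.

```python
from typing import Optional, Sequence

# Hypothetical instruction text; the paper's real prompt wording is not shown here.
INSTRUCTION = "Represent the following recipe for retrieving the matching food image."

def build_recipe_prompt(
    title: Optional[str] = None,
    ingredients: Optional[Sequence[str]] = None,
    instructions: Optional[Sequence[str]] = None,
) -> str:
    """Assemble a text prompt from whichever recipe components are available."""
    parts = [INSTRUCTION]
    if title:
        parts.append(f"Title: {title}")
    if ingredients:
        parts.append("Ingredients: " + ", ".join(ingredients))
    if instructions:
        # Number the steps so their order survives flattening to one line.
        steps = " ".join(f"{i}. {step}" for i, step in enumerate(instructions, 1))
        parts.append("Instructions: " + steps)
    return "\n".join(parts)

full_prompt = build_recipe_prompt(
    title="Miso Soup",
    ingredients=["dashi", "miso paste", "tofu"],
    instructions=["Heat the dashi.", "Dissolve the miso.", "Add the tofu."],
)
# Incomplete input: only the title is known.
partial_prompt = build_recipe_prompt(title="Miso Soup")
```

The resulting string would then be passed, like an image, to the same embedding model; that call is omitted here because it depends on the specific model interface.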
