SIMMER: Cross-Modal Food Image–Recipe Retrieval via MLLM-Based Embedding

Keisuke Gomi; Keiji Yanai

arXiv:2604.15628 · cs.CV · April 20, 2026

TL;DR

SIMMER leverages Multimodal Large Language Models to unify food image and recipe text embedding, achieving state-of-the-art cross-modal retrieval performance on Recipe1M.

Contribution

SIMMER introduces a single-encoder approach: one MLLM-based embedding model, combined with tailored prompts and data augmentation, embeds both food images and recipe texts, replacing complex dual-encoder architectures.

Findings

01 Achieves an R@1 of 87.5% on 1k image-to-recipe retrieval.

02 Significantly outperforms previous methods on the Recipe1M dataset.

03 Demonstrates robustness to incomplete recipe inputs.
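
For readers unfamiliar with the metric, R@1 (Recall@1) is the fraction of queries whose true match is ranked first among all candidates. The following is a minimal sketch of that computation on toy vectors, assuming cosine similarity and that the image at index i is paired with the recipe at index i. The function names and the toy data are illustrative assumptions, not taken from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def recall_at_k(image_embs, recipe_embs, k=1):
    """Fraction of images whose paired recipe (same index) ranks in the top k."""
    hits = 0
    for i, img in enumerate(image_embs):
        scores = [cosine(img, rec) for rec in recipe_embs]
        # Candidate indices sorted from most to least similar.
        ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(image_embs)


# Toy embeddings (hypothetical): three images and their three paired recipes.
image_embs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
recipe_embs = [[0.9, 0.1], [0.8, 0.2], [0.7, 0.7]]

r_at_1 = recall_at_k(image_embs, recipe_embs, k=1)
r_at_2 = recall_at_k(image_embs, recipe_embs, k=2)
```

In this toy case the second image's closest recipe is the third one, so it counts as a miss at k=1 but a hit at k=2. In the "1k" setting the same computation runs over a pool of 1,000 image-recipe pairs.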

Abstract

Cross-modal retrieval between food images and recipe texts is an important task with applications in nutritional management, dietary logging, and cooking assistance. Existing methods predominantly rely on dual-encoder architectures with separate image and text encoders, requiring complex alignment strategies and task-specific network designs to bridge the semantic gap between modalities. In this work, we propose SIMMER (Single Integrated Multimodal Model for Embedding Recipes), which applies Multimodal Large Language Model (MLLM)-based embedding models, specifically VLM2Vec, to this task, replacing the conventional dual-encoder paradigm with a single unified encoder that processes both food images and recipe texts. We design prompt templates tailored to the structured nature of recipes, which consist of a title, ingredients, and cooking instructions, enabling effective embedding…
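
The idea of a prompt template built around a recipe's three fields can be sketched as follows. This is a hedged illustration only: the instruction wording, the field labels, and the function name `build_recipe_prompt` are assumptions rather than the paper's actual templates, and skipping missing fields is just one plausible way to accept incomplete recipes.

```python
# Hypothetical instruction text; the paper's real prompt is not given here.
INSTRUCTION = "Represent the given recipe for retrieving a matching food image."


def build_recipe_prompt(title=None, ingredients=None, instructions=None):
    """Format a recipe as a single text prompt for an embedding model.

    Any field may be omitted; only the fields that are present are emitted.
    """
    parts = []
    if title:
        parts.append(f"Title: {title}")
    if ingredients:
        parts.append("Ingredients: " + ", ".join(ingredients))
    if instructions:
        parts.append("Instructions: " + " ".join(instructions))
    if not parts:
        raise ValueError("at least one recipe field is required")
    return INSTRUCTION + "\n" + "\n".join(parts)


full_prompt = build_recipe_prompt(
    title="Tomato Soup",
    ingredients=["tomatoes", "onion", "salt"],
    instructions=["Chop the vegetables.", "Simmer for 20 minutes."],
)
title_only_prompt = build_recipe_prompt(title="Tomato Soup")
```

The resulting string would be passed to the unified encoder in the same way a food image is, so that both modalities land in one embedding space.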
