Enhancing Action and Ingredient Modeling for Semantically Grounded Recipe Generation
Guoshan Liu, Bin Zhu, Yian Li, Jingjing Chen, Chong-Wah Ngo, Yu-Gang Jiang

TL;DR
This paper introduces a semantically grounded framework for recipe generation that improves the accuracy of actions and ingredients in generated recipes by combining supervised and reinforcement fine-tuning, along with a validation module.
Contribution
It proposes a novel two-stage pipeline with semantic validation for recipe generation, enhancing semantic fidelity over previous multimodal models.
Findings
Achieves state-of-the-art performance on Recipe1M.
Significantly improves semantic accuracy of actions and ingredients.
Filters and corrects predictions with the Semantic Confidence Scoring and Rectification (SCSR) module.
Abstract
Recent advances in Multimodal Large Language Models (MLLMs) have enabled recipe generation from food images, yet outputs often contain semantically incorrect actions or ingredients despite high lexical scores (e.g., BLEU, ROUGE). To address this gap, we propose a semantically grounded framework that predicts and validates actions and ingredients as internal context for instruction generation. Our two-stage pipeline combines supervised fine-tuning (SFT) with reinforcement fine-tuning (RFT): SFT builds foundational accuracy using an Action-Reasoning dataset and ingredient corpus, while RFT employs frequency-aware rewards to improve long-tail action prediction and ingredient generalization. A Semantic Confidence Scoring and Rectification (SCSR) module further filters and corrects predictions. Experiments on Recipe1M show state-of-the-art performance and markedly improved semantic fidelity.
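The abstract does not spell out the form of the frequency-aware reward, so the following is only an illustrative sketch of the general idea: correctly predicting a rare (long-tail) action earns more than correctly predicting a common one. The function name `frequency_aware_reward`, the inverse-log-frequency weighting, and the toy counts are all assumptions made for this example, not the authors' formulation.

```python
import math
from collections import Counter


def frequency_aware_reward(predicted, reference, action_counts):
    """Score predicted actions against a reference set, weighting rare actions more.

    Each correctly predicted action earns an inverse-log-frequency weight. The
    total is normalised by the weight of all reference actions, so the result
    lies in [0, 1].
    """
    def weight(action):
        # Hypothetical weighting (an assumption): rarer actions get larger weights.
        return 1.0 / math.log(2.0 + action_counts.get(action, 0))

    reference_set = set(reference)
    if not reference_set:
        return 0.0
    hits = set(predicted) & reference_set
    return sum(weight(a) for a in hits) / sum(weight(a) for a in reference_set)


# Toy corpus counts, made up for illustration only.
counts = Counter({"mix": 5000, "bake": 3000, "julienne": 12})
reference = ["mix", "julienne"]

# Recovering only the rare action scores higher than recovering only the common one.
print(frequency_aware_reward(["mix"], reference, counts))
print(frequency_aware_reward(["julienne"], reference, counts))
```

This sketch rewards recall only and does not penalise spurious actions; a practical reward would likely also account for false positives.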
Taxonomy
Topics: Multimodal Machine Learning Applications · Nutritional Studies and Diet · Generative Adversarial Networks and Image Synthesis
