FoodMLLM-JP: Leveraging Multimodal Large Language Models for Japanese Recipe Generation
Yuki Imajuku, Yoko Yamakata, Kiyoharu Aizawa

TL;DR
Open multimodal large language models fine-tuned on Japanese recipe data outperform GPT-4o in ingredient generation and match it in cooking procedure generation, advancing food image understanding for Japanese cuisine.
Contribution
The study fine-tunes the open MLLMs LLaVA-1.5 and Phi-3 Vision on Japanese recipes and shows that they surpass GPT-4o in ingredient generation while remaining comparable in procedure generation.
Findings
Fine-tuned open models outperform GPT-4o in ingredient generation (F1 0.531 vs. 0.481)
Fine-tuned open models match GPT-4o in cooking procedure generation
Fine-tuning on Japanese recipes enhances food image understanding
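The ingredient-generation finding is reported as an F1 score. This summary does not describe how the paper matches ingredients, so the following is only a minimal sketch that assumes exact-match comparison of ingredient names after light Unicode normalization. The function names normalize_ingredient and ingredient_f1 are hypothetical and do not come from the paper.

```python
import unicodedata


def normalize_ingredient(name: str) -> str:
    """Assumed normalization: NFKC folding (e.g. half-width to full-width
    katakana), whitespace trimming, and lowercasing of any Latin characters."""
    return unicodedata.normalize("NFKC", name).strip().lower()


def ingredient_f1(predicted: list[str], reference: list[str]) -> float:
    """Set-based F1 between predicted and reference ingredient names.

    Assumption: an ingredient counts as correct only on an exact match
    after normalization. The paper may use a different matching rule.
    """
    pred = {normalize_ingredient(p) for p in predicted} - {""}
    ref = {normalize_ingredient(r) for r in reference} - {""}
    if not pred and not ref:
        return 1.0  # both empty: treat as perfect agreement
    true_positives = len(pred & ref)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred)
    recall = true_positives / len(ref)
    return 2 * precision * recall / (precision + recall)


# Illustrative data only, not taken from the paper's dataset.
example_score = ingredient_f1(
    ["卵", "砂糖", "牛乳"],
    ["卵", "砂糖", "バター", "小麦粉"],
)
```

In this illustrative case two of three predictions are correct (precision 2/3) and two of four reference ingredients are recovered (recall 1/2), giving an F1 of 4/7. A corpus-level score would typically average such per-recipe values, though the averaging scheme used in the paper is not stated here.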
Abstract
Food image understanding using recipe data has long been a research focus because of the diversity and complexity of the data. Moreover, food is inextricably linked to people's lives, making it a vital research area for practical applications such as dietary management. Recent Multimodal Large Language Models (MLLMs) have demonstrated remarkable capabilities, both in their vast knowledge and in their ability to handle language naturally. Although English is predominantly used, these models also support multiple languages, including Japanese. This suggests that MLLMs can significantly improve performance in food image understanding tasks. We fine-tuned the open MLLMs LLaVA-1.5 and Phi-3 Vision on a Japanese recipe dataset and benchmarked their performance against the closed model GPT-4o. We then evaluated the content of generated recipes, including…
Taxonomy
Topics: Topic Modeling · Advanced Text Analysis Techniques · Digital Humanities and Scholarship
