Enhancing Food-Domain Question Answering with a Multimodal Knowledge Graph: Hybrid QA Generation and Diversity Analysis
Srihari K B, Pushpak Bhattacharyya

TL;DR
This paper introduces a multimodal knowledge graph-based framework for food question answering, combining structured data and generative AI to improve accuracy, diversity, and fidelity in food-related queries and answers.
Contribution
It presents a novel unified food-domain QA system integrating a large-scale multimodal knowledge graph with generative models, enhancing answer quality and diversity.
Findings
Improved BERTScore by 16.2%
Reduced FID by 37.8%
Achieved 94.1% image reuse accuracy
Abstract
We propose a unified food-domain QA framework that combines a large-scale multimodal knowledge graph (MMKG) with generative AI. Our MMKG links 13,000 recipes, 3,000 ingredients, 140,000 relations, and 14,000 images. We generate 40,000 QA pairs using 40 templates and LLaVA/DeepSeek augmentation. Joint fine-tuning of Meta LLaMA 3.1-8B and Stable Diffusion 3.5-Large improves BERTScore by 16.2\%, reduces FID by 37.8\%, and boosts CLIP alignment by 31.1\%. Diagnostic analyses-CLIP-based mismatch detection (35.2\% to 7.3\%) and LLaVA-driven hallucination checks-ensure factual and visual fidelity. A hybrid retrieval-generation strategy achieves 94.1\% accurate image reuse and 85\% adequacy in synthesis. Our results demonstrate that structured knowledge and multimodal generation together enhance reliability and diversity in food QA.
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsMultimodal Machine Learning Applications · Topic Modeling · Nutritional Studies and Diet
