FoodCHA: Multi-Modal LLM Agent for Fine-Grained Food Analysis
Woojin Lee, Pranav Mekkoth, Ye Tian, Onat Gungor, Tajana Rosing

TL;DR
FoodCHA is a multi-modal framework that improves fine-grained food recognition through hierarchical decision-making, using a compact vision-language model for better accuracy and practical deployment.
Contribution
It introduces a hierarchical, multi-modal approach for fine-grained food analysis using a lightweight vision-language model, improving recognition precision over existing methods.
Findings
Outperforms Food-Llama-3.2-11B in category and subcategory recognition by 13.8% and 38.2%, respectively.
Achieves a 153.2% improvement in cooking style classification precision.
Utilizes a hierarchical decision process for better semantic consistency.
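The summary does not say whether these percentages are relative or absolute. A gain of 153.2% cannot be absolute percentage points, since precision is capped at 100%, so the minimal sketch below assumes relative improvement, i.e. (new - baseline) / baseline. The function names are illustrative and are not taken from the paper.

```python
def improvement_to_factor(relative_improvement_pct: float) -> float:
    """Convert a relative improvement in percent to a multiplicative factor.

    Assumption: improvement = (new - baseline) / baseline * 100.
    """
    return 1.0 + relative_improvement_pct / 100.0


def max_possible_baseline(relative_improvement_pct: float) -> float:
    """Largest baseline precision consistent with the improved precision staying <= 1.0."""
    return 1.0 / improvement_to_factor(relative_improvement_pct)


# Under the relative-improvement assumption, a 153.2% gain means the new
# precision is about 2.532 times the baseline precision.
factor = improvement_to_factor(153.2)

# Because precision cannot exceed 1.0, the baseline could be at most about 0.395.
ceiling = max_possible_baseline(153.2)
```

Read this way, the cooking style figure implies a low baseline precision (below roughly 40%), which is consistent with the claim that existing models struggle with fine-grained attributes.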
Abstract
The widespread adoption of camera-equipped mobile devices and wearables has enabled convenient capture of meal images, making food recognition a key component for real-time dietary monitoring. However, real-world food images present challenges due to high intra-class similarity and the frequent presence of multiple food items within a single image. While deep learning models achieve strong performance in coarse-grained classification, they often struggle to capture fine-grained attributes such as cooking style. Moreover, open-ended generation in modern vision-language models can produce non-canonical labels, limiting their practical deployment. We propose FoodCHA, a multimodal agentic framework that reformulates food recognition as a hierarchical decision-making process. By progressively anchoring predictions, FoodCHA guides subcategory identification using high-level categories and…
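The hierarchical idea in the abstract can be illustrated with a minimal sketch in which each stage may only choose from canonical labels allowed by the previous stage's prediction. This is not the authors' implementation: the taxonomy, the `Scorer` interface, the `toy_scorer` stand-in for a vision-language model, and the third (cooking style) stage are all assumptions made for illustration, since the abstract is truncated before the full pipeline is described.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical canonical taxonomy: category -> subcategory -> allowed cooking styles.
TAXONOMY: Dict[str, Dict[str, List[str]]] = {
    "noodles": {"ramen": ["boiled"], "pad thai": ["stir-fried"]},
    "chicken": {
        "chicken wings": ["fried", "grilled", "baked"],
        "chicken breast": ["grilled", "baked"],
    },
}

# A scorer maps (image, candidate labels, earlier predictions) to a score per label.
# In a real system this would wrap a vision-language model; here it is an assumption.
Scorer = Callable[[str, List[str], Tuple[str, ...]], Dict[str, float]]


def pick(scorer: Scorer, image: str, candidates: List[str],
         context: Tuple[str, ...]) -> str:
    """Return the highest-scoring label, restricted to the canonical candidates."""
    scores = scorer(image, candidates, context)
    return max(candidates, key=lambda label: scores.get(label, float("-inf")))


def classify(image: str, scorer: Scorer,
             taxonomy: Dict[str, Dict[str, List[str]]] = TAXONOMY) -> Dict[str, str]:
    """Predict coarse to fine, anchoring each stage on the previous prediction."""
    category = pick(scorer, image, list(taxonomy), ())
    subcategory = pick(scorer, image, list(taxonomy[category]), (category,))
    style = pick(scorer, image, taxonomy[category][subcategory],
                 (category, subcategory))
    return {"category": category, "subcategory": subcategory,
            "cooking_style": style}


def toy_scorer(image: str, candidates: List[str],
               context: Tuple[str, ...]) -> Dict[str, float]:
    """Toy stand-in for a model: the 'image' is a caption string, scored by substring match."""
    return {label: 1.0 if label in image else 0.0 for label in candidates}


result = classify("grilled chicken wings", toy_scorer)
```

Because every stage selects from a fixed candidate list rather than generating free text, the output can never contain a non-canonical label, and a subcategory can never contradict its parent category.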
