FoodCHA: Multi-Modal LLM Agent for Fine-Grained Food Analysis

Woojin Lee; Pranav Mekkoth; Ye Tian; Onat Gungor; Tajana Rosing

arXiv:2605.05499·cs.AI·May 8, 2026

TL;DR

FoodCHA is a multi-modal framework that improves fine-grained food recognition through hierarchical decision-making, using a compact vision-language model for better accuracy and practical deployment.

Contribution

FoodCHA introduces a hierarchical, multi-modal approach to fine-grained food analysis built on a lightweight vision-language model, improving recognition precision over existing methods.

Findings

01. Outperforms Food-Llama-3.2-11B in category and subcategory recognition by 13.8% and 38.2%, respectively.

02. Achieves a 153.2% improvement in cooking style classification precision.

03. Uses a hierarchical decision process for better semantic consistency.

Abstract

The widespread adoption of camera-equipped mobile devices and wearables has enabled convenient capture of meal images, making food recognition a key component of real-time dietary monitoring. However, real-world food images present challenges due to high intra-class similarity and the frequent presence of multiple food items within a single image. While deep learning models achieve strong performance in coarse-grained classification, they often struggle to capture fine-grained attributes such as cooking style. Moreover, open-ended generation in modern vision-language models can produce non-canonical labels, limiting their practical deployment. We propose FoodCHA, a multimodal agentic framework that reformulates food recognition as a hierarchical decision-making process. By progressively anchoring predictions, FoodCHA guides subcategory identification using high-level categories and…
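
The idea of progressively anchoring predictions can be illustrated with a minimal sketch. This is not the authors' implementation: the taxonomy, the `Scorer` interface, the prompts, and `toy_scorer` are all hypothetical stand-ins for a vision-language model call. The sketch only shows how each decision is restricted to the canonical labels allowed by the previous one, which is one way to avoid non-canonical free-form labels.

```python
from typing import Callable, Dict, List

# Hypothetical canonical taxonomy (illustration only, not from the paper):
# category -> subcategory -> allowed cooking styles.
TAXONOMY: Dict[str, Dict[str, List[str]]] = {
    "noodles": {"ramen": ["boiled", "stir-fried"], "spaghetti": ["boiled", "baked"]},
    "poultry": {"chicken wings": ["fried", "grilled", "baked"]},
}

# Assumed interface: given a question and candidate labels, return a score per label.
# In a real system this would wrap a vision-language model applied to the image.
Scorer = Callable[[str, List[str]], Dict[str, float]]


def pick(scorer: Scorer, question: str, options: List[str]) -> str:
    """Return the highest-scoring label, restricted to the canonical options."""
    scores = scorer(question, options)
    return max(options, key=lambda o: scores.get(o, float("-inf")))


def analyze(scorer: Scorer) -> Dict[str, str]:
    """Decide category, then subcategory, then cooking style, anchoring each step."""
    category = pick(scorer, "Which food category is shown?", list(TAXONOMY))
    subcategory = pick(
        scorer,
        f"The category is '{category}'. Which subcategory is shown?",
        list(TAXONOMY[category]),
    )
    style = pick(
        scorer,
        f"The dish is '{subcategory}'. Which cooking style was used?",
        TAXONOMY[category][subcategory],
    )
    return {"category": category, "subcategory": subcategory, "cooking_style": style}


def toy_scorer(question: str, options: List[str]) -> Dict[str, float]:
    """Stand-in for a model: fixed preferences, ignoring the question text."""
    preferences = {"poultry": 0.9, "chicken wings": 0.8, "grilled": 0.7, "ramen": 0.95}
    return {o: preferences.get(o, 0.1) for o in options}


result = analyze(toy_scorer)
print(result)
```

With the toy scorer, "ramen" has the highest raw preference but is never offered as an option once the category has been anchored to "poultry", so the three predicted labels stay mutually consistent.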
