TL;DR
CaLoRAify is a vision-language model framework that uses a large curated dataset and LoRA techniques to accurately estimate calories from food images while supporting conversational interactions.
Contribution
The paper introduces CaLoRAify, a novel VLM framework utilizing LoRA and RAG techniques with a new dataset for improved calorie estimation from food images.
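For readers unfamiliar with LoRA (low-rank adaptation), the general idea can be sketched as below. This is a generic illustration, not the CaLoRAify implementation: the function names, the dimensions, the rank `r`, and the scaling factor `alpha` are all assumptions chosen for the example. A frozen weight matrix `W` is left untouched, and only two small matrices `A` and `B` are trained, so the adapted layer computes `W x + (alpha / r) * B (A x)`.

```python
import random


def matvec(M, v):
    """Multiply matrix M (a list of rows) by vector v."""
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in M]


def lora_forward(W, A, B, x, alpha=8.0):
    """Return W x + (alpha / r) * B (A x).

    W: frozen weight, shape d_out x d_in (not trained).
    A: trainable down-projection, shape r x d_in.
    B: trainable up-projection, shape d_out x r.
    """
    r = len(A)
    base = matvec(W, x)                 # output of the frozen layer
    update = matvec(B, matvec(A, x))    # low-rank correction
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, update)]


# Hypothetical sizes, chosen only to keep the example small.
random.seed(0)
d_in, d_out, r = 16, 16, 2

W = [[random.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]
A = [[random.gauss(0, 0.01) for _ in range(d_in)] for _ in range(r)]
B = [[0.0] * r for _ in range(d_out)]  # zero init: the adapter starts as a no-op
x = [random.gauss(0, 1) for _ in range(d_in)]

y = lora_forward(W, A, B, x)

full_params = d_out * d_in          # parameters in the frozen matrix
lora_params = r * (d_in + d_out)    # parameters actually trained
print(f"trainable parameters: {lora_params} of {full_params}")
```

Because `B` starts at zero, the adapted layer initially reproduces the frozen layer exactly, and training only moves the small `A` and `B` matrices. In practice a library such as Hugging Face PEFT would typically handle this rather than hand-written code.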
Findings
Achieved accurate calorie estimation from monocular food images.
Enhanced VLM performance in calorie estimation with LoRA and RAG.
Open-sourced code and dataset for community use.
Abstract
Obesity, sometimes called the heavy issue, is a leading cause of preventable chronic diseases worldwide. Traditional calorie estimation tools often rely on specific data formats or complex pipelines, limiting their practicality in real-world scenarios. Recently, vision-language models (VLMs) have excelled in understanding real-world contexts and enabling conversational interactions, making them ideal for downstream tasks such as ingredient analysis. However, applying VLMs to calorie estimation requires domain-specific data and alignment strategies. To this end, we curated CalData, a dataset of 330K image-text pairs tailored for ingredient recognition and calorie estimation, combining a large-scale recipe dataset with detailed nutritional instructions for robust vision-language training. Built upon this dataset, we present CaLoRAify, a novel VLM framework aligning ingredient…
