Advancing Food Nutrition Estimation via Visual-Ingredient Feature Fusion
Huiyan Qi, Bin Zhu, Chong-Wah Ngo, Jingjing Chen, Ee-Peng Lim

TL;DR
This paper introduces FastFood, a large dataset with nutritional annotations, and proposes VIF$^2$, a model that fuses visual and ingredient features to improve food nutrition estimation accuracy.
Contribution
The paper presents FastFood, a new dataset with ingredient and nutritional annotations, and VIF$^2$, a model-agnostic fusion method that improves nutrition estimation by integrating visual and ingredient information.
Findings
VIF$^2$ improves nutrition prediction accuracy across multiple backbones.
Ingredient information significantly boosts estimation performance.
Large multimodal models refine ingredient predictions during testing.
Abstract
Nutrition estimation is an important component of promoting healthy eating and mitigating diet-related health risks. Despite advances in tasks such as food classification and ingredient recognition, progress in nutrition estimation is limited due to the lack of datasets with nutritional annotations. To address this issue, we introduce FastFood, a dataset with 84,446 images across 908 fast food categories, featuring ingredient and nutritional annotations. In addition, we propose a new model-agnostic Visual-Ingredient Feature Fusion (VIF$^2$) method to enhance nutrition estimation by integrating visual and ingredient features. Ingredient robustness is improved through synonym replacement and resampling strategies during training. The ingredient-aware visual feature fusion module combines ingredient features and visual representation to achieve accurate nutritional prediction. During…
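The abstract mentions synonym replacement and resampling of ingredients during training but, as excerpted here, gives no details. The following is a minimal sketch of what such an augmentation step could look like, not the authors' implementation. The synonym table, the function name `augment_ingredients`, the parameters `replace_prob` and `keep_ratio`, and the reading of "resampling" as drawing a random subset of the ingredient list are all assumptions made for illustration.

```python
import random

# Hypothetical synonym table; the vocabulary used in the paper is not given here.
SYNONYMS = {
    "beef patty": ["hamburger patty", "ground beef patty"],
    "fries": ["french fries", "chips"],
}


def augment_ingredients(ingredients, synonyms=SYNONYMS, replace_prob=0.5,
                        keep_ratio=0.8, rng=None):
    """Return a perturbed copy of an ingredient list for one training step.

    Assumed behaviour (not confirmed by the source):
    - synonym replacement swaps a name for a listed synonym with
      probability `replace_prob`;
    - resampling keeps a random subset of about `keep_ratio` of the items.
    """
    if rng is None:
        rng = random.Random()
    if not ingredients:
        return []

    # Synonym replacement: names without an entry in the table stay unchanged.
    replaced = [
        rng.choice(synonyms[name])
        if name in synonyms and rng.random() < replace_prob
        else name
        for name in ingredients
    ]

    # Resampling: draw a subset without replacement, keeping at least one item.
    k = max(1, round(keep_ratio * len(replaced)))
    return rng.sample(replaced, k)


if __name__ == "__main__":
    sample = ["beef patty", "cheese", "fries", "lettuce", "bun"]
    print(augment_ingredients(sample, rng=random.Random(0)))
```

Passing a seeded `random.Random` makes the perturbation reproducible, which helps when debugging a data loader. In a real pipeline the augmented list would then be fed to whatever ingredient encoder is in use.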
