Multimodal ML: Quantifying the Improvement of Calorie Estimation Through Image-Text Pairs
Arya Narang

TL;DR
This study evaluates how combining image and text data improves calorie estimation accuracy, showing a modest but statistically significant reduction in mean absolute error using a multimodal neural network.
Contribution
It demonstrates that pairing short dish names with images modestly improves calorie prediction accuracy over an image-only model, using a multimodal CNN architecture.
Findings
Multimodal model reduces MAE by 1.06 kcal, from 84.76 kcal to 83.70 kcal (1.25%)
The improvement is statistically significant
Uses Nutrition5k dataset and TensorFlow
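The summary does not say which significance test was used. As a minimal sketch only, one common way to compare two models evaluated on the same dishes is a paired sign-flip permutation test on per-dish absolute errors. The function name `paired_permutation_test`, its parameters, and the toy error values below are hypothetical illustrations, not the paper's method or data.

```python
import random

def paired_permutation_test(errors_a, errors_b, n_resamples=10_000, seed=0):
    """Two-sided sign-flip permutation test on paired per-sample errors.

    Returns the observed mean difference (a - b) and an estimated p-value.
    Assumption: this is an illustrative test, not the one used in the paper.
    """
    if len(errors_a) != len(errors_b):
        raise ValueError("error lists must be paired (same length)")
    diffs = [a - b for a, b in zip(errors_a, errors_b)]
    n = len(diffs)
    observed = sum(diffs) / n
    rng = random.Random(seed)
    extreme = 0
    for _ in range(n_resamples):
        # Under the null hypothesis the sign of each paired difference is arbitrary.
        permuted = sum(d if rng.random() < 0.5 else -d for d in diffs) / n
        if abs(permuted) >= abs(observed):
            extreme += 1
    # Add-one correction keeps the estimated p-value strictly positive.
    p_value = (extreme + 1) / (n_resamples + 1)
    return observed, p_value

# Hypothetical per-dish absolute errors in kcal (not from Nutrition5k).
image_only_errors = [92.0, 80.5, 77.0, 101.2, 64.3, 88.8]
multimodal_errors = [90.1, 79.9, 78.2, 98.7, 63.0, 87.5]
mean_diff, p = paired_permutation_test(image_only_errors, multimodal_errors)
print(f"mean error reduction: {mean_diff:.2f} kcal, p = {p:.4f}")
```

A paired test fits this setting because both models are scored on the same test dishes, so per-dish differences remove much of the variance between dishes.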
Abstract
This paper determines the extent to which short textual inputs (in this case, names of dishes) can improve calorie estimation compared to an image-only baseline model, and whether any improvement is statistically significant. The study uses the TensorFlow library and the Nutrition5k dataset (curated by Google) to train both an image-only CNN and a multimodal CNN that accepts both text and an image as input. With the multimodal model, the MAE of calorie estimates fell by 1.06 kcal, from 84.76 kcal to 83.70 kcal (a 1.25% improvement).
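The reported figures can be checked with a few lines of arithmetic. The sketch below recomputes the absolute and relative MAE reduction from the two values in the abstract; the helper `mae` is included only to show the metric's definition, and its example inputs are hypothetical.

```python
def mae(y_true, y_pred):
    """Mean absolute error between paired true and predicted values."""
    if len(y_true) != len(y_pred):
        raise ValueError("inputs must have the same length")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

# MAE values reported in the abstract (kcal).
baseline_mae = 84.76    # image-only CNN
multimodal_mae = 83.70  # image + dish-name CNN

absolute_reduction = baseline_mae - multimodal_mae
relative_pct = absolute_reduction / baseline_mae * 100

print(f"{absolute_reduction:.2f} kcal, {relative_pct:.2f}%")  # prints "1.06 kcal, 1.25%"
```

The relative improvement is measured against the image-only baseline, which is why the denominator is 84.76 kcal rather than 83.70 kcal.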
Taxonomy
Topics: Nutritional Studies and Diet · Nutrition and Health in Aging · Consumer Attitudes and Food Labeling
