Multimodal ML: Quantifying the Improvement of Calorie Estimation Through Image-Text Pairs

Arya Narang

arXiv:2511.11705 · cs.LG · November 18, 2025

TL;DR

This study evaluates how combining image and text data improves calorie estimation accuracy, showing a modest but statistically significant reduction in mean absolute error using a multimodal neural network.

Contribution

It demonstrates that integrating short dish names with images enhances calorie prediction accuracy over image-only models, using a multimodal CNN architecture.
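The exact layers of the paper's network are not described on this page, so the following is only a minimal sketch of what a two-input image-plus-text model can look like in TensorFlow's Keras functional API. The function name `build_multimodal_model`, the image size, vocabulary size, token length, and all layer widths are illustrative assumptions, not the author's configuration.

```python
import tensorflow as tf
from tensorflow.keras import layers


def build_multimodal_model(image_shape=(128, 128, 3), vocab_size=1000, max_tokens=8):
    """Hypothetical image + dish-name regressor; all sizes are assumptions."""
    # Image branch: a small CNN that reduces the photo to a feature vector.
    image_in = layers.Input(shape=image_shape, name="image")
    x = layers.Rescaling(1.0 / 255)(image_in)
    x = layers.Conv2D(16, 3, activation="relu")(x)
    x = layers.MaxPooling2D()(x)
    x = layers.Conv2D(32, 3, activation="relu")(x)
    x = layers.GlobalAveragePooling2D()(x)

    # Text branch: integer token ids of the dish name, embedded and averaged.
    text_in = layers.Input(shape=(max_tokens,), dtype="int32", name="dish_name_tokens")
    t = layers.Embedding(vocab_size, 16)(text_in)
    t = layers.GlobalAveragePooling1D()(t)

    # Fusion: concatenate both feature vectors, then regress a single kcal value.
    fused = layers.Concatenate()([x, t])
    h = layers.Dense(64, activation="relu")(fused)
    calories = layers.Dense(1, name="calories")(h)

    model = tf.keras.Model(inputs=[image_in, text_in], outputs=calories)
    # Training on mean absolute error matches the metric reported in the paper.
    model.compile(optimizer="adam", loss="mae")
    return model
```

Dropping the text branch and feeding the image features straight into the dense layers would give a comparable image-only baseline, which keeps the comparison between the two models focused on the added dish name.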

Findings

01 The multimodal model reduces MAE by 1.06 kcal (from 84.76 kcal to 83.70 kcal)

02 The improvement is statistically significant

03 Both models are trained with TensorFlow on the Nutrition5k dataset
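This page does not say which significance test the paper used. As one possible way to check such a claim when both models are evaluated on the same test dishes, the sketch below runs a paired sign-flip permutation test on per-dish absolute errors. The function name `paired_sign_flip_p_value` and its parameters are hypothetical, and the choice of test is an assumption rather than the author's method.

```python
import random


def paired_sign_flip_p_value(errors_baseline, errors_multimodal,
                             n_resamples=10_000, seed=0):
    """One-sided p-value for 'baseline errors are larger on average'.

    Under the null hypothesis the two models are interchangeable, so the sign
    of each per-dish error difference is equally likely to be + or -.
    """
    diffs = [b - m for b, m in zip(errors_baseline, errors_multimodal)]
    n = len(diffs)
    observed = sum(diffs) / n

    rng = random.Random(seed)
    at_least_as_extreme = 0
    for _ in range(n_resamples):
        # Randomly flip the sign of each paired difference and recompute the mean.
        resampled = sum(d if rng.random() < 0.5 else -d for d in diffs) / n
        if resampled >= observed:
            at_least_as_extreme += 1

    # Add-one correction keeps the estimate strictly above zero.
    return (at_least_as_extreme + 1) / (n_resamples + 1)
```

A small p-value would indicate that a mean error reduction of the observed size is unlikely to arise from chance pairing alone; a paired t-test or Wilcoxon signed-rank test would be alternative choices.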

Abstract

This paper determines the extent to which short textual inputs (in this case, names of dishes) can improve calorie estimation compared to an image-only baseline model, and whether any improvements are statistically significant. It uses the TensorFlow library and the Nutrition5k dataset (curated by Google) to train both an image-only CNN and a multimodal CNN that accepts both text and an image as input. With the multimodal model, the MAE of the calorie estimates fell by 1.06 kcal, from 84.76 kcal to 83.70 kcal (a 1.25% improvement).
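The reported 1.25% figure follows from the two MAE values. The short sketch below reproduces that arithmetic and shows how MAE itself is computed; the helper names `mean_absolute_error` and `relative_reduction_percent` are illustrative, and only the two MAE values come from the abstract.

```python
def mean_absolute_error(y_true, y_pred):
    """Average of |true - predicted| over all samples (here, kcal per dish)."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def relative_reduction_percent(baseline, improved):
    """Percentage by which `improved` is lower than `baseline`."""
    return 100.0 * (baseline - improved) / baseline


# MAE values reported in the abstract, in kcal.
baseline_mae = 84.76    # image-only CNN
multimodal_mae = 83.70  # image + dish-name CNN

absolute_reduction = baseline_mae - multimodal_mae
relative_reduction = relative_reduction_percent(baseline_mae, multimodal_mae)
print(f"{absolute_reduction:.2f} kcal, {relative_reduction:.2f}%")  # 1.06 kcal, 1.25%
```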

Taxonomy

Topics: Nutritional Studies and Diet · Nutrition and Health in Aging · Consumer Attitudes and Food Labeling