DietDelta: A Vision-Language Approach for Dietary Assessment via Before-and-After Images

Gautham Vinod; Siddeshwar Raghavan; Bruce Coburn; Fengqing Zhu

arXiv:2604.06352·cs.CV·April 9, 2026

TL;DR

DietDelta is a vision-language framework that estimates the weight of individual food items, and how much of each was consumed, from paired before-and-after eating images, without depth sensing, multi-view imagery, or explicit segmentation masks.

Contribution

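The paper uses natural language prompts to localize specific food items and estimate their weight from a single RGB image, then predicts the weight difference between paired before-and-after images with a two-stage training strategy. The authors report consistent improvements over existing methods on three publicly available datasets.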

Findings

01

Consistently outperforms existing dietary image analysis methods.

02

Effective in localizing and estimating weights of specific food items.

03

Establishes a new baseline for before-and-after dietary image assessment.

Abstract

Accurate dietary assessment is critical for precision nutrition, yet most image-based methods rely on a single pre-consumption image and provide only coarse, meal-level estimates. These approaches cannot determine what was actually consumed and often require restrictive inputs such as depth sensing, multi-view imagery, or explicit segmentation. In this paper, we propose a simple vision-language framework for food-item-level nutritional analysis using paired before-and-after eating images. Instead of relying on rigid segmentation masks, our method leverages natural language prompts to localize specific food items and estimate their weight directly from a single RGB image. We further estimate food consumption by predicting weight differences between paired images using a two-stage training strategy. We evaluate our method on three publicly available datasets and demonstrate consistent…
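
To make the before-and-after idea concrete, the sketch below shows only the arithmetic that the consumption estimate corresponds to: the consumed weight of an item is its weight before eating minus its weight after. It is not the authors' model, which predicts the difference through a trained vision-language network. The function name `consumed_weights`, the food names, and all gram values are hypothetical, and the choices to clamp negative differences at zero and to treat a missing item as fully eaten are assumptions made for illustration.

```python
from typing import Dict


def consumed_weights(before: Dict[str, float], after: Dict[str, float]) -> Dict[str, float]:
    """Return per-item consumed weight in grams (before minus after).

    Assumptions for this sketch only:
    - an item missing from `after` is treated as fully eaten;
    - a negative difference (estimation noise) is clamped to zero.
    """
    consumed = {}
    for item, w_before in before.items():
        w_after = after.get(item, 0.0)
        consumed[item] = max(w_before - w_after, 0.0)
    return consumed


# Hypothetical per-item weight estimates (grams) for one meal.
before = {"rice": 180.0, "chicken": 120.0, "broccoli": 60.0}
after = {"rice": 45.0, "chicken": 0.0, "broccoli": 65.0}

result = consumed_weights(before, after)
print(result)  # {'rice': 135.0, 'chicken': 120.0, 'broccoli': 0.0}
```

The broccoli entry shows why some handling of negative differences is needed when the two weights come from independent estimates: the after-image estimate can exceed the before-image estimate even though nothing was added to the plate.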
