TL;DR
V-Nutri uses process cues from egocentric cooking videos to improve dish-level nutrition estimation, addressing a limitation of static image-based methods, which largely rely on a single image of the finished dish.
Contribution
The paper introduces V-Nutri, a novel staged framework combining visual backbones and process keyframes for enhanced nutrition estimation from egocentric videos.
Findings
Process cues improve nutrition estimation accuracy.
The size of the benefit depends on backbone capacity and event detection quality.
The HD-EPIC dataset is further annotated to establish the first benchmark for video-based nutrition estimation.
Abstract
Nutrition estimation of meals from visual data is an important problem for dietary monitoring and computational health, but existing approaches largely rely on single images of the finished dish. This setting is fundamentally limited because many nutritionally relevant ingredients and transformations, such as oils, sauces, and mixed components, become visually ambiguous after cooking, making accurate calorie and macronutrient estimation difficult. In this paper, we investigate whether cooking process information from egocentric cooking videos can contribute to dish-level nutrition estimation. First, we extend the HD-EPIC dataset with further manual annotations and establish the first benchmark for video-based nutrition estimation. Most importantly, we propose V-Nutri, a staged framework that combines Nutrition5K-pretrained visual backbones with a lightweight fusion module that…
