TL;DR
This paper introduces OmniFood8K, a large multimodal dataset for Chinese food nutrition estimation, and proposes a novel RGB-to-nutrition framework utilizing depth prediction and frequency domain feature fusion.
Contribution
It provides a new, comprehensive dataset of Chinese cuisine and develops an end-to-end, RGB-based nutrition prediction model built on frequency-aligned feature fusion.
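This summary does not describe how the frequency-aligned fusion works. The sketch below is a generic, hypothetical illustration of fusing an RGB feature map with a feature map derived from predicted depth in the Fourier domain. It is not the authors' method. Each map is split into a low- and a high-frequency band with a radial mask, the two sources are mixed separately within each band, and the result is transformed back. The names `frequency_fuse` and `band_masks` and the parameters `cutoff`, `alpha_low`, and `alpha_high` are assumptions. In a trained model, learned weights would likely replace the fixed scalars used here.

```python
import numpy as np


def band_masks(h, w, cutoff):
    """Return (low, high) masks over an unshifted 2-D FFT grid of size h x w."""
    # Normalized frequencies in cycles per sample, laid out as np.fft.fft2 returns them.
    fy = np.fft.fftfreq(h)[:, None]
    fx = np.fft.fftfreq(w)[None, :]
    radius = np.sqrt(fy**2 + fx**2)
    low = (radius <= cutoff).astype(float)
    return low, 1.0 - low


def frequency_fuse(rgb_feat, depth_feat, cutoff=0.1, alpha_low=0.5, alpha_high=0.5):
    """Fuse two (C, H, W) feature maps band by band in the Fourier domain.

    Hypothetical sketch: alpha_low / alpha_high weight the RGB features in the
    low / high band, and the depth features receive the complementary weight.
    """
    if rgb_feat.shape != depth_feat.shape:
        raise ValueError("feature maps must share the same shape")
    _, h, w = rgb_feat.shape
    low, high = band_masks(h, w, cutoff)

    # Per-channel 2-D FFT over the spatial axes.
    f_rgb = np.fft.fft2(rgb_feat, axes=(-2, -1))
    f_dep = np.fft.fft2(depth_feat, axes=(-2, -1))

    # Mix the two sources separately within each frequency band.
    fused_low = low * (alpha_low * f_rgb + (1.0 - alpha_low) * f_dep)
    fused_high = high * (alpha_high * f_rgb + (1.0 - alpha_high) * f_dep)

    # Back to the spatial domain. The masks are symmetric, so the imaginary
    # part is only numerical noise.
    return np.fft.ifft2(fused_low + fused_high, axes=(-2, -1)).real


# Usage with random stand-in features (2 channels, 8 x 8 spatial grid).
rng = np.random.default_rng(0)
rgb = rng.standard_normal((2, 8, 8))
depth = rng.standard_normal((2, 8, 8))
fused = frequency_fuse(rgb, depth, cutoff=0.2, alpha_low=0.3, alpha_high=0.8)
print(fused.shape)  # (2, 8, 8)
```

Mixing per band, rather than in the spatial domain, lets one source dominate coarse structure while the other dominates fine detail. Whether OmniFood8K's model uses this kind of band split is not stated here.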
Findings
The proposed method outperforms existing approaches on multiple datasets.
The hierarchical frequency-aligned fusion improves feature representation.
The synthetic dataset, NutritionSynth-115K, improves model robustness and generalization.
Abstract
Accurate estimation of food nutrition plays a vital role in promoting healthy dietary habits and personalized diet management. Most existing food datasets focus primarily on Western cuisines and cover Chinese dishes poorly, which limits accurate nutritional estimation for Chinese meals. In addition, many state-of-the-art nutrition prediction methods rely on depth sensors, which restricts their applicability in everyday scenarios. To address these limitations, we introduce OmniFood8K, a comprehensive multimodal dataset comprising 8,036 food samples, each with detailed nutritional annotations and multi-view images. To further enhance models' nutritional prediction capability, we construct NutritionSynth-115K, a large-scale synthetic dataset that introduces compositional variations while preserving precise nutritional labels. Moreover, we propose an end-to-end framework for…
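The abstract says NutritionSynth-115K introduces compositional variations "while preserving precise nutritional labels." The abstract does not describe the synthesis procedure. One plausible way such labels can stay exact is that nutrient totals are additive over components and scale linearly with portion size. The sketch below illustrates only that bookkeeping, under that assumption. The label schema (mass, calories, fat, carbohydrate, protein), the `Nutrition` class, `compose_label`, and all numeric values are hypothetical and are not taken from the paper or its datasets.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class Nutrition:
    # Hypothetical label schema; the actual OmniFood8K fields may differ.
    mass_g: float
    calories_kcal: float
    fat_g: float
    carb_g: float
    protein_g: float

    def scaled(self, factor: float) -> "Nutrition":
        # Assumes every nutrient total scales linearly with portion size.
        return Nutrition(*(getattr(self, f.name) * factor for f in fields(self)))

    def __add__(self, other: "Nutrition") -> "Nutrition":
        # Assumes the totals of a combined plate are the sums of its parts.
        return Nutrition(*(getattr(self, f.name) + getattr(other, f.name) for f in fields(self)))


def compose_label(parts):
    """Return the label of a synthetic dish built from (label, portion_factor) pairs."""
    total = Nutrition(0.0, 0.0, 0.0, 0.0, 0.0)
    for label, factor in parts:
        if factor < 0:
            raise ValueError("portion factor must be non-negative")
        total = total + label.scaled(factor)
    return total


# Usage with made-up component labels (illustrative values only).
rice = Nutrition(100.0, 130.0, 0.3, 28.0, 2.7)
stir_fry = Nutrition(200.0, 300.0, 20.0, 10.0, 18.0)
plate = compose_label([(rice, 1.5), (stir_fry, 0.5)])
print(plate.mass_g, plate.calories_kcal)  # 250.0 345.0
```

Because the label is computed from component labels instead of being re-measured, any rearrangement or rescaling of the components yields a new training sample whose target is known by construction.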
