FMiFood: Multi-modal Contrastive Learning for Food Image Classification
Xinyue Pan, Jiangpeng He, Fengqing Zhu

TL;DR
FMiFood is a multi-modal contrastive learning framework that combines food images with contextual text descriptions, including GPT-4-enriched descriptions, to improve food image classification accuracy despite intra-class diversity and inter-class similarity.
Contribution
The paper presents a novel multi-modal contrastive learning approach that integrates textual context and a flexible matching technique to enhance food image classification performance.
Findings
Improved accuracy on UPMC-101 and VFN datasets.
Effective integration of GPT-4 enriched descriptions.
Enhanced discriminative feature learning for food images.
Abstract
Food image classification is the fundamental step in image-based dietary assessment, which aims to estimate participants' nutrient intake from eating occasion images. A common challenge with food images is intra-class diversity and inter-class similarity, which can significantly hinder classification performance. To address this issue, we introduce a novel multi-modal contrastive learning framework called FMiFood, which learns more discriminative features by integrating additional contextual information, such as food category text descriptions, to enhance classification accuracy. Specifically, we propose a flexible matching technique that improves the similarity matching between text and image embeddings so that it focuses on multiple pieces of key information. Furthermore, we incorporate the classification objectives into the framework and explore the use of GPT-4 to enrich the text descriptions and…
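As a rough illustration of the idea in the abstract, the sketch below pairs a standard CLIP-style symmetric image-text contrastive loss with a cross-entropy classification term. It is not the FMiFood implementation: the flexible matching technique is not reproduced here, and the function names, the temperature of 0.07, the `weight` parameter, and the one-text-per-image pairing are all assumptions made for the example. In particular, a batch containing several images of the same food category would need a softer pairing than the diagonal used here.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    # Scale each embedding to unit length so dot products are cosine similarities.
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def log_softmax(logits, axis=-1):
    # Numerically stable log-softmax along the given axis.
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def contrastive_loss(image_emb, text_emb, temperature=0.07):
    # Assumption: row i of image_emb is paired with row i of text_emb.
    img = l2_normalize(image_emb)
    txt = l2_normalize(text_emb)
    logits = img @ txt.T / temperature            # (batch, batch) similarity matrix
    idx = np.arange(len(img))
    loss_i2t = -log_softmax(logits, axis=1)[idx, idx].mean()  # image -> text
    loss_t2i = -log_softmax(logits, axis=0)[idx, idx].mean()  # text -> image
    return 0.5 * (loss_i2t + loss_t2i)

def classification_loss(class_logits, labels):
    # Plain cross-entropy over food categories.
    idx = np.arange(len(labels))
    return -log_softmax(class_logits, axis=1)[idx, labels].mean()

def joint_loss(image_emb, text_emb, class_logits, labels, weight=1.0):
    # Hypothetical weighting between the two objectives.
    return (contrastive_loss(image_emb, text_emb)
            + weight * classification_loss(class_logits, labels))
```

With matched image and text embeddings the contrastive term approaches zero, while mismatched pairs drive it up, which is the behavior that pushes the encoder toward more discriminative features.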
Taxonomy
Topics: Advanced Chemical Sensor Technologies · Identification and Quantification in Food · Culinary Culture and Tourism
Methods: Attention Is All You Need · Linear Layer · Layer Normalization · Multi-Head Attention · Position-Wise Feed-Forward Layer · Adam · Byte Pair Encoding · Softmax · Absolute Position Encodings · Dense Connections
