# An Improved Diffusion Model for Generating Images of a Single Category of Food on a Small Dataset

**Authors:** Zitian Chen, Zhiyong Xiao, Dinghui Wu, Qingbing Sang

PMC · DOI: 10.3390/foods15030443 · 2026-01-26

## TL;DR

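This paper introduces an ingredient-aware diffusion model, built on the Masked Diffusion Transformer (MaskDiT), that generates high-fidelity food images from limited data. Augmenting training data with the synthetic images raised downstream food classification accuracy from 95.65% to 96.20%.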

## Contribution

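An Ingredient-Aware Diffusion Model, built on the Masked Diffusion Transformer (MaskDiT) and equipped with a Label and Ingredients Encoding (LIE) module and a Cross-Attention (CA) mechanism, enables high-fidelity food image synthesis in data-scarce scenarios.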
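
To make the conditioning idea concrete, here is a minimal NumPy sketch of generic scaled dot-product cross-attention, in which image tokens query a short sequence of conditioning tokens (one label embedding plus a few ingredient embeddings). The summary does not give the paper's layer layout, dimensions, or LIE design, so every name and shape below (`cross_attention`, `cond_tokens`, `d_model = 16`, one label and four ingredients) is a hypothetical choice for illustration, not the authors' implementation.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(image_tokens, cond_tokens, w_q, w_k, w_v):
    """Single-head cross-attention: image tokens attend to conditioning tokens.

    image_tokens: (N, d) latent patch tokens (queries)
    cond_tokens:  (M, d) label + ingredient embeddings (keys and values)
    Returns the attended output (N, d) and the attention weights (N, M).
    """
    q = image_tokens @ w_q
    k = cond_tokens @ w_k
    v = cond_tokens @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = softmax(scores, axis=-1)  # each row sums to 1
    return weights @ v, weights


# Hypothetical sizes: 8 image tokens, 1 label + 4 ingredient embeddings.
rng = np.random.default_rng(0)
d_model = 16
image_tokens = rng.normal(size=(8, d_model))
cond_tokens = rng.normal(size=(5, d_model))
w_q, w_k, w_v = (rng.normal(size=(d_model, d_model)) * d_model**-0.5
                 for _ in range(3))

out, attn = cross_attention(image_tokens, cond_tokens, w_q, w_k, w_v)
print(out.shape, attn.shape)
```

Each row of `attn` shows how strongly one image token draws on each label or ingredient embedding, which is the generic mechanism by which composition can influence appearance.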

## Key findings

- The proposed model achieves state-of-the-art generation quality on the Food-101 and VireoFood-172 datasets, even in data-scarce scenarios.
- Using synthetic images for data augmentation improved downstream food classification accuracy from 95.65% to 96.20%, a gain of 0.55 percentage points.
- The model's linear interpolation strategy stabilizes training with limited data samples.
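
The linear interpolation idea in the last finding can be illustrated with a short sketch. The summary does not state the paper's exact formulation, so the code below assumes the common straight-line form `x_t = (1 - t) * x0 + t * noise` with `t` in `[0, 1]`; the function name `interpolate` and the tensor shapes are hypothetical.

```python
import numpy as np


def interpolate(x0, noise, t):
    """Blend clean samples and noise along a straight line.

    x0, noise: arrays of shape (B, ...) with identical shapes
    t:         array of shape (B,), values in [0, 1]
    Returns x_t = (1 - t) * x0 + t * noise, broadcasting t over each sample.
    """
    t = np.asarray(t, dtype=x0.dtype).reshape(-1, *([1] * (x0.ndim - 1)))
    return (1.0 - t) * x0 + t * noise


rng = np.random.default_rng(0)
x0 = rng.uniform(-1.0, 1.0, size=(4, 3, 8, 8))  # stand-in for image latents
noise = rng.normal(size=x0.shape)
t = np.array([0.0, 0.25, 0.75, 1.0])

xt = interpolate(x0, noise, t)
print(np.allclose(xt[0], x0[0]), np.allclose(xt[-1], noise[-1]))  # True True
```

At `t = 0` the sample is the clean image and at `t = 1` it is pure noise, so every intermediate state lies on a straight path between the two. How this path stabilizes training on small datasets is described in the full paper.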

## Abstract

In the era of the digital food economy, high-fidelity food images are critical for applications ranging from visual e-commerce presentation to automated dietary assessment. However, developing robust computer vision systems for food analysis is often hindered by data scarcity for long-tail or regional dishes. To address this challenge, we propose a novel high-fidelity food image synthesis framework as an effective data augmentation tool. Unlike generic generative models, our method introduces an Ingredient-Aware Diffusion Model based on the Masked Diffusion Transformer (MaskDiT) architecture. Specifically, we design a Label and Ingredients Encoding (LIE) module and a Cross-Attention (CA) mechanism to explicitly model the relationship between food composition and visual appearance, simulating the “cooking” process digitally. Furthermore, to stabilize training on limited data samples, we incorporate a linear interpolation strategy into the diffusion process. Extensive experiments on the Food-101 and VireoFood-172 datasets demonstrate that our method achieves state-of-the-art generation quality even in data-scarce scenarios. Crucially, we validate the practical utility of our synthetic images: utilizing them for data augmentation improved the accuracy of downstream food classification tasks from 95.65% to 96.20%. This study provides a cost-effective solution for generating diverse, controllable, and realistic food data to advance smart food systems.
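
The reported accuracy gain is easier to judge as a change in error rate. The short calculation below uses only the two figures quoted in the abstract; the variable names are illustrative, and nothing here depends on which classifier or dataset produced those numbers.

```python
baseline_acc = 95.65   # accuracy (%) without synthetic augmentation
augmented_acc = 96.20  # accuracy (%) with synthetic augmentation

abs_gain_pp = augmented_acc - baseline_acc
baseline_err = 100.0 - baseline_acc
augmented_err = 100.0 - augmented_acc
rel_err_reduction = (baseline_err - augmented_err) / baseline_err

print(f"absolute gain: {abs_gain_pp:.2f} percentage points")      # 0.55
print(f"error rate: {baseline_err:.2f}% -> {augmented_err:.2f}%")  # 4.35% -> 3.80%
print(f"relative error reduction: {rel_err_reduction:.1%}")       # 12.6%
```

A 0.55-point gain is small in absolute terms, but it removes roughly one eighth of the remaining classification errors.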

## Figures

The complete paper contains 6 figures with captions: https://tomesphere.com/paper/PMC12896454/full.md

---
Source: https://tomesphere.com/paper/PMC12896454