# DKTransformer: An Accurate and Efficient Model for Fine-Grained Food Image Classification

**Authors:** Hongjuan Wang, Chenxi Wang, Xinjun An

PMC · DOI: 10.3390/s26041157 · 2026-02-11

## TL;DR

DKTransformer is a lightweight hybrid of Vision Transformers (ViT) and CNNs for fine-grained food image classification, reaching 92.71% Top-1 accuracy on ETH Food-101 with 47 M parameters and 7.21 G FLOPs.

## Contribution

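Proposes DKTransformer, a hybrid ViT-CNN model for efficient fine-grained food classification built from three modules: a local feature module (LDE) based on depthwise separable convolution, a Multi-Scale Dilated Attention (MSDA) module, and an EfficientKAN block that replaces the conventional feedforward network.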

## Key findings

- DKTransformer achieves 92.71% Top-1 accuracy on ETH Food-101 with 47 M parameters and 7.21 G FLOPs.
- It reaches 90.70% Top-1 accuracy on Vireo-Food-172 and 66.89% on ISIA Food-500, which the authors take as evidence of generalization across food styles and dataset scales.
- The model balances accuracy and efficiency for complex food image classification tasks.
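The Top-1 figures above count a prediction as correct only when the single highest-scoring class matches the ground-truth label. A minimal sketch of that metric follows; the function name, scores, and labels are illustrative assumptions, not data or code from the paper.

```python
def top1_accuracy(scores, labels):
    """Fraction of samples whose highest-scoring class equals the true label.

    scores: list of per-class score lists, one list per sample.
    labels: list of integer class indices, one per sample.
    """
    if len(scores) != len(labels) or not labels:
        raise ValueError("scores and labels must be non-empty and equal in length")
    correct = 0
    for row, label in zip(scores, labels):
        # Index of the largest score is the Top-1 prediction.
        predicted = max(range(len(row)), key=row.__getitem__)
        if predicted == label:
            correct += 1
    return correct / len(labels)


# Hypothetical scores for 4 images over 3 classes (not from the paper).
scores = [
    [0.1, 0.7, 0.2],
    [0.5, 0.3, 0.2],
    [0.2, 0.2, 0.6],
    [0.4, 0.5, 0.1],
]
labels = [1, 0, 2, 0]

acc = top1_accuracy(scores, labels)
print(f"{acc:.2%}")  # 75.00%
```

In this toy case three of the four predictions match, so the metric is 75%.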

## Abstract

With the rapid development of dietary analysis and health computing, food image classification has attracted increasing attention. However, this task remains challenging due to the fine-grained nature of food categories. Different classes are visually similar, whereas samples within the same class exhibit large appearance variations. Existing methods often rely excessively on either global or local features, limiting their effectiveness in complex food scenes. To address these challenges, this paper proposes DKTransformer, a lightweight hybrid architecture that combines Vision Transformers (ViT) and convolutional neural networks (CNNs) for fine-grained food image classification. Specifically, DKTransformer introduces a Local Feature Extraction (LDE) module based on depthwise separable convolution to enhance local detail modeling. Furthermore, a Multi-Scale Dilated Attention (MSDA) module is designed to capture long-range dependencies with reduced computational cost while suppressing background interference. In addition, an Efficient Kolmogorov–Arnold Network (EfficientKAN) is employed to replace the conventional feedforward network, further reducing parameter redundancy. Experimental results on three public food image datasets (ETH Food-101, Vireo-Food-172, and ISIA Food-500) demonstrate the effectiveness of the proposed method. In particular, DKTransformer achieves a Top-1 accuracy of 92.71% on the ETH Food-101 dataset with 47 M parameters and 7.21 G FLOPs. Moreover, DKTransformer attains 90.70% Top-1 accuracy on Vireo-Food-172 and 66.89% on ISIA Food-500, indicating strong generalization across different food styles and dataset scales. These results suggest that DKTransformer achieves a favorable balance between accuracy and efficiency for fine-grained food image classification.
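To illustrate why a module built on depthwise separable convolution can be lightweight, the sketch below compares weight counts for a standard convolution and its depthwise separable counterpart. It uses the general textbook formulas only; the layer sizes are hypothetical and do not describe the actual LDE module, whose configuration is not given here.

```python
def standard_conv_params(c_in, c_out, k):
    """Weights in a standard k x k convolution (bias omitted)."""
    return k * k * c_in * c_out


def depthwise_separable_params(c_in, c_out, k):
    """Weights in a depthwise k x k conv followed by a pointwise 1 x 1 conv."""
    depthwise = k * k * c_in   # one k x k filter per input channel
    pointwise = c_in * c_out   # 1 x 1 convolution that mixes channels
    return depthwise + pointwise


# Hypothetical layer sizes chosen for illustration, not taken from the paper.
c_in, c_out, k = 64, 128, 3

std = standard_conv_params(c_in, c_out, k)
sep = depthwise_separable_params(c_in, c_out, k)
ratio = sep / std  # equals 1/c_out + 1/k**2

print(std, sep, round(ratio, 3))  # 73728 8768 0.119
```

For these assumed sizes the separable form needs roughly 12% of the weights of the standard convolution, since the ratio reduces to 1/c_out + 1/k².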

## Full-text entities

- **Diseases:** diabetic retinopathy (MESH:D003930), injury to (MESH:D014947), brain tumor (MESH:D001932), skin cancer (MESH:D012878)
- **Species:** Homo sapiens (human, species) [taxon 9606]

## Figures

9 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12944211/full.md

---
Source: https://tomesphere.com/paper/PMC12944211