Beyond Images: Adaptive Fusion of Visual and Textual Data for Food Classification

Prateek Mittal; Puneet Goyal; Joohi Chauhan

arXiv:2308.02562 · cs.CV · August 6, 2025

TL;DR

This paper presents an adaptive multimodal fusion framework for food classification that combines visual and textual data to improve accuracy and robustness, reaching 97.84% accuracy on the UPMC Food-101 dataset and outperforming several state-of-the-art methods.

Contribution

The study introduces a dynamic fusion strategy that adaptively integrates visual and textual features, effectively handling missing or inconsistent modality data in food recognition.

Findings

01. Achieved 97.84% accuracy with multimodal fusion, surpassing state-of-the-art methods.

02. Demonstrated robustness and efficiency of the proposed fusion approach.

03. Validated effectiveness on the UPMC Food-101 dataset.

Abstract

This study introduces a novel multimodal food recognition framework that effectively combines visual and textual modalities to enhance classification accuracy and robustness. The proposed approach employs a dynamic multimodal fusion strategy that adaptively integrates features from unimodal visual inputs and complementary textual metadata. This fusion mechanism is designed to maximize the use of informative content, while mitigating the adverse impact of missing or inconsistent modality data. The framework was rigorously evaluated on the UPMC Food-101 dataset and achieved unimodal classification accuracies of 73.60% for images and 88.84% for text. When both modalities were fused, the model achieved an accuracy of 97.84%, outperforming several state-of-the-art methods. Extensive experimental analysis demonstrated the robustness, adaptability, and computational efficiency of the proposed…
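
The abstract does not spell out the fusion mechanism, so the following is only an illustrative sketch of one common way to fuse two modalities adaptively, not the authors' method. It assumes each modality produces per-class logits, weights each modality by a confidence score derived from the entropy of its prediction, and skips a modality that is missing. The function name `adaptive_fuse` and the entropy-based weighting are assumptions made for this example.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy (natural log) of a probability vector."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def adaptive_fuse(image_logits=None, text_logits=None):
    """Hypothetical confidence-weighted late fusion.

    Each available modality is weighted by 1 - (entropy / max entropy),
    so a near-uniform (uninformative) prediction contributes little.
    A modality passed as None is treated as missing and ignored.
    """
    available = [l for l in (image_logits, text_logits) if l is not None]
    if not available:
        raise ValueError("at least one modality is required")
    num_classes = len(available[0])
    if num_classes < 2 or any(len(l) != num_classes for l in available):
        raise ValueError("modalities must share the same class count (>= 2)")

    probs = [softmax(l) for l in available]
    max_entropy = math.log(num_classes)
    confidences = [max(0.0, 1.0 - entropy(p) / max_entropy) for p in probs]

    total = sum(confidences)
    if total < 1e-12:
        # Every modality is uninformative: fall back to equal weights.
        weights = [1.0 / len(probs)] * len(probs)
    else:
        weights = [c / total for c in confidences]

    # Weighted average of the per-modality class probabilities.
    return [
        sum(w * p[k] for w, p in zip(weights, probs))
        for k in range(num_classes)
    ]


if __name__ == "__main__":
    # The image branch is unsure; the text branch strongly favors class 1.
    fused = adaptive_fuse(image_logits=[0.0, 0.0, 0.0], text_logits=[0.0, 5.0, 0.0])
    print(max(range(len(fused)), key=fused.__getitem__))
```

With only one modality supplied, the function reduces to that modality's softmax output, which is one simple way to degrade gracefully when an image or its text metadata is unavailable.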

Taxonomy

Topics: Culinary Culture and Tourism · Image Retrieval and Classification Techniques · Text and Document Classification Technologies

Methods: Multi-Head Attention · Attention Is All You Need · Depthwise Convolution · Tanh Activation · Pointwise Convolution · Depthwise Separable Convolution · Linear Layer · Adam