Vision and Structured-Language Pretraining for Cross-Modal Food Retrieval

Mustafa Shukor; Nicolas Thome; Matthieu Cord

arXiv:2212.04267 · cs.CV · March 17, 2023

TL;DR

VLPCook introduces a novel approach that adapts vision-language pretraining to structured-text data for improved cross-modal food retrieval and other structured domain tasks, significantly outperforming existing methods.

Contribution

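The paper presents VLPCook, a method that converts existing image-text pairs into image and structured-text pairs, pretrains on them with vision-language objectives adapted to structured data, and then fine-tunes on downstream cooking tasks while enriching the visual encoder with textual context from pretrained foundation models (e.g. CLIP).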

Findings

01. Improves Recall@1 by 3.3 points (absolute) over the previous state of the art on the Recipe1M dataset

02. Outperforms the current state of the art in cross-modal food retrieval

03. Generalizes to structured-text data in the medical domain
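For readers unfamiliar with the metric, Recall@K in cross-modal retrieval is the fraction of queries (e.g. images) whose correct counterpart (e.g. the matching recipe) appears among the top K retrieved candidates. The sketch below is a minimal illustration, not the authors' evaluation code: the function name `recall_at_k`, the convention that query i matches candidate i, and the toy similarity scores are all assumptions made for this example.

```python
def recall_at_k(similarity, k=1):
    """Fraction of queries whose true match ranks within the top k.

    similarity[i][j] is the score between query i (e.g. an image) and
    candidate j (e.g. a recipe). Assumption for this sketch: the correct
    candidate for query i is candidate i.
    """
    hits = 0
    for i, row in enumerate(similarity):
        # Zero-based rank of the true match: how many candidates score strictly higher.
        rank = sum(1 for score in row if score > row[i])
        if rank < k:
            hits += 1
    return hits / len(similarity)


# Toy 3x3 similarity matrix (made-up numbers, not results from the paper).
scores = [
    [0.9, 0.1, 0.2],  # query 0: true match is ranked first
    [0.8, 0.3, 0.1],  # query 1: true match is ranked second
    [0.2, 0.1, 0.7],  # query 2: true match is ranked first
]
r1 = recall_at_k(scores, k=1)  # 2 of 3 queries hit at rank 1
r2 = recall_at_k(scores, k=2)  # all 3 queries hit within the top 2
```

In practice the similarity matrix would come from the dot products of image and recipe embeddings, and benchmark protocols typically average the metric over sampled candidate pools of a fixed size.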

Abstract

Vision-Language Pretraining (VLP) and Foundation models have been the go-to recipe for achieving SoTA performance on general benchmarks. However, leveraging these powerful techniques for more complex vision-language tasks with more structured input data, such as cooking applications, remains little investigated. In this work, we propose to leverage these techniques for structured-text based computational cuisine tasks. Our strategy, dubbed VLPCook, first transforms existing image-text pairs into image and structured-text pairs. This allows us to pretrain our VLPCook model using VLP objectives adapted to the structured data of the resulting datasets, and then to finetune it on downstream computational cooking tasks. During finetuning, we also enrich the visual encoder, leveraging pretrained foundation models (e.g. CLIP) to provide local and global textual context. VLPCook outperforms current SoTA…

Code & Models

Repositories

mshukor/vlpcook · PyTorch · Official

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning