Vision and Structured-Language Pretraining for Cross-Modal Food Retrieval
Mustafa Shukor, Nicolas Thome, Matthieu Cord

TL;DR
VLPCook adapts vision-language pretraining to image and structured-text pairs, improving cross-modal food retrieval over the current state of the art and transferring to other structured-text domains.
Contribution
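The paper presents VLPCook, a method that converts existing image-text pairs into image and structured-text pairs, pretrains on them with vision-language pretraining objectives adapted to structured data, and, during finetuning, enriches the visual encoder with textual context from pretrained foundation models (e.g. CLIP).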
Findings
Achieves a +3.3% Recall@1 improvement on the Recipe1M dataset
Outperforms current state-of-the-art in cross-modal food retrieval
Demonstrates generalization to medical domain with structured text
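For readers unfamiliar with the metric, Recall@K in cross-modal retrieval is the fraction of queries whose true match appears among the K highest-scoring candidates. The following is a minimal sketch of that computation, not the paper's evaluation code. It assumes that image and recipe embeddings are given as plain lists of floats, that item i in one list is paired with item i in the other, and that similarity is cosine; the function names `cosine` and `recall_at_k` are illustrative.

```python
import math
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def recall_at_k(
    image_embs: List[Sequence[float]],
    recipe_embs: List[Sequence[float]],
    k: int = 1,
) -> float:
    """Image-to-recipe Recall@K.

    Assumes image_embs[i] and recipe_embs[i] form a ground-truth pair.
    """
    hits = 0
    for i, query in enumerate(image_embs):
        sims = [cosine(query, candidate) for candidate in recipe_embs]
        # Rank of the true match = number of candidates scoring strictly higher.
        rank = sum(1 for s in sims if s > sims[i])
        if rank < k:
            hits += 1
    return hits / len(image_embs)


if __name__ == "__main__":
    # Toy embeddings, made up purely for illustration.
    images = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    recipes = [[0.9, 0.1], [0.8, 0.2], [0.5, 0.5]]
    print("Recall@1:", recall_at_k(images, recipes, k=1))
    print("Recall@2:", recall_at_k(images, recipes, k=2))
```

Swapping the two arguments gives the recipe-to-image direction. Evaluation protocols differ in candidate-set size and in how results are averaged, so numbers from a sketch like this are not directly comparable to reported results.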
Abstract
Vision-Language Pretraining (VLP) and Foundation models have been the go-to recipe for achieving SoTA performance on general benchmarks. However, leveraging these powerful techniques for more complex vision-language tasks with more structured input data, such as cooking applications, remains little investigated. In this work, we propose to leverage these techniques for structured-text based computational cuisine tasks. Our strategy, dubbed VLPCook, first transforms existing image-text pairs to image and structured-text pairs. This allows us to pretrain our VLPCook model using VLP objectives adapted to the structured data of the resulting datasets, and then to finetune it on downstream computational cooking tasks. During finetuning, we also enrich the visual encoder, leveraging pretrained foundation models (e.g. CLIP) to provide local and global textual context. VLPCook outperforms current SoTA…
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning
