MALM: Mask Augmentation based Local Matching for Food-Recipe Retrieval
Bhanu Prakash Voutharoja, Peng Wang, Lei Wang, Vivienne Guan

TL;DR
This paper introduces MALM, a mask-augmentation-based local matching network for image-to-recipe retrieval. It improves cross-modality representation learning by combining local matching with masked self-distillation, and it outperforms state-of-the-art methods on the Recipe1M dataset.
Contribution
The paper proposes a new local matching framework that uses mask augmentation and self-distillation to learn more generalizable cross-modality representations for food-recipe retrieval.
Findings
Outperforms state-of-the-art methods on the Recipe1M dataset
Effectively locates fine-grained cross-modality correspondences
Enhances generalization through masked self-distillation
Abstract
Image-to-recipe retrieval is a challenging vision-to-language task of significant practical value. The main challenge of the task lies in the ultra-high redundancy in the long recipe and the large variation reflected in both food item combination and food item appearance. A de-facto idea to address this task is to learn a shared feature embedding space in which a food image is aligned better to its paired recipe than other recipes. However, such supervised global matching is prone to supervision collapse, i.e., only partial information that is necessary for distinguishing training pairs can be identified, while other information that is potentially useful in generalization could be lost. To mitigate such a problem, we propose a mask-augmentation-based local matching network (MALM), where an image-text matching module and a masked self-distillation module benefit each other mutually to…
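The global matching baseline that the abstract describes, a shared embedding space in which an image sits closer to its paired recipe than to other recipes, can be illustrated with a small sketch. This is not MALM itself and is not taken from the paper: the bidirectional hinge (triplet) loss, the margin of 0.3, and the function names `cosine`, `triplet_loss`, and `retrieve` are all assumptions chosen for illustration, and the embeddings are toy vectors rather than encoder outputs.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products equal cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    a, b = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a, b))


def triplet_loss(image_embs, recipe_embs, margin=0.3):
    """Bidirectional hinge loss over a batch of paired embeddings.

    Pair i is (image_embs[i], recipe_embs[i]). The loss is zero only when
    every image is more similar to its own recipe than to any other recipe
    by at least `margin`, and likewise for every recipe against the images.
    The margin value is an assumption, not a figure from the paper.
    """
    n = len(image_embs)
    sims = [[cosine(img, rec) for rec in recipe_embs] for img in image_embs]
    total, count = 0.0, 0
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            # Image i compared against a non-matching recipe j.
            total += max(0.0, margin - sims[i][i] + sims[i][j])
            # Recipe i compared against a non-matching image j.
            total += max(0.0, margin - sims[i][i] + sims[j][i])
            count += 2
    return total / count if count else 0.0


def retrieve(image_emb, recipe_embs):
    """Return recipe indices ranked from most to least similar to the image."""
    scores = [cosine(image_emb, rec) for rec in recipe_embs]
    return sorted(range(len(recipe_embs)), key=lambda k: scores[k], reverse=True)


# Toy batch: two images and their paired recipes in a 2-D embedding space.
images = [[1.0, 0.0], [0.0, 1.0]]
recipes = [[0.9, 0.1], [0.1, 0.9]]

aligned_loss = triplet_loss(images, recipes)
swapped_loss = triplet_loss(images, recipes[::-1])
ranking = retrieve(images[0], recipes)
```

With well-aligned pairs the loss is zero, while swapping the recipes produces a large penalty. A loss of this kind only rewards whatever separates the training pairs, which is the setting in which the abstract says supervision collapse can occur.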
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning
