Transformer-based Cross-Modal Recipe Embeddings with Large Batch Training
Jing Yang, Junwen Chen, Keiji Yanai

TL;DR
This paper introduces TNLBT, a transformer-based framework for cross-modal recipe retrieval and image generation, utilizing large batch training and self-supervised learning to outperform existing methods on Recipe1M.
Contribution
The paper presents the first validation of large batch training effectiveness in cross-modal recipe embedding models, combining hierarchical and vision transformers with adversarial learning.
Findings
Outperforms existing methods on Recipe1M in both cross-modal retrieval and image generation
Validates large batch training benefits in cross-modal embedding tasks
Integrates self-supervised learning for recipe text understanding
Abstract
In this paper, we present a cross-modal recipe retrieval framework, Transformer-based Network for Large Batch Training (TNLBT), which is inspired by ACME (Adversarial Cross-Modal Embedding) and H-T (Hierarchical Transformer). TNLBT aims to accomplish retrieval tasks while generating images from recipe embeddings. We apply the Hierarchical Transformer-based recipe text encoder, the Vision Transformer (ViT)-based recipe image encoder, and an adversarial network architecture to enable better cross-modal embedding learning for recipe texts and images. In addition, we use self-supervised learning to exploit the rich information in the recipe texts having no corresponding images. Since contrastive learning could benefit from a larger batch size according to the recent literature on self-supervised learning, we adopt a large batch size during training and have validated its effectiveness. In…
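The abstract's point that contrastive learning can benefit from a larger batch can be illustrated with a small sketch. The code below is not the paper's loss or implementation: it assumes a symmetric InfoNCE-style in-batch objective, and the names `info_nce`, `negatives_per_anchor`, and the `temperature` value are hypothetical choices made for illustration. It shows that every other item in the batch serves as a negative, so a batch of size B gives each anchor B - 1 negatives.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (zero vectors are left unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def info_nce(recipe_embs, image_embs, temperature=0.1):
    """Symmetric in-batch contrastive loss (an assumed InfoNCE-style form).

    recipe_embs[i] and image_embs[i] are treated as a matching pair;
    every other item in the batch acts as a negative for that pair.
    """
    r = [l2_normalize(v) for v in recipe_embs]
    m = [l2_normalize(v) for v in image_embs]
    n = len(r)

    # sim[i][j]: cosine similarity of recipe i and image j, scaled by temperature.
    sim = [
        [sum(a * b for a, b in zip(r[i], m[j])) / temperature for j in range(n)]
        for i in range(n)
    ]

    def cross_entropy(logits, target):
        # Numerically stable log-sum-exp minus the logit of the true pair.
        mx = max(logits)
        log_denominator = mx + math.log(sum(math.exp(x - mx) for x in logits))
        return log_denominator - logits[target]

    # Recipe-to-image direction uses rows, image-to-recipe uses columns.
    recipe_to_image = sum(cross_entropy(sim[i], i) for i in range(n)) / n
    columns = [[sim[i][j] for i in range(n)] for j in range(n)]
    image_to_recipe = sum(cross_entropy(columns[j], j) for j in range(n)) / n
    return 0.5 * (recipe_to_image + image_to_recipe)


def negatives_per_anchor(batch_size):
    """Each anchor is contrasted against all other items in the batch."""
    return batch_size - 1
```

Under this assumed objective, growing the batch enlarges the softmax denominator with more negatives per step, which is one common explanation for why such losses tend to benefit from large batches.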
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning
Methods: Contrastive Learning
