Transformer-based Cross-Modal Recipe Embeddings with Large Batch Training

Jing Yang; Junwen Chen; Keiji Yanai

arXiv:2205.04948·cs.CV·December 19, 2022

TL;DR

This paper introduces TNLBT, a transformer-based framework for cross-modal recipe retrieval and image generation, utilizing large batch training and self-supervised learning to outperform existing methods on Recipe1M.

Contribution

The paper presents the first validation of large batch training effectiveness in cross-modal recipe embedding models, combining hierarchical and vision transformers with adversarial learning.

Findings

1. Outperforms prior state-of-the-art methods on Recipe1M in both cross-modal retrieval and image generation.

2. Validates that large batch training benefits cross-modal embedding learning.

3. Applies self-supervised learning to exploit recipe texts that have no corresponding images.

Abstract

In this paper, we present a cross-modal recipe retrieval framework, Transformer-based Network for Large Batch Training (TNLBT), which is inspired by ACME (Adversarial Cross-Modal Embedding) and H-T (Hierarchical Transformer). TNLBT aims to accomplish retrieval tasks while generating images from recipe embeddings. We apply the Hierarchical Transformer-based recipe text encoder, the Vision Transformer (ViT)-based recipe image encoder, and an adversarial network architecture to enable better cross-modal embedding learning for recipe texts and images. In addition, we use self-supervised learning to exploit the rich information in recipe texts that have no corresponding images. Since contrastive learning can benefit from a larger batch size according to the recent literature on self-supervised learning, we adopt a large batch size during training and have validated its effectiveness. In…
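The link between batch size and contrastive learning mentioned in the abstract can be illustrated with a minimal sketch. It assumes a generic symmetric InfoNCE-style loss over in-batch negatives, which is not necessarily the loss TNLBT uses; the function names, the temperature value, and the embedding shapes are all hypothetical. The point it shows is that each recipe embedding is contrasted against every other image in the batch, so a batch of size B supplies B - 1 negatives per anchor.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / np.maximum(np.linalg.norm(x, axis=1, keepdims=True), eps)

def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def symmetric_contrastive_loss(recipe_emb, image_emb, temperature=0.1):
    """Illustrative in-batch contrastive loss (an assumption, not the paper's loss).

    Row i of recipe_emb and row i of image_emb form the positive pair;
    every other row in the batch acts as a negative.
    """
    r = l2_normalize(recipe_emb)
    v = l2_normalize(image_emb)
    logits = r @ v.T / temperature          # shape (B, B), positives on the diagonal
    idx = np.arange(len(r))
    loss_r2i = -log_softmax(logits, axis=1)[idx, idx].mean()  # recipe -> image
    loss_i2r = -log_softmax(logits, axis=0)[idx, idx].mean()  # image -> recipe
    return 0.5 * (loss_r2i + loss_i2r)

def negatives_per_anchor(batch_size):
    """Number of in-batch negatives each anchor is contrasted against."""
    return batch_size - 1

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    recipes = rng.normal(size=(8, 16))      # hypothetical recipe embeddings
    images = rng.normal(size=(8, 16))       # hypothetical image embeddings
    print("loss on random pairs:", symmetric_contrastive_loss(recipes, images))
    print("negatives at batch size 512:", negatives_per_anchor(512))
```

Under this formulation, growing the batch makes the similarity matrix larger and each anchor sees more negatives per update, which is one common explanation for why contrastive objectives tend to benefit from larger batches.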

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning

Methods: Contrastive Learning