TL;DR
This paper introduces MSJE, a multi-modal joint embedding method that leverages TFIDF features and LSTM networks to improve cross-modal retrieval of recipes and images, outperforming existing methods.
Contribution
The paper proposes a novel multi-modal embedding approach that integrates TFIDF features with LSTM-based sequence modeling for better recipe-image retrieval.
Findings
MSJE outperforms state-of-the-art methods on the Recipe1M dataset.
TFIDF features enhance the semantic understanding of recipes.
Combining TFIDF with sequence features improves retrieval accuracy.
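As a rough illustration of what combining TFIDF with sequence features can look like, the sketch below pools per-token vectors using TFIDF scores as weights and then ranks candidate image embeddings by cosine similarity. This is a minimal sketch under stated assumptions, not the paper's mechanism: the weighted-average pooling, the 2-d toy vectors standing in for LSTM hidden states, and the names `weighted_pool`, `cosine`, `img_a`, and `img_b` are all hypothetical.

```python
import math

def weighted_pool(token_vectors, weights):
    """Average per-token vectors, weighting each by its score (e.g. TF-IDF)."""
    total = sum(weights)
    dim = len(token_vectors[0])
    return [
        sum(w * vec[i] for vec, w in zip(token_vectors, weights)) / total
        for i in range(dim)
    ]

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical per-token features standing in for LSTM hidden states.
token_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
# Hypothetical TF-IDF scores for the same three tokens.
tfidf_weights = [0.6, 0.1, 0.3]

recipe_embedding = weighted_pool(token_vectors, tfidf_weights)

# Hypothetical image embeddings assumed to live in the same space.
image_embeddings = {"img_a": [1.0, 0.3], "img_b": [0.1, 1.0]}
ranked = sorted(
    image_embeddings,
    key=lambda name: cosine(recipe_embedding, image_embeddings[name]),
    reverse=True,
)
print(recipe_embedding, ranked)
```

Tokens with higher TFIDF scores pull the pooled recipe vector toward their own features, which is one simple way a key term could dominate the final embedding used for retrieval.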
Abstract
It is widely acknowledged that learning joint embeddings of recipes with images is challenging due to the diverse composition and deformation of ingredients in cooking procedures. We present a Multi-modal Semantics enhanced Joint Embedding approach (MSJE) for learning a common feature space between the two modalities (text and image), with the ultimate goal of providing high-performance cross-modal retrieval services. Our MSJE approach has three unique features. First, we extract the TFIDF feature from the title, ingredients and cooking instructions of recipes. By determining the significance of word sequences through combining LSTM learned features with their TFIDF features, we encode a recipe into a TFIDF weighted vector for capturing significant key terms and how such key terms are used in the corresponding cooking instructions. Second, we combine the recipe TFIDF feature with the…
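The TFIDF extraction step described above can be sketched in a few lines. This is only an illustration under assumptions: the toy recipes, the whitespace tokenizer, the helper name `tfidf_vectors`, and the plain `tf * log(N / df)` weighting are hypothetical choices, and the paper may use a different TFIDF variant and preprocessing.

```python
import math
from collections import Counter

# Hypothetical toy recipes; each string concatenates title, ingredients,
# and cooking instructions into a single text.
recipes = [
    "tomato soup tomato onion garlic simmer tomato and onion then blend",
    "garlic bread bread butter garlic spread butter on bread and bake",
    "onion omelette egg onion butter fry onion then add egg",
]

def tfidf_vectors(docs):
    """Return (sorted vocabulary, list of term -> TF-IDF dicts)."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        vectors.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return sorted(df), vectors

vocab, vectors = tfidf_vectors(recipes)
# Highest-scoring term per recipe: frequent in it, rare in the others.
top_terms = [max(vec, key=vec.get) for vec in vectors]
print(top_terms)
```

Terms that are frequent within one recipe but rare across the collection receive the highest scores, which is the sense in which TFIDF surfaces significant key terms.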
Taxonomy
Methods: Tanh Activation · Sigmoid Activation · Long Short-Term Memory
