Learning Text-Image Joint Embedding for Efficient Cross-Modal Retrieval with Deep Feature Engineering

Zhongwei Xie; Ling Liu; Yanzhao Wu; Luo Zhong; Lin Li

arXiv:2110.11592·cs.CV·October 25, 2021

TL;DR

This paper presents a two-phase deep feature engineering framework for learning joint text-image embeddings, improving cross-modal retrieval performance by combining semantic features and advanced deep learning models.

Contribution

It introduces a novel two-phase approach separating data preprocessing from embedding training, utilizing deep NLP and image features for enhanced semantic alignment.

Findings

1. Outperforms state-of-the-art methods on the Recipe1M dataset
2. Effective use of deep NLP models and image features improves semantic alignment
3. Significant accuracy gains in cross-modal retrieval tasks

Abstract

This paper introduces a two-phase deep feature engineering framework for efficient learning of semantics-enhanced joint embedding, which clearly separates the deep feature engineering in data preprocessing from training the text-image joint embedding model. We use the Recipe1M dataset for the technical description and empirical validation. In preprocessing, we perform deep feature engineering by combining deep feature engineering with semantic context features derived from raw text-image input data. We leverage LSTM to identify key terms; deep NLP models from the BERT family, TextRank, or TF-IDF to produce ranking scores for the key terms; and word2vec to generate the vector representation of each key term. We leverage wideResNet50 and word2vec to extract and encode the image category semantics of food images to help semantic alignment of the learned recipe and image embeddings…
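As a rough illustration of the key-term ranking step the abstract mentions, the sketch below scores terms with plain TF-IDF and keeps the top few per recipe. It is a minimal sketch under stated assumptions, not the authors' pipeline: the function name `tfidf_key_terms`, the unsmoothed `log(N / df)` weighting, the alphabetical tie-breaking, and the toy recipes are all hypothetical, and the LSTM, BERT-family, TextRank, and word2vec stages are left out.

```python
import math
from collections import Counter


def tfidf_key_terms(docs, top_k=3):
    """Rank the terms of each tokenized document by TF-IDF and keep the top_k.

    Hypothetical helper for illustration; the weighting scheme is an assumption.
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    ranked = []
    for doc in docs:
        tf = Counter(doc)
        scores = {
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        }
        # Sort by descending score; break ties alphabetically for determinism.
        top = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:top_k]
        ranked.append(top)
    return ranked


# Toy, made-up token lists (not taken from Recipe1M).
recipes = [
    ["mix", "flour", "sugar", "butter", "bake"],
    ["mix", "tomato", "basil", "pasta", "boil"],
    ["mix", "flour", "egg", "milk", "fry"],
]
key_terms = tfidf_key_terms(recipes, top_k=2)
```

A term that occurs in every recipe, such as "mix" here, gets a score of zero and drops out of the ranking, which is the property that makes TF-IDF usable as a key-term filter. In a fuller pipeline each selected term would then be mapped to a vector, for example with a pretrained word2vec model.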

Code & Models

Repositories

git-disl/seje (PyTorch, official)

Taxonomy

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Layer Normalization · Softmax · Residual Connection · WordPiece · Dense Connections · Tanh Activation · Linear Warmup With Linear Decay