Efficient Deep Feature Calibration for Cross-Modal Joint Embedding Learning
Zhongwei Xie, Ling Liu, Lin Li, Luo Zhong

TL;DR
This paper proposes a two-phase deep feature calibration framework for efficient cross-modal text-image embedding, improving semantic alignment and outperforming existing methods on the Recipe1M dataset.
Contribution
It introduces a novel two-phase deep feature calibration approach that separates data preprocessing from joint embedding training, enhancing semantic alignment in cross-modal learning.
Findings
Significant performance improvement over state-of-the-art methods on the Recipe1M dataset.
Effective semantic alignment of recipes and images.
Robustness demonstrated on the Recipe1M dataset.
Abstract
This paper introduces a two-phase deep feature calibration framework for efficient learning of semantics-enhanced text-image cross-modal joint embeddings, which clearly separates deep feature calibration in data preprocessing from training the joint embedding model. We use the Recipe1M dataset for the technical description and empirical validation. In preprocessing, we perform deep feature calibration by combining deep feature engineering with semantic context features derived from raw text-image input data. We leverage an LSTM to identify key terms and NLP methods to produce ranking scores for those key terms before generating the key term feature. We leverage wideResNet50 to extract and encode the image category semantics, which helps the semantic alignment of the learned recipe and image embeddings in the joint latent space. In joint embedding learning, we perform deep feature calibration by…
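The methods listed for this paper include triplet loss, which is a common way to align two modalities in a joint latent space. The following is a minimal sketch of how such a loss could pull a matching recipe and image embedding together and push non-matching pairs apart. It is not the authors' implementation: the cosine distance, the margin of 0.3, the bidirectional sum, the function names, and the toy 3-dimensional vectors are all assumptions chosen for illustration.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def triplet_loss(anchor, positive, negative, margin=0.3):
    """Hinge-style triplet loss: max(0, d(a, p) - d(a, n) + margin).

    The margin value is an assumption, not taken from the paper.
    """
    return max(
        0.0,
        cosine_distance(anchor, positive)
        - cosine_distance(anchor, negative)
        + margin,
    )


def bidirectional_triplet_loss(recipe, image, neg_image, neg_recipe, margin=0.3):
    """Sum of a recipe-to-image term and an image-to-recipe term."""
    return (
        triplet_loss(recipe, image, neg_image, margin)
        + triplet_loss(image, recipe, neg_recipe, margin)
    )


# Toy embeddings (made-up values, for illustration only).
recipe_emb = [1.0, 0.0, 0.0]
image_emb = [1.0, 0.0, 0.0]     # matching image
other_image = [0.0, 1.0, 0.0]   # non-matching image
other_recipe = [0.0, 0.0, 1.0]  # non-matching recipe

loss = bidirectional_triplet_loss(recipe_emb, image_emb, other_image, other_recipe)
```

In this toy case the matching pair is already closer than either negative by more than the margin, so both hinge terms are zero. Swapping the positive and negative roles produces a positive loss, which is the signal a training loop would minimize.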
Taxonomy
Methods: Tanh Activation · Sigmoid Activation · Long Short-Term Memory · Triplet Loss
