Deep Learning Applied to Image and Text Matching
Afroze Ibrahim Baqapuri

TL;DR
This paper presents a deep learning system for bidirectional image and sentence retrieval using CNNs and MLPs, embedding multimodal data into a shared space for similarity comparison, with various textual models tested.
Contribution
The study introduces a simpler yet effective multimodal embedding approach for image-text matching, exploring different textual models and training strategies for improved retrieval performance.
Findings
Comparable performance to recent methods despite simplicity
Training data negative sampling significantly affects results
Different textual models impact retrieval accuracy
Abstract
The ability to describe images with natural language sentences is the hallmark of image and language understanding. Such a system has wide-ranging applications, such as annotating images and using natural sentences to search for images. In this project we focus on the task of bidirectional image retrieval: such a system is capable of retrieving an image based on a sentence (image search) and retrieving a sentence based on an image query (image annotation). We present a system based on a global ranking objective function which uses a combination of convolutional neural networks (CNN) and multilayer perceptrons (MLP). It takes an image-sentence pair and processes the two inputs in different channels, finally embedding them into a common multimodal vector space. These embeddings encode abstract semantic information about the two inputs and can be compared using traditional information retrieval approaches.…
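The two-channel embedding and ranking objective described in the abstract can be illustrated with a minimal NumPy sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the function names (`mlp_embed`, `ranking_loss`, `retrieve`), the single hidden layer, the feature and embedding dimensions, the use of cosine similarity, and the bidirectional max-margin hinge form of the ranking loss, which treats every mismatched pair in a batch as a negative. Random vectors stand in for CNN image features and sentence features.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def mlp_embed(x, w1, b1, w2, b2):
    """Hypothetical channel: one ReLU hidden layer, then a projection into the shared space."""
    hidden = np.maximum(0.0, x @ w1 + b1)
    return l2_normalize(hidden @ w2 + b2)

def ranking_loss(img_emb, txt_emb, margin=0.1):
    """Assumed max-margin ranking loss; row i of each matrix is a matching pair."""
    scores = img_emb @ txt_emb.T              # scores[i, j]: image i vs sentence j
    pos = np.diag(scores)                     # similarities of the true pairs
    # Image as query: every non-matching sentence in the batch is a negative.
    cost_i2t = np.maximum(0.0, margin - pos[:, None] + scores)
    # Sentence as query: every non-matching image in the batch is a negative.
    cost_t2i = np.maximum(0.0, margin - pos[None, :] + scores)
    off_diag = 1.0 - np.eye(scores.shape[0])  # ignore the true pairs themselves
    return float(((cost_i2t + cost_t2i) * off_diag).sum())

def retrieve(query_emb, candidate_embs, k=1):
    """Return indices of the k candidates most similar to the query."""
    sims = candidate_embs @ query_emb
    return np.argsort(-sims)[:k]

# Toy usage with random stand-in features (dimensions are arbitrary).
rng = np.random.default_rng(0)
n_pairs, img_dim, txt_dim, hid_dim, emb_dim = 4, 8, 6, 16, 5
img_feats = rng.normal(size=(n_pairs, img_dim))   # stand-in for CNN outputs
txt_feats = rng.normal(size=(n_pairs, txt_dim))   # stand-in for sentence features

def init(d_in, d_hid, d_out):
    return (rng.normal(size=(d_in, d_hid)) * 0.1, np.zeros(d_hid),
            rng.normal(size=(d_hid, d_out)) * 0.1, np.zeros(d_out))

img_emb = mlp_embed(img_feats, *init(img_dim, hid_dim, emb_dim))
txt_emb = mlp_embed(txt_feats, *init(txt_dim, hid_dim, emb_dim))

loss = ranking_loss(img_emb, txt_emb)
top_images_for_sentence_0 = retrieve(txt_emb[0], img_emb, k=2)   # image search
top_sentences_for_image_0 = retrieve(img_emb[0], txt_emb, k=2)   # image annotation
```

Because both channels map into the same space, the same `retrieve` function serves both directions: a sentence embedding can be ranked against image embeddings and vice versa. Which pairs are used as negatives is a design choice in this sketch; a different sampling scheme would change the loss.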
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning
