An Unsupervised Sampling Approach for Image-Sentence Matching Using Document-Level Structural Information
Zejun Li, Zhongyu Wei, Zhihao Fan, Haijun Shan, Xuanjing Huang

TL;DR
This paper introduces an unsupervised image-sentence matching method that leverages document-level structural information, employing a Transformer-based model to improve alignment and reduce sampling bias.
Contribution
It proposes a novel sampling strategy and a Transformer-based model to better capture intra-document relationships and enhance multimodal representation learning.
Findings
Effective in reducing sampling bias
Improves alignment of images and sentences
Demonstrates superior performance on benchmark datasets
Abstract
In this paper, we focus on the problem of unsupervised image-sentence matching. Existing research uses document-level structural information to sample positive and negative instances for model training. Although this approach achieves positive results, it introduces a sampling bias and fails to distinguish instances with high semantic similarity. To alleviate the bias, we propose a new sampling strategy that selects additional intra-document image-sentence pairs as positive or negative samples. Furthermore, to recognize the complex patterns in intra-document samples, we propose a Transformer-based model that captures fine-grained features and implicitly constructs a graph for each document, where concepts in the document are introduced to bridge the representation learning of images and sentences in the context of that document. Experimental results show the effectiveness of our…
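The intra-document sampling idea in the abstract can be illustrated with a minimal sketch. The abstract does not state the selection rule, so everything here is an assumption: the function name `sample_intra_document_pairs`, the similarity matrix `sim` (imagined as scores from some current model), and the rule that each image's highest-scoring sentences become positives and its lowest-scoring sentences become negatives. It is not the authors' implementation.

```python
from typing import Dict, List, Tuple

Pair = Tuple[int, int]  # (image index, sentence index) within one document


def sample_intra_document_pairs(
    sim: List[List[float]], top_k: int = 1
) -> Dict[str, List[Pair]]:
    """Pick extra positive/negative image-sentence pairs inside one document.

    sim[i][j] is a hypothetical similarity score between image i and
    sentence j of the same document.
    Assumed rule (not from the paper): per image, the top_k best-scoring
    sentences are positives and the top_k worst-scoring are negatives.
    """
    positives: List[Pair] = []
    negatives: List[Pair] = []
    for i, row in enumerate(sim):
        # Sentence indices ordered from most to least similar to image i.
        order = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        # Cap k so that a sentence is never both positive and negative.
        k = min(top_k, len(order) // 2)
        positives.extend((i, j) for j in order[:k])
        negatives.extend((i, j) for j in order[len(order) - k:])
    return {"positive": positives, "negative": negatives}


# One document with two images and three sentences (made-up scores).
doc_sim = [
    [0.9, 0.1, 0.5],
    [0.2, 0.8, 0.3],
]
pairs = sample_intra_document_pairs(doc_sim)
```

In this toy setup, the selected pairs would be added to the usual cross-document samples, so that training also sees hard cases drawn from within a single document.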
Taxonomy
Topics: Multimodal Machine Learning Applications · Topic Modeling · Advanced Image and Video Retrieval Techniques
Methods: Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Softmax · Attention Is All You Need · Byte Pair Encoding · Residual Connection · Layer Normalization · Label Smoothing · Adam
