TL;DR
This paper introduces a dual-path convolutional network trained with an instance loss for image-text embedding. By better capturing the intra-modal data distribution, it improves retrieval accuracy, especially in language-based person retrieval.
Contribution
It proposes a novel instance loss for intra-modal data distribution modeling and develops an end-to-end dual-path network for image-text embedding.
Findings
The instance loss provides a better weight initialization for the ranking loss, which leads to more discriminative embeddings.
The method achieves state-of-the-art results in language-based person retrieval.
It reaches competitive accuracy on the Flickr30k and MSCOCO datasets.
Abstract
Matching images and sentences demands a fine understanding of both modalities. In this paper, we propose a new system to discriminatively embed the image and text to a shared visual-textual space. In this field, most existing works apply the ranking loss to pull the positive image/text pairs close and push the negative pairs apart from each other. However, directly deploying the ranking loss is hard for network learning, since it starts from the two heterogeneous features to build the inter-modal relationship. To address this problem, we propose the instance loss, which explicitly considers the intra-modal data distribution. It is based on an unsupervised assumption that each image/text group can be viewed as a class, so the network can learn the fine granularity from every image/text group. The experiment shows that the instance loss offers better weight initialization for the ranking…
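The two losses named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that the instance loss is a softmax cross-entropy in which each image/text group has its own class index, that image and text embeddings are scored by the same classifier weights, and that the ranking loss is a triplet hinge on cosine similarity. The function names, the margin of 0.2, and the toy vectors are all hypothetical.

```python
import math


def softmax_cross_entropy(logits, label):
    # Numerically stable -log(softmax(logits)[label]).
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[label]


def instance_loss(embedding, class_weights, group_id):
    # Treat each image/text group as its own class (assumption: one weight
    # row per group, and the same rows score both modalities).
    logits = [sum(w * x for w, x in zip(row, embedding)) for row in class_weights]
    return softmax_cross_entropy(logits, group_id)


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def ranking_loss(anchor, positive, negative, margin=0.2):
    # Hinge on the similarity gap: zero once the positive pair is more
    # similar than the negative pair by at least `margin` (hypothetical value).
    return max(0.0, margin - (cosine(anchor, positive) - cosine(anchor, negative)))


# Toy setup: two image/text groups, 2-D embeddings (hypothetical values).
class_weights = [[1.0, 0.0], [0.0, 1.0]]
image_emb = [2.0, 0.0]   # image from group 0
text_emb = [1.5, 0.5]    # sentence describing the same image

# The instance loss is applied to each modality separately.
loss_img = instance_loss(image_emb, class_weights, 0)
loss_txt = instance_loss(text_emb, class_weights, 0)

# The ranking loss compares across modalities.
loss_rank = ranking_loss(image_emb, text_emb, [0.0, 1.0])
```

In this sketch the instance loss never compares an image with a sentence directly. Each modality is only asked to identify its own group, which is one way to read the abstract's claim that it models the intra-modal distribution before the ranking loss builds the inter-modal relationship.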
