Polysemous Visual-Semantic Embedding for Cross-Modal Retrieval
Yale Song, Mohammad Soleymani

TL;DR
This paper introduces PIE-Nets, a visual-semantic embedding approach that computes multiple representations of each polysemous instance to improve cross-modal retrieval for images and videos. It also introduces a new dataset for video-text tasks.
Contribution
The paper proposes a multi-head self-attention based embedding method for polysemous instances and introduces a new large-scale video-text dataset for retrieval tasks.
Findings
PIE-Nets outperform existing methods on image-text retrieval.
Effective handling of polysemous instances improves retrieval accuracy.
New dataset enables research in video-text retrieval.
Abstract
Visual-semantic embedding aims to find a shared latent space where related visual and textual instances are close to each other. Most current methods learn injective embedding functions that map an instance to a single point in the shared space. Unfortunately, injective embedding cannot effectively handle polysemous instances with multiple possible meanings; at best, it would find an average representation of different meanings. This hinders its use in real-world scenarios where individual instances and their cross-modal associations are often ambiguous. In this work, we introduce Polysemous Instance Embedding Networks (PIE-Nets) that compute multiple and diverse representations of an instance by combining global context with locally-guided features via multi-head self-attention and residual learning. To learn visual-semantic embedding, we tie up two PIE-Nets and optimize them jointly…
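The idea of computing several embeddings per instance by combining a global feature with attention-pooled local features can be sketched roughly as follows. This is a minimal pure-Python illustration under stated assumptions, not the authors' implementation: the weight names `W1`, `W2`, `W3`, the tanh-based attention scoring, the L2 normalization, and the `best_pair_similarity` matching rule are all hypothetical choices made for the example.

```python
import math
import random


def matvec(M, v):
    """Multiply matrix M (list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    total = sum(es)
    return [e / total for e in es]


def l2_normalize(v):
    """Scale v to unit length (left unchanged if its norm is zero)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def pie_embed(global_feat, local_feats, W1, W2, W3):
    """Return K embeddings of size D, one per attention head.

    global_feat: length-D vector (e.g. a pooled image feature).
    local_feats: N vectors of length D_local (e.g. region or word features).
    Assumed (hypothetical) weight shapes: W1 is H x D_local, W2 is K x H,
    W3 is D x D_local.
    """
    # Score every local feature under each of the K heads.
    hidden = [[math.tanh(h) for h in matvec(W1, x)] for x in local_feats]
    scores = [matvec(W2, h) for h in hidden]  # N rows, K columns
    d_local = len(local_feats[0])
    embeddings = []
    for k in range(len(W2)):
        # Attention weights of head k over the N local features (sum to 1).
        attn = softmax([s[k] for s in scores])
        # Attention-weighted sum of the local features.
        attended = [
            sum(a * x[j] for a, x in zip(attn, local_feats))
            for j in range(d_local)
        ]
        # Project into the shared space, then add the global feature
        # (the residual connection).
        local_ctx = matvec(W3, attended)
        combined = [g + c for g, c in zip(global_feat, local_ctx)]
        embeddings.append(l2_normalize(combined))
    return embeddings


def best_pair_similarity(embs_a, embs_b):
    """Highest cosine similarity over all pairs of unit-length embeddings."""
    return max(
        sum(x * y for x, y in zip(a, b)) for a in embs_a for b in embs_b
    )


# Toy usage with random weights and features (illustrative sizes only).
rng = random.Random(0)


def rand_matrix(rows, cols):
    return [[rng.gauss(0, 1) for _ in range(cols)] for _ in range(rows)]


N, D_LOCAL, H, K, D = 5, 8, 6, 3, 4
W1, W2, W3 = rand_matrix(H, D_LOCAL), rand_matrix(K, H), rand_matrix(D, D_LOCAL)
global_feat = [rng.gauss(0, 1) for _ in range(D)]
image_embs = pie_embed(global_feat, rand_matrix(N, D_LOCAL), W1, W2, W3)
```

Each head attends to a different mix of local features, so the K outputs can differ while sharing the same global context. Scoring a pair of instances by their best-matching embeddings is one simple way to let any single meaning drive retrieval instead of an average of all meanings.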
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning
