Target-Oriented Deformation of Visual-Semantic Embedding Space
Takashi Matsubara

TL;DR
This paper introduces TOD-Net, a post-processing module that deforms the embedding space to improve cross-modal retrieval by emphasizing entity-specific concepts and handling diversity.
Contribution
The paper presents TOD-Net, a novel deformation network that enhances existing multimodal embeddings for better cross-modal retrieval performance.
Findings
Achieves state-of-the-art results on the MSCOCO dataset.
Effectively emphasizes entity-specific concepts.
Handles higher diversity in retrieval tasks.
Abstract
Multimodal embedding is a crucial research topic for cross-modal understanding, data mining, and translation. Many studies have attempted to extract representations from given entities and align them in a shared embedding space. However, because entities in different modalities exhibit different abstraction levels and modality-specific information, it is insufficient to embed related entities close to each other. In this study, we propose the Target-Oriented Deformation Network (TOD-Net), a novel module that continuously deforms the embedding space into a new space under a given condition, thereby adjusting similarities between entities. Unlike methods based on cross-modal attention, TOD-Net is a post-processing step applied to the embedding space learned by existing embedding systems, and it improves their retrieval performance. In particular, when combined with cutting-edge models, TOD-Net…
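As a rough illustration of what "deforming the embedding space under a given condition" could look like, the sketch below applies an element-wise affine map whose scale and shift are computed from a condition embedding, then scores candidates by cosine similarity in the deformed space. This is not the architecture from the paper: the functions `deform` and `conditioned_similarity`, the weight matrices `W_scale` and `W_shift`, and the choice of the query as the condition are all assumptions made for illustration.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length along the given axis."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def deform(x, cond, W_scale, W_shift):
    """Hypothetical conditional deformation (an assumption, not the paper's layer).

    Applies an element-wise affine map to x whose scale and shift depend on
    the condition embedding `cond`. The bounded log-scale keeps the map
    invertible in x for any fixed condition.
    """
    log_scale = np.tanh(cond @ W_scale)
    shift = cond @ W_shift
    return x * np.exp(log_scale) + shift


def conditioned_similarity(query, candidates, W_scale, W_shift):
    """Cosine similarity between a query and candidates after deformation.

    Assumption: the query embedding acts as the condition, and both the
    query and the candidates are mapped into the deformed space.
    """
    q = l2_normalize(deform(query, query, W_scale, W_shift))
    c = l2_normalize(deform(candidates, query, W_scale, W_shift))
    return c @ q


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    dim = 8
    query = rng.normal(size=dim)            # e.g. an image embedding
    candidates = rng.normal(size=(5, dim))  # e.g. caption embeddings
    W_scale = 0.1 * rng.normal(size=(dim, dim))  # would be learned in practice
    W_shift = 0.1 * rng.normal(size=(dim, dim))
    scores = conditioned_similarity(query, candidates, W_scale, W_shift)
    print(np.argsort(-scores))  # candidate indices ranked by similarity
```

With both weight matrices set to zero the deformation is the identity, so the scores reduce to plain cosine similarity in the original embedding space. In that sense a module of this kind can sit on top of a pretrained embedding system as a post-process.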
Taxonomy
Topics: Multimodal Machine Learning Applications · Topic Modeling · Natural Language Processing Techniques
