Universal Vision-Language Dense Retrieval: Learning A Unified Representation Space for Multi-Modal Retrieval

Zhenghao Liu; Chenyan Xiong; Yuanhuiyi Lv; Zhiyuan Liu; Ge Yu

arXiv:2209.00179 · cs.IR · February 7, 2023 · 6 citations

TL;DR

This paper introduces UniVL-DR, a unified model for multi-modal retrieval that encodes queries and resources from different modalities into a shared embedding space, achieving state-of-the-art results on the WebQA multi-modal benchmark.
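As a rough illustration of what a shared embedding space makes possible, the sketch below ranks text and image candidates in a single pass by inner product with the query embedding. The vectors are made-up toy values and `search` is a hypothetical helper, not part of the UniVL-DR codebase; in practice the embeddings would come from trained encoders, which are not reproduced here.

```python
from typing import List, Tuple

# (candidate id, modality, embedding). The embeddings are hand-written toy
# values standing in for encoder outputs (an assumption for illustration).
Candidate = Tuple[str, str, List[float]]


def dot(a: List[float], b: List[float]) -> float:
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def search(query_vec: List[float], index: List[Candidate], k: int = 3):
    """Rank every candidate, regardless of modality, by inner product."""
    scored = [(dot(query_vec, emb), cid, modality) for cid, modality, emb in index]
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored[:k]


# One index that mixes text and image candidates.
index: List[Candidate] = [
    ("text-1", "text", [0.9, 0.1, 0.0]),
    ("text-2", "text", [0.1, 0.8, 0.1]),
    ("image-1", "image", [0.8, 0.3, 0.1]),
    ("image-2", "image", [0.0, 0.2, 0.9]),
]

query = [1.0, 0.2, 0.0]
results = search(query, index, k=2)
for score, cid, modality in results:
    print(f"{cid} ({modality}): {score:.2f}")
```

Because all candidates live in one space, no separate per-modality retriever or result-fusion step is needed in this toy setup; a single ranking covers both modalities.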

Contribution

It proposes a universal embedding optimization strategy and an image verbalization method to unify multi-modal retrieval in a single model.

Findings

01

Achieves state-of-the-art results on the WebQA benchmark.

02

Outperforms all retrieval models on the two WebQA subtasks, text-text retrieval and text-image retrieval.

03

Demonstrates feasibility of universal multi-modal search.

Abstract

This paper presents Universal Vision-Language Dense Retrieval (UniVL-DR), which builds a unified model for multi-modal retrieval. UniVL-DR encodes queries and multi-modality resources in a shared embedding space for searching candidates from different modalities. To learn a unified embedding space for multi-modal retrieval, UniVL-DR proposes two techniques: 1) a universal embedding optimization strategy, which contrastively optimizes the embedding space using modality-balanced hard negatives; 2) an image verbalization method, which bridges the modality gap between images and texts in the raw data space. UniVL-DR achieves state-of-the-art results on the multi-modal open-domain question answering benchmark, WebQA, and outperforms all retrieval models on the two subtasks, text-text retrieval and text-image retrieval. It demonstrates that universal multi-modal search is feasible to replace the…
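The contrastive objective with modality-balanced hard negatives can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: it assumes that "modality-balanced" means drawing the same number of hard negatives from the text pool and the image pool, and it uses a standard softmax cross-entropy contrastive loss. The function names, the `temperature` parameter, and all vectors are hypothetical.

```python
import math
import random
from typing import List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def sample_balanced_negatives(text_negs, image_negs, per_modality, rng):
    """Draw the same number of hard negatives from each modality.

    Assumed reading of 'modality-balanced'; the pools are taken as given.
    """
    return rng.sample(text_negs, per_modality) + rng.sample(image_negs, per_modality)


def contrastive_loss(query: Vector, positive: Vector, negatives: List[Vector],
                     temperature: float = 1.0) -> float:
    """Negative log-softmax of the positive's score against all negatives."""
    logits = [dot(query, positive) / temperature]
    logits += [dot(query, neg) / temperature for neg in negatives]
    # Log-sum-exp with max subtraction for numerical stability.
    top = max(logits)
    log_z = top + math.log(sum(math.exp(value - top) for value in logits))
    return log_z - logits[0]


rng = random.Random(0)
query = [1.0, 0.0]
positive = [0.9, 0.1]
# Toy hard-negative pools, one per modality (made-up values).
text_negs = [[0.2, 0.8], [0.1, 0.9], [0.3, 0.7]]
image_negs = [[0.0, 1.0], [0.4, 0.6], [0.2, 0.9]]

negatives = sample_balanced_negatives(text_negs, image_negs, per_modality=2, rng=rng)
loss = contrastive_loss(query, positive, negatives)
print(f"{len(negatives)} negatives, loss = {loss:.4f}")
```

Sampling an equal number of negatives per modality keeps either modality from dominating the softmax denominator in this sketch, which is one plausible way to stop the encoder from learning a blanket preference for text or for images.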

Code & Models

Repositories

openmatch/univl-dr
PyTorch · Official

Videos

Universal Vision-Language Dense Retrieval: Learning A Unified Representation Space for Multi-Modal Retrieval · SlidesLive

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning