VLDeformer: Vision-Language Decomposed Transformer for Fast Cross-Modal Retrieval

Lisai Zhang; Hongfa Wu; Qingcai Chen; Yimeng Deng; Zhonghua Li; Dejiang Kong; Zhao Cao; Joanna Siebert; Yunpeng Han

arXiv:2110.11338·cs.CV·November 29, 2021·1 cites

Open Access

TL;DR

VLDeformer introduces a novel approach to vision-language retrieval by decomposing a transformer into separate stages, significantly boosting efficiency while maintaining high accuracy, making it suitable for real-time cross-modal search engines.

Contribution

The paper proposes VLDeformer, a decomposed transformer architecture that separates cross-modal retrieval into learning and indexing stages, greatly improving efficiency with minimal accuracy loss.
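The practical effect of separating learning from indexing can be illustrated with a minimal sketch. It is not the authors' implementation: the embeddings are hand-written toy vectors standing in for the outputs of separately run image and text encoders, cosine similarity is an assumed matching function, and the names `build_index` and `search` are hypothetical.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def build_index(image_embeddings):
    """Offline step: normalize and store every image embedding once."""
    return {image_id: l2_normalize(vec) for image_id, vec in image_embeddings.items()}


def search(index, text_embedding, top_k=2):
    """Online step: compare one query embedding against the stored index."""
    query = l2_normalize(text_embedding)
    scored = [
        (sum(q * v for q, v in zip(query, vec)), image_id)
        for image_id, vec in index.items()
    ]
    scored.sort(reverse=True)  # highest cosine similarity first
    return [image_id for _, image_id in scored[:top_k]]


# Toy vectors (assumption): in a real system these would come from the encoders.
toy_image_embeddings = {
    "img_dog": [0.9, 0.1, 0.0],
    "img_cat": [0.1, 0.9, 0.0],
    "img_car": [0.0, 0.1, 0.9],
}
index = build_index(toy_image_embeddings)

toy_query_embedding = [0.8, 0.2, 0.1]  # stands in for an encoded text query
results = search(index, toy_query_embedding, top_k=2)
print(results)
```

Because the image side is encoded ahead of time, answering a query in this sketch needs only one text encoding plus cheap vector comparisons, rather than a joint pass over every text-image pair.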

Findings

01 Achieves over 1000x speedup in retrieval tasks.

02 Maintains less than 0.6% recall drop after decomposition.

03 Outperforms state-of-the-art methods on COCO and Flickr30k datasets.

Abstract

Cross-modal retrieval has emerged as one of the most important upgrades for text-only search engines (SE). Recently, vision-language (VL) transformers, which build powerful representations of pairwise text-image inputs via early interaction, have outperformed existing methods on text-image retrieval accuracy. However, when the same paradigm is used for inference, VL transformers are still too inefficient to be deployed in a real cross-modal SE. Inspired by how humans learn and use cross-modal knowledge, this paper presents a novel Vision-Language Decomposed Transformer (VLDeformer), which greatly increases the efficiency of VL transformers while maintaining their outstanding accuracy. The proposed method separates cross-modal retrieval into two stages: the VL transformer learning stage and the VL decomposition stage. The latter stage plays the…
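One way to see why early interaction is costly at inference is to count transformer forward passes. The sketch below is a simple counting argument, not a reproduction of the paper's measurements: the function names and the workload sizes are hypothetical, and it assumes that an early-interaction model scores each text-image pair jointly while a decomposed model encodes each item once.

```python
def early_interaction_passes(num_queries, num_candidates):
    """Each query is scored jointly with every candidate: one pass per pair."""
    return num_queries * num_candidates


def decomposed_passes(num_queries, num_candidates):
    """Candidates are encoded once offline; each query is encoded once online."""
    offline = num_candidates
    online = num_queries
    return offline, online


# Hypothetical workload sizes, chosen only for illustration.
num_queries = 1_000
num_candidates = 100_000

joint = early_interaction_passes(num_queries, num_candidates)
offline, online = decomposed_passes(num_queries, num_candidates)

print(f"early interaction: {joint:,} forward passes at query time")
print(f"decomposed: {offline:,} offline passes, {online:,} at query time")
```

Under these assumptions the query-time cost grows with the product of queries and candidates for early interaction, but only with the number of queries after decomposition. The count ignores the cost of the similarity comparisons themselves, so it should not be read as a prediction of wall-clock speedup.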

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Label Smoothing · Position-Wise Feed-Forward Layer · Adam · Dropout · Layer Normalization · Residual Connection · Absolute Position Encodings