Universal Multimodal Representation for Language Understanding

Zhuosheng Zhang; Kehai Chen; Rui Wang; Masao Utiyama; Eiichiro Sumita; Zuchao Li; Hai Zhao

arXiv:2301.03344·cs.CL·January 10, 2023

TL;DR

This paper introduces a universal multimodal representation learning approach that brings visual signals into NLP tasks, improving performance across a range of language tasks without requiring large-scale bilingual sentence-image datasets.

Contribution

The work presents a flexible retrieval and fusion method for visual information in NLP, enabling effective multimodal learning without extensive annotated corpora.

Findings

01. Improves performance on translation, inference, and similarity tasks.
02. Visual signals enrich textual content representations.
03. Method is effective across multiple languages and tasks.

Abstract

Representation learning is the foundation of natural language processing (NLP). This work presents new methods to employ visual information as assistant signals to general NLP tasks. For each sentence, we first retrieve a flexible number of images from either a light topic-image lookup table extracted over the existing sentence-image pairs or a shared cross-modal embedding space that is pre-trained on off-the-shelf text-image pairs. Then, the text and images are encoded by a Transformer encoder and a convolutional neural network, respectively. The two sequences of representations are further fused by an attention layer for the interaction of the two modalities. In this study, the retrieval process is controllable and flexible. The universal visual representation overcomes the lack of large-scale bilingual sentence-image pairs. Our method can be easily applied to text-only tasks without…
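
The retrieve-then-fuse pipeline described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the function names (`build_lookup`, `retrieve`, `attend`), the stopword list, the ranking by shared topic words, and the residual sum used to combine text and image context are all assumptions made for illustration, and the toy vectors stand in for the outputs of the Transformer encoder and the CNN.

```python
import math
from collections import defaultdict

# Hypothetical stopword list; the real topic-word filter is not given in the abstract.
STOPWORDS = {"a", "an", "the", "on", "in", "of", "is", "and"}


def build_lookup(pairs):
    """Map each content word (topic) to the ids of images whose sentence contains it."""
    table = defaultdict(list)
    for sentence, image_id in pairs:
        for word in set(sentence.lower().split()) - STOPWORDS:
            if image_id not in table[word]:
                table[word].append(image_id)
    return table


def retrieve(sentence, table, max_images=5):
    """Return up to max_images image ids, ranked by number of shared topic words."""
    counts = {}
    for word in set(sentence.lower().split()) - STOPWORDS:
        for image_id in table.get(word, []):
            counts[image_id] = counts.get(image_id, 0) + 1
    # Sort by descending overlap, then by id so ties are deterministic.
    ranked = sorted(counts, key=lambda i: (-counts[i], i))
    return ranked[:max_images]


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attend(text_vecs, image_vecs):
    """Single-head dot-product attention: each text vector queries the image vectors.

    The attended image context is added to the text vector (an assumed,
    simplified fusion step).
    """
    d = len(image_vecs[0])
    fused = []
    for q in text_vecs:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in image_vecs]
        weights = softmax(scores)
        context = [sum(w * v[j] for w, v in zip(weights, image_vecs)) for j in range(d)]
        fused.append([t + c for t, c in zip(q, context)])
    return fused


if __name__ == "__main__":
    pairs = [
        ("a dog runs on the grass", "img1"),
        ("a dog catches a ball", "img2"),
        ("a cat sleeps on the sofa", "img3"),
    ]
    table = build_lookup(pairs)
    print(retrieve("the dog chases a ball", table, max_images=2))
    print(attend([[1.0, 0.0]], [[1.0, 0.0], [1.0, 0.0]]))
```

The `max_images` parameter reflects the "flexible number of images" in the abstract: a text-only sentence retrieves whatever images share its topic words, so no sentence-aligned image annotation is needed at inference time.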

Taxonomy

Methods: Multi-Head Attention · Attention Is All You Need · Position-Wise Feed-Forward Layer · Adam · Byte Pair Encoding · Linear Layer · Dropout · Softmax · Residual Connection · Label Smoothing