Universal Multimodal Representation for Language Understanding
Zhuosheng Zhang, Kehai Chen, Rui Wang, Masao Utiyama, Eiichiro Sumita, Zuchao Li, Hai Zhao

TL;DR
This paper introduces a universal multimodal representation learning approach that retrieves images for each input sentence and fuses them with the text representation, improving performance on translation, inference, and similarity tasks without requiring large-scale bilingual sentence-image pairs.
Contribution
The work presents a flexible retrieval and fusion method for visual information in NLP, enabling effective multimodal learning without extensive annotated corpora.
Findings
Improves performance on machine translation, natural language inference, and semantic similarity tasks.
Visual signals enrich textual content representations.
Method is effective across multiple languages and tasks.
Abstract
Representation learning is the foundation of natural language processing (NLP). This work presents new methods to employ visual information as auxiliary signals for general NLP tasks. For each sentence, we first retrieve a flexible number of images either from a lightweight topic-image lookup table extracted over the existing sentence-image pairs or a shared cross-modal embedding space that is pre-trained on off-the-shelf text-image pairs. Then, the text and images are encoded by a Transformer encoder and convolutional neural network, respectively. The two sequences of representations are further fused by an attention layer for the interaction of the two modalities. In this study, the retrieval process is controllable and flexible. The universal visual representation overcomes the lack of large-scale bilingual sentence-image pairs. Our method can be easily applied to text-only tasks without…
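The retrieve-then-fuse pipeline described in the abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration, not the authors' implementation: the function names (build_lookup, retrieve, attend), the stopword-based choice of topic words, the ranking by shared topic-word count, and the residual addition after attention are all hypothetical. The toy vectors stand in for the outputs of the Transformer encoder and the convolutional network.

```python
import math
from collections import defaultdict

# Hypothetical stopword list; the real topic-word selection may differ.
STOPWORDS = {"a", "an", "the", "is", "on", "in", "of", "and"}

def topic_words(sentence):
    """Lowercase, split on whitespace, and drop stopwords."""
    return [w for w in sentence.lower().split() if w not in STOPWORDS]

def build_lookup(pairs):
    """Build a topic-image lookup table from (sentence, image_id) pairs."""
    table = defaultdict(set)
    for sentence, image_id in pairs:
        for word in topic_words(sentence):
            table[word].add(image_id)
    return table

def retrieve(sentence, table, max_images=3):
    """Return up to max_images image ids, ranked by shared topic words."""
    counts = defaultdict(int)
    for word in topic_words(sentence):
        for image_id in table.get(word, ()):
            counts[image_id] += 1
    # Sort by descending count, then by id so that ties are deterministic.
    ranked = sorted(counts, key=lambda i: (-counts[i], i))
    return ranked[:max_images]

def softmax(xs):
    """Numerically stable softmax over a non-empty list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attend(text_vecs, image_vecs):
    """Single-head scaled dot-product attention from text to images.

    Each text vector is the query; image vectors serve as keys and values.
    The attended image context is added to the text vector (an assumed,
    simplified fusion step).
    """
    if not image_vecs:
        # No image retrieved: fall back to the text representation alone.
        return [list(q) for q in text_vecs]
    d = len(text_vecs[0])
    fused = []
    for q in text_vecs:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d)
                  for k in image_vecs]
        weights = softmax(scores)
        context = [sum(w * v[j] for w, v in zip(weights, image_vecs))
                   for j in range(d)]
        fused.append([qj + cj for qj, cj in zip(q, context)])
    return fused

# Toy sentence-image pairs standing in for an existing captioned corpus.
pairs = [
    ("a dog runs on the grass", "img1"),
    ("a dog plays with a ball", "img2"),
    ("a cat sleeps on the sofa", "img3"),
]
table = build_lookup(pairs)
retrieved = retrieve("the dog runs fast", table)  # → ['img1', 'img2']
```

The max_images argument reflects the "flexible number of images" mentioned in the abstract: a text-only sentence with no matching topic word simply retrieves nothing, and attend then returns the text representation unchanged.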
Taxonomy
Methods: Multi-Head Attention · Attention Is All You Need · Position-Wise Feed-Forward Layer · Adam · Byte Pair Encoding · Linear Layer · Dropout · Softmax · Residual Connection · Label Smoothing
