Accurate Word Representations with Universal Visual Guidance
Zhuosheng Zhang, Haojie Yu, Hai Zhao, Rui Wang, Masao Utiyama

TL;DR
This paper introduces a multimodal approach that enhances word representations by integrating visual guidance, leading to improved disambiguation and performance across various natural language understanding and translation tasks.
Contribution
It proposes a novel visual guidance method for word embeddings using a small image dictionary, improving contextual disambiguation in language models.
Findings
Enhanced disambiguation accuracy in word representations.
Improved performance on 12 NLP and translation tasks.
Effective integration of visual and textual information.
Abstract
Word representation is a fundamental component in neural language understanding models. Recently, pre-trained language models (PrLMs) offer a new performant method of contextualized word representations by leveraging the sequence-level context for modeling. Although the PrLMs generally give more accurate contextualized word representations than non-contextualized models do, they are still subject to a sequence of text contexts without diverse hints for word representation from multimodality. This paper thus proposes a visual representation method to explicitly enhance conventional word embedding with multiple-aspect senses from visual guidance. In detail, we build a small-scale word-image dictionary from a multimodal seed dataset where each word corresponds to diverse related images. The texts and paired images are encoded in parallel, followed by an attention layer to integrate the…
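The abstract describes a word-image dictionary lookup followed by an attention layer that integrates the text and image encodings. A minimal sketch of that idea follows, in plain Python on toy vectors. It is not the authors' implementation: the function names (`attend`, `fuse_with_visual_guidance`), the scaled dot-product scoring, and the residual addition used to combine the two modalities are all assumptions made for illustration, and real encoders (a PrLM for text, an image encoder for pictures) are replaced by hand-written vectors.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def attend(query, keys):
    """Scaled dot-product attention of one query vector over key vectors.

    Returns the attention-weighted sum of the keys and the weights.
    """
    scale = math.sqrt(len(query))
    weights = softmax([dot(query, k) / scale for k in keys])
    dim = len(keys[0])
    context = [sum(w * k[i] for w, k in zip(weights, keys)) for i in range(dim)]
    return context, weights


def fuse_with_visual_guidance(tokens, text_vecs, image_dict):
    """Add an image-derived context vector to each token that has images.

    image_dict is a hypothetical word -> list of image feature vectors
    mapping, standing in for the word-image dictionary. Tokens without
    an entry keep their text vector unchanged. The residual addition is
    an assumed fusion rule, not taken from the paper.
    """
    fused = []
    for token, vec in zip(tokens, text_vecs):
        images = image_dict.get(token)
        if not images:
            fused.append(list(vec))
            continue
        context, _ = attend(vec, images)
        fused.append([t + c for t, c in zip(vec, context)])
    return fused


# Toy data: "bank" has two image features, "of" has none.
tokens = ["bank", "of"]
text_vecs = [[1.0, 0.0], [0.0, 1.0]]
image_dict = {"bank": [[1.0, 0.0], [0.0, 1.0]]}
fused = fuse_with_visual_guidance(tokens, text_vecs, image_dict)
```

In this toy setup the contextual vector for "bank" acts as the query, so the image feature closest to the current context receives the larger attention weight. That is one way diverse images per word could contribute sense-specific hints.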
Taxonomy
Topics: Multimodal Machine Learning Applications · Topic Modeling · Natural Language Processing Techniques
