Semi-supervised Visual Feature Integration for Pre-trained Language Models
Lisai Zhang, Qingcai Chen, Dongfang Li, Buzhou Tang

TL;DR
This paper introduces a semi-supervised visual feature integration method for pre-trained language models that improves performance on natural language understanding tasks without needing an aligned image for each sentence.
Contribution
The proposed framework allows visual features to be integrated into language models without requiring aligned image-sentence pairs, using a visualization and fusion mechanism.
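As a rough illustration of what a "visualization and fusion" step could look like, the sketch below retrieves image features from an image database for a sentence vector and then adds them to the text representation through a gate. This is a minimal sketch under stated assumptions, not the authors' implementation: the cosine-similarity retrieval, the softmax weighting, the gated residual fusion, and every name (`retrieve_visual_features`, `fuse`, `image_keys`, `image_feats`, `w_gate`) are hypothetical, and the text and visual vectors are assumed to share one dimension.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors to (approximately) unit length along `axis`."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def retrieve_visual_features(sentence_vec, image_keys, image_feats, top_k=2):
    """Assumed 'visualization' step: look up the top_k database images whose
    keys are most similar to the sentence vector and return a
    similarity-weighted average of their features."""
    sims = l2_normalize(image_keys) @ l2_normalize(sentence_vec)  # shape (n_images,)
    top = np.argsort(-sims)[:top_k]                               # most similar first
    weights = np.exp(sims[top])
    weights = weights / weights.sum()                             # softmax over top_k
    return weights @ image_feats[top]                             # shape (d,)


def fuse(text_vec, visual_vec, w_gate):
    """Assumed 'fusion' step: a sigmoid gate decides, per dimension, how much
    of the visual vector is added to the text vector. The text encoder itself
    is untouched, so the visual part acts as an external component."""
    gate_in = np.concatenate([text_vec, visual_vec])              # shape (2d,)
    gate = 1.0 / (1.0 + np.exp(-(w_gate @ gate_in)))              # shape (d,)
    return text_vec + gate * visual_vec


# Toy usage with random stand-ins for encoder outputs (d = 4, 5 images).
rng = np.random.default_rng(0)
image_keys = rng.normal(size=(5, 4))    # keys living in the sentence-vector space
image_feats = rng.normal(size=(5, 4))   # visual features of the same 5 images
sentence_vec = rng.normal(size=4)       # would come from the language model
w_gate = rng.normal(size=(4, 8))        # would be learned in practice

visual_vec = retrieve_visual_features(sentence_vec, image_keys, image_feats)
fused_vec = fuse(sentence_vec, visual_vec, w_gate)
```

Because retrieval only needs an image database and a similarity function, a sketch like this never requires an image paired with each individual sentence, which is the property the summary above emphasizes.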
Findings
Improves performance on natural language inference tasks
Enhances reading comprehension accuracy
Operates efficiently with only an image database
Abstract
Integrating visual features has proven useful for natural language understanding tasks. Nevertheless, in most existing multimodal language models, aligning visual and textual data is expensive. In this paper, we propose a novel semi-supervised visual integration framework for pre-trained language models. In the framework, the visual features are obtained through a visualization and fusion mechanism. Its distinguishing features are: 1) the integration is conducted via a semi-supervised approach, which does not require aligned images for every sentence; 2) the visual features are integrated as an external component and can be directly used by pre-trained language models. To verify the efficacy of the proposed framework, we conduct experiments on both natural language inference and reading comprehension tasks. The results demonstrate that our mechanism brings improvement to two…
Taxonomy
Topics: Multimodal Machine Learning Applications · Natural Language Processing Techniques · Topic Modeling
