XDBERT: Distilling Visual Information to BERT from Cross-Modal Systems to Improve Language Understanding
Chan-Jan Hsu, Hung-yi Lee, Yu Tsao

TL;DR
XDBERT is a novel approach that distills visual information from multimodal transformers into BERT, enhancing language understanding performance on multiple benchmarks by leveraging visual grounding.
Contribution
The paper introduces XDBERT, a framework that effectively transfers visual knowledge into BERT, improving its performance in natural language understanding tasks.
Findings
XDBERT outperforms pretrained BERT on GLUE, SWAG, and readability benchmarks.
Analysis of XDBERT's performance on GLUE suggests that the improvement is likely visually grounded.
The method requires only a small number of adaptation steps.
Abstract
Transformer-based models are widely used in natural language understanding (NLU) tasks, and multimodal transformers have been effective in visual-language tasks. This study explores distilling visual information from pretrained multimodal transformers to pretrained language encoders. Our framework is inspired by the success of cross-modal encoders in visual-language tasks, but we alter the learning objective to cater to the language-heavy characteristics of NLU. After being trained for a small number of extra adaptation steps and then finetuned, the proposed XDBERT (cross-modal distilled BERT) outperforms pretrained BERT on the General Language Understanding Evaluation (GLUE) benchmark, the Situations With Adversarial Generations (SWAG) benchmark, and readability benchmarks. We analyze the performance of XDBERT on GLUE to show that the improvement is likely visually grounded.
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Natural Language Processing Techniques
