XDBERT: Distilling Visual Information to BERT from Cross-Modal Systems to Improve Language Understanding

Chan-Jan Hsu; Hung-yi Lee; Yu Tsao

arXiv:2204.07316 · cs.CL · May 4, 2022

Open Access

TL;DR

XDBERT is a novel approach that distills visual information from multimodal transformers into BERT, enhancing language understanding performance on multiple benchmarks by leveraging visual grounding.

Contribution

The paper introduces XDBERT, a framework that distills visual information from a pretrained multimodal transformer into pretrained BERT, improving its performance on natural language understanding tasks.

Findings

01. After finetuning, XDBERT outperforms pretrained BERT on GLUE, SWAG, and readability benchmarks.

02. Analysis of XDBERT's performance on GLUE suggests that the improvement is likely visually grounded.

03. The method requires only a small number of extra adaptation steps before finetuning.

Abstract

Transformer-based models are widely used in natural language understanding (NLU) tasks, and multimodal transformers have been effective in visual-language tasks. This study explores distilling visual information from pretrained multimodal transformers to pretrained language encoders. Our framework is inspired by cross-modal encoders' success in visual-language tasks, but we alter the learning objective to cater to the language-heavy characteristics of NLU. After a small number of extra adaptation steps followed by finetuning, the proposed XDBERT (cross-modal distilled BERT) outperforms pretrained BERT on the General Language Understanding Evaluation (GLUE) benchmark, the Situations With Adversarial Generations (SWAG) benchmark, and readability benchmarks. We analyze the performance of XDBERT on GLUE to show that the improvement is likely visually grounded.

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Natural Language Processing Techniques