Is BERT a Cross-Disciplinary Knowledge Learner? A Surprising Finding of Pre-trained Models' Transferability

Wei-Tsung Kao; Hung-Yi Lee

arXiv:2103.07162·cs.CL·April 20, 2022·1 cites

TL;DR

Models pre-trained on text, such as BERT, transfer effectively to non-text token sequence classification tasks, converging faster and performing better than randomly initialized models.

Contribution

This study shows that models pre-trained on text transfer to non-text domains (amino acid, DNA, and music sequences), and that the representations of text and non-text pre-trained models share non-trivial similarities.

Findings

01 Models pre-trained on text outperform randomly initialized models on non-text data.

02 Models pre-trained on text converge faster on non-text tasks.

03 The representations of text and non-text pre-trained models share non-trivial similarities.

Abstract

This paper investigates whether the power of models pre-trained on text data, such as BERT, can be transferred to general token sequence classification applications. To verify the pre-trained models' transferability, we test them on text classification tasks in which the meanings of tokens are mismatched, and on real-world non-text token sequence classification data, including amino acid, DNA, and music sequences. We find that even on non-text data, the models pre-trained on text converge faster and perform better than randomly initialized models, and perform only slightly worse than models that use task-specific knowledge. We also find that the representations of the text and non-text pre-trained models share non-trivial similarities.
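To make the setup in the abstract concrete, the following is a minimal sketch of how a non-text token sequence (here, DNA) could be turned into input ids for a text pre-trained model. It is an illustration under stated assumptions, not the authors' pipeline: the vocabulary size and special-token ids are those of the `bert-base-uncased` vocabulary, while `FIRST_WORD_ID`, `build_symbol_map`, and `encode` are hypothetical names introduced here, and assigning each symbol to an arbitrary vocabulary entry is an assumed mapping scheme.

```python
import random

# Assumed constants: they match the bert-base-uncased vocabulary.
# Other checkpoints may use different values.
VOCAB_SIZE = 30522
PAD_ID, CLS_ID, SEP_ID = 0, 101, 102
FIRST_WORD_ID = 1000  # hypothetical cutoff that skips special and unused entries


def build_symbol_map(symbols, seed=0):
    """Assign each distinct non-text symbol an arbitrary, unique vocabulary id."""
    unique = sorted(set(symbols))
    rng = random.Random(seed)
    ids = rng.sample(range(FIRST_WORD_ID, VOCAB_SIZE), k=len(unique))
    return dict(zip(unique, ids))


def encode(sequence, symbol_map, max_len):
    """Map symbols to ids, add [CLS]/[SEP], then pad to max_len."""
    body = [symbol_map[s] for s in sequence][: max_len - 2]
    ids = [CLS_ID] + body + [SEP_ID]
    attention_mask = [1] * len(ids) + [0] * (max_len - len(ids))
    ids = ids + [PAD_ID] * (max_len - len(ids))
    return ids, attention_mask


dna_map = build_symbol_map("ACGT")
ids, mask = encode("GATTACA", dna_map, max_len=12)
```

The resulting `ids` and `mask` have the shape a BERT-style sequence classifier expects. Running the same inputs through a pre-trained checkpoint and through a randomly initialized model of the same architecture would be one way to set up the comparison the abstract describes.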

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Machine Learning in Bioinformatics

Methods: Linear Layer · Dropout · Attention Is All You Need · Linear Warmup With Linear Decay · Weight Decay · Multi-Head Attention · Dense Connections · Softmax · Layer Normalization