Is BERT a Cross-Disciplinary Knowledge Learner? A Surprising Finding of Pre-trained Models' Transferability
Wei-Tsung Kao, Hung-Yi Lee

TL;DR
Pre-trained models like BERT, originally designed for text, transfer effectively to non-text token sequence classification tasks, converging faster and performing better than randomly initialized models.
Contribution
This study demonstrates the transferability of text pre-trained models to non-text domains, revealing shared representations and transfer benefits beyond language tasks.
Findings
Models pre-trained on text outperform randomly initialized models on non-text data.
Models pre-trained on text converge faster on non-text tasks.
Representations of text and non-text pre-trained models share non-trivial similarities.
Abstract
This paper investigates whether the power of models pre-trained on text data, such as BERT, can be transferred to general token sequence classification applications. To verify pre-trained models' transferability, we test the pre-trained models on text classification tasks in which the meanings of tokens are mismatched, and on real-world non-text token sequence classification data, including amino acid, DNA, and music. We find that even on non-text data, the models pre-trained on text converge faster and perform better than the randomly initialized models, and perform only slightly worse than the models using task-specific knowledge. We also find that the representations of the text and non-text pre-trained models share non-trivial similarities.
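To make the setup concrete, here is a minimal sketch of how a non-text token sequence such as DNA might be presented to a text pre-trained model: each symbol is assigned an arbitrary ID from the text vocabulary, and the sequence is wrapped in the usual classification and separator tokens. This is an illustration under stated assumptions, not the authors' preprocessing. The special-token IDs and vocabulary size match `bert-base-uncased`, while `FIRST_REGULAR_ID`, `build_symbol_map`, and `encode` are hypothetical names introduced here.

```python
import random

# Assumed values: these match bert-base-uncased, but the source does not
# specify a vocabulary, so treat them as illustrative.
PAD_ID, CLS_ID, SEP_ID = 0, 101, 102
VOCAB_SIZE = 30522
FIRST_REGULAR_ID = 1000  # hypothetical start of the ID range we sample from


def build_symbol_map(symbols, seed=0):
    """Assign each distinct non-text symbol an arbitrary text-vocabulary ID."""
    rng = random.Random(seed)
    unique = sorted(set(symbols))
    # sample() draws without replacement, so every symbol gets a distinct ID.
    ids = rng.sample(range(FIRST_REGULAR_ID, VOCAB_SIZE), len(unique))
    return dict(zip(unique, ids))


def encode(sequence, symbol_map, max_len):
    """Convert a symbol sequence into padded input IDs and an attention mask."""
    # Leave room for the [CLS] and [SEP] positions, truncating if needed.
    body = [symbol_map[s] for s in sequence][: max_len - 2]
    ids = [CLS_ID] + body + [SEP_ID]
    mask = [1] * len(ids)
    pad = max_len - len(ids)
    return ids + [PAD_ID] * pad, mask + [0] * pad


dna_map = build_symbol_map("ACGT")
input_ids, attention_mask = encode("GATTACA", dna_map, max_len=12)
```

The resulting `input_ids` and `attention_mask` have the shape a BERT-style sequence classifier expects. The same mapping would be applied whether the model starts from pre-trained or randomly initialized weights, so the initialization is the only thing that differs between the two.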
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Machine Learning in Bioinformatics
Methods: Linear Layer · Dropout · Attention Is All You Need · Linear Warmup With Linear Decay · Weight Decay · Multi-Head Attention · Dense Connections · Softmax · Layer Normalization
