TL;DR
This paper investigates how multilingual BERT encodes word-level translation information, revealing that it contains both language-specific and cross-lingual components, which can be extracted with simple methods without fine-tuning.
Contribution
The authors introduce two straightforward methods to extract translation capabilities from mBERT, and identify an empirical language-identity subspace within its representations.
Findings
Most translation information is non-linearly encoded in mBERT.
Some translation information can be recovered with linear tools.
An empirical language-identity subspace exists within mBERT representations.
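To make the idea of a language-identity subspace concrete, here is a minimal sketch on synthetic vectors. It is not the authors' procedure: it assumes the simplest possible case, a one-dimensional subspace estimated as the difference between two language means, and the names `en`, `fr`, `lang_offset`, and `remove_direction` are hypothetical stand-ins, not mBERT outputs or library functions.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, n = 16, 200

# Synthetic stand-ins for contextual vectors (assumption: NOT real mBERT output).
# Each "language" is random content shifted by a shared language offset.
lang_offset = rng.normal(size=dim)
en = rng.normal(size=(n, dim)) + lang_offset   # hypothetical "English" vectors
fr = rng.normal(size=(n, dim)) - lang_offset   # hypothetical "French" vectors

# Estimate a one-dimensional language-identity direction from the mean difference.
direction = en.mean(axis=0) - fr.mean(axis=0)
unit = direction / np.linalg.norm(direction)


def remove_direction(x, u):
    """Project the rows of x onto the orthogonal complement of unit vector u."""
    return x - np.outer(x @ u, u)


en_clean = remove_direction(en, unit)
fr_clean = remove_direction(fr, unit)

# Gap between the two language means along the estimated direction, before and after.
gap_before = float((en @ unit).mean() - (fr @ unit).mean())
gap_after = float((en_clean @ unit).mean() - (fr_clean @ unit).mean())
print(f"gap before: {gap_before:.3f}, gap after: {gap_after:.2e}")
```

After the projection, the two sets of vectors no longer differ along the estimated direction, while every orthogonal component is left unchanged. A real analysis would likely need a subspace of more than one dimension and vectors taken from the model itself.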
Abstract
Recent work has demonstrated that multilingual BERT (mBERT) learns rich cross-lingual representations that allow for transfer across languages. We study the word-level translation information embedded in mBERT and present two simple methods that expose remarkable translation capabilities with no fine-tuning. The results suggest that most of this information is encoded in a non-linear way, while some of it can also be recovered with purely linear tools. As part of our analysis, we test the hypothesis that mBERT learns representations which contain both a language-encoding component and an abstract, cross-lingual component, and explicitly identify an empirical language-identity subspace within mBERT representations.
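As an illustration of what recovering translation information "with purely linear tools" could look like, the sketch below fits a least-squares linear map between paired source and target vectors and then translates by nearest-neighbor retrieval. This is an assumption about one possible linear tool, not a description of the paper's method; the data is synthetic, and `W_true`, `X_train`, `Y_train`, and `translate` are hypothetical names.

```python
import numpy as np

rng = np.random.default_rng(1)
dim, n_train, n_test = 8, 100, 20

# Synthetic paired vectors (assumption: stand-ins for source/target word vectors).
W_true = rng.normal(size=(dim, dim))
X_train = rng.normal(size=(n_train, dim))
Y_train = X_train @ W_true + 0.01 * rng.normal(size=(n_train, dim))
X_test = rng.normal(size=(n_test, dim))
Y_test = X_test @ W_true + 0.01 * rng.normal(size=(n_test, dim))

# Fit W minimizing ||X_train @ W - Y_train||^2 with ordinary least squares.
W, *_ = np.linalg.lstsq(X_train, Y_train, rcond=None)


def translate(x_src, candidates, mapping):
    """Map source vectors linearly, then return the index of the nearest candidate."""
    mapped = x_src @ mapping
    dists = np.linalg.norm(mapped[:, None, :] - candidates[None, :, :], axis=-1)
    return dists.argmin(axis=1)


predicted = translate(X_test, Y_test, W)
accuracy = float((predicted == np.arange(n_test)).mean())
print(f"retrieval accuracy on held-out pairs: {accuracy:.2f}")
```

In this toy setup the relation between the two spaces is linear by construction, so retrieval succeeds. The abstract's claim is that in mBERT only part of the translation information behaves this way.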
Taxonomy
Methods: Linear Layer · mBERT · WordPiece · Adam · Softmax · Layer Normalization · Dense Connections · Multi-Head Attention · Dropout
