Feature Aggregation in Zero-Shot Cross-Lingual Transfer Using Multilingual BERT
Beiduo Chen, Wu Guo, Quan Liu, Kun Tao

TL;DR
This paper introduces a feature aggregation method that combines information from multiple layers of multilingual BERT to improve zero-shot cross-lingual transfer tasks, demonstrating performance gains on several benchmarks.
Contribution
It proposes an attention-based feature aggregation module that leverages lower layers of mBERT, enhancing cross-lingual task performance beyond the last layer's output.
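The general idea of fusing layers with attention can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' module: the summary above does not specify its design. The function names are hypothetical, the last layer's vector is assumed to act as the query, and plain scaled dot-product attention with no trainable parameters stands in for whatever learned projections the paper's module uses.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def aggregate_layers(layer_vectors):
    """Fuse one token's per-layer vectors into a single vector.

    layer_vectors: list of equal-length float lists, ordered from the
    lowest layer to the last layer.

    Assumption (not from the paper): the last layer's vector is the query,
    and each layer is scored against it by scaled dot product.
    """
    query = layer_vectors[-1]
    dim = len(query)
    # One attention score per layer.
    scores = [
        sum(q * k for q, k in zip(query, vec)) / math.sqrt(dim)
        for vec in layer_vectors
    ]
    weights = softmax(scores)
    # Weighted sum of the layer vectors, computed per dimension.
    fused = [
        sum(w * vec[i] for w, vec in zip(weights, layer_vectors))
        for i in range(dim)
    ]
    return fused, weights


# Toy input: three "layers" of a 2-dimensional representation.
layers = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
fused, weights = aggregate_layers(layers)
```

In this toy input the last layer receives the largest weight because it is most similar to itself, while the lower layers still contribute to the fused vector. A trainable version would typically learn the query and key projections during fine-tuning.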
Findings
Performance improvements on XNLI, PAWS-X, NER, and POS tasks.
Lower layers of mBERT contain useful information for cross-lingual transfer.
Enhanced interpretability of mBERT layers through analysis.
Abstract
Multilingual BERT (mBERT), a language model pre-trained on large multilingual corpora, has impressive zero-shot cross-lingual transfer capabilities and performs surprisingly well on zero-shot POS tagging and Named Entity Recognition (NER), as well as on cross-lingual model transfer. Currently, mainstream methods for cross-lingual downstream tasks use the output of mBERT's last transformer layer as the representation of linguistic information. In this work, we explore how the lower layers of mBERT complement the last transformer layer. We propose a feature aggregation module based on an attention mechanism to fuse the information contained in different layers of mBERT. Experiments are conducted on four zero-shot cross-lingual transfer datasets, and the proposed method obtains performance improvements on key multilingual benchmark tasks XNLI…
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications
Methods: Attention Is All You Need · Linear Layer · Softmax · Multi-Head Attention · Attention Dropout · Layer Normalization · Dropout · Dense Connections · Adam
