Improving Zero-shot Cross-lingual Transfer between Closely Related Languages by injecting Character-level Noise
Noëmi Aepli, Rico Sennrich

TL;DR
This paper introduces a simple data augmentation method using character-level noise to enhance zero-shot cross-lingual transfer between closely related languages and dialects, improving robustness and performance.
Contribution
The paper proposes a character-level noise augmentation technique that improves cross-lingual transfer between closely related languages, compensating for the fact that embedding-space approaches do not take surface similarity into account.
Findings
Consistent improvements in POS tagging and topic identification tasks
Effective across multiple language families including Finnic, Germanic, and Romance
Demonstrates the usefulness of surface-level noise in transfer learning
Abstract
Cross-lingual transfer between a high-resource language and its dialects or closely related language varieties should be facilitated by their similarity. However, current approaches that operate in the embedding space do not take surface similarity into account. This work presents a simple yet effective strategy to improve cross-lingual transfer between closely related varieties. We propose to augment the data of the high-resource source language with character-level noise to make the model more robust towards spelling variations. Our strategy shows consistent improvements over several languages and tasks: Zero-shot transfer of POS tagging and topic identification between language varieties from the Finnic, West and North Germanic, and Western Romance language branches. Our work provides evidence for the usefulness of simple surface-level noise in improving transfer between language…
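The augmentation described in the abstract can be illustrated with a minimal sketch. The abstract does not specify the noise operations, the noising rate, or the alphabet, so everything concrete here is an assumption made for illustration: the hypothetical function `add_char_noise`, its `noise_rate` parameter, the choice of random insertion, deletion, and replacement as edit operations, and the lowercase ASCII alphabet. It is not the authors' implementation.

```python
import random
import string


def add_char_noise(tokens, noise_rate=0.1, alphabet=string.ascii_lowercase, seed=None):
    """Return a copy of `tokens` in which a fraction of words gets one random character edit.

    Assumptions (not taken from the paper): one edit per selected word,
    edits drawn uniformly from insert / delete / replace, and only purely
    alphabetic tokens are eligible for noise.
    """
    rng = random.Random(seed)
    noisy = []
    for token in tokens:
        # Leave punctuation, numbers, and unselected words untouched.
        if not token.isalpha() or rng.random() >= noise_rate:
            noisy.append(token)
            continue
        pos = rng.randrange(len(token))
        op = rng.choice(("insert", "delete", "replace"))
        if op == "insert":
            token = token[:pos] + rng.choice(alphabet) + token[pos:]
        elif op == "delete" and len(token) > 1:
            token = token[:pos] + token[pos + 1:]
        else:
            # Replacement; also the fallback when a one-character word
            # was selected for deletion, so no token becomes empty.
            token = token[:pos] + rng.choice(alphabet) + token[pos + 1:]
        noisy.append(token)
    return noisy


if __name__ == "__main__":
    sentence = ["the", "weather", "is", "nice", "today", "."]
    print(add_char_noise(sentence, noise_rate=0.5, seed=1))
```

Because the sketch edits characters inside tokens and never adds or removes tokens, token-level labels such as POS tags stay aligned with the noised sentence. For a real target language, the alphabet would probably need to cover that language's character set rather than ASCII.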
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Speech Recognition and Synthesis
