Beyond the English Web: Zero-Shot Cross-Lingual and Lightweight Monolingual Classification of Registers
Liina Repo, Valtteri Skantsi, Samuel Rönnqvist, Saara Hellström, Miika Oinonen, Anna Salmela, Douglas Biber, Jesse Egbert, Sampo Pyysalo, and Veronika Laippala

TL;DR
This paper investigates zero-shot cross-lingual transfer and lightweight monolingual classification for register detection in web documents across multiple languages, introducing new corpora and demonstrating strong model performance.
Contribution
The paper introduces two new register-annotated corpora, FreCORE (French) and SweCORE (Swedish), and shows that both zero-shot transfer from the English CORE corpus and lightweight monolingual classifiers perform strongly.
Findings
Zero-shot transfer from the large English CORE corpus can match or surpass previously published monolingual models.
Lightweight monolingual classifiers trained on very little data can reach or surpass the zero-shot performance.
Certain registers remain challenging for cross-lingual transfer.
Abstract
We explore cross-lingual transfer of register classification for web documents. Registers, that is, text varieties such as blogs or news, are one of the primary predictors of linguistic variation and thus affect the automatic processing of language. We introduce two new register-annotated corpora, FreCORE and SweCORE, for French and Swedish. We demonstrate that deep pre-trained language models perform strongly in these languages and outperform the previous state of the art in English and Finnish. Specifically, we show 1) that zero-shot cross-lingual transfer from the large English CORE corpus can match or surpass previously published monolingual models, and 2) that lightweight monolingual classification requiring very little training data can reach or surpass our zero-shot performance. We further analyse classification results, finding that certain registers continue to pose challenges in…
