Automatic register identification for the open web using multilingual deep learning
Erik Henriksson, Amanda Myntti, Saara Hellström, Anni Eskelinen, Selcen Erten-Johansson, Veronika Laippala

TL;DR
This study develops multilingual deep learning models that identify web text registers across 16 languages. The best model reaches 79% F1 averaged across languages and outperforms monolingual approaches, while inherent ambiguity and hybrid texts remain challenges.
Contribution
Introduces the Multilingual CORE corpora and demonstrates effective multilingual deep learning models for register classification at scale.
Findings
The best model achieves 79% F1 averaged across languages.
Removing documents with uncertain labels (data pruning) raises F1 to over 90%.
Performance drops on unseen languages, highlighting cross-lingual challenges.
Abstract
This article presents multilingual deep learning models for identifying web registers -- text varieties such as news reports and discussion forums -- across 16 languages. We introduce the Multilingual CORE corpora, which contain over 72,000 documents annotated with a hierarchical taxonomy of 25 registers designed to cover the entire open web. Using multi-label classification, our best model achieves 79% F1 averaged across languages, matching or exceeding previous studies that used simpler classification schemes. This demonstrates that models can perform well even with a complex register scheme at multilingual scale. However, we observe a consistent performance ceiling across all models and configurations. When we remove documents with uncertain labels through data pruning, performance increases to over 90% F1, suggesting that this ceiling stems from inherent ambiguity in web registers…
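The evaluation idea in the abstract (multi-label F1, then the same score after pruning uncertain documents) can be illustrated with a minimal sketch. Everything specific here is an assumption for illustration: the register label names, the per-document `uncertainty` scores, the threshold, and the choice of micro-averaging are hypothetical, and the excerpt does not say how the authors measure label uncertainty or average F1.

```python
from typing import List, Set, Tuple


def micro_f1(gold: List[Set[str]], pred: List[Set[str]]) -> float:
    """Micro-averaged F1 for multi-label data: each document has a set of labels."""
    tp = sum(len(g & p) for g, p in zip(gold, pred))  # labels correctly predicted
    fp = sum(len(p - g) for g, p in zip(gold, pred))  # predicted but not in gold
    fn = sum(len(g - p) for g, p in zip(gold, pred))  # in gold but not predicted
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def prune(
    gold: List[Set[str]],
    pred: List[Set[str]],
    uncertainty: List[float],
    threshold: float,
) -> Tuple[List[Set[str]], List[Set[str]]]:
    """Keep only documents whose (hypothetical) uncertainty score is at most threshold."""
    keep = [i for i, u in enumerate(uncertainty) if u <= threshold]
    return [gold[i] for i in keep], [pred[i] for i in keep]


# Toy data; label names and uncertainty scores are invented for illustration.
gold = [{"news"}, {"discussion"}, {"news", "opinion"}, {"how-to"}]
pred = [{"news"}, {"discussion"}, {"news"}, {"opinion"}]
uncertainty = [0.1, 0.2, 0.7, 0.9]

full_f1 = micro_f1(gold, pred)
pruned_gold, pruned_pred = prune(gold, pred, uncertainty, threshold=0.5)
pruned_f1 = micro_f1(pruned_gold, pruned_pred)
```

In this toy setup the score rises after pruning because the two documents the classifier gets wrong are also the ones marked as uncertain, which mirrors the abstract's argument that the performance ceiling stems from ambiguous documents rather than from the models.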
Taxonomy
Topics: Natural Language Processing Techniques
Methods: Pruning
