Simple Additions, Substantial Gains: Expanding Scripts, Languages, and Lineage Coverage in URIEL+
Mason Shipton, York Hay Ng, Aditya Khan, Phuong Hanh Hoang, Xiang Lu, A. Seza Doğruöz, and En-Shiun Annie Lee

TL;DR
This paper extends the URIEL+ linguistic knowledge base by adding script vectors, expanding language coverage, and improving feature imputation, which supports better cross-lingual transfer, especially for low-resource languages.
Contribution
The paper introduces new script vectors, integrates Glottolog for broader language coverage, and improves lineage imputation, significantly reducing data sparsity in URIEL+.
Findings
Feature sparsity reduced by 14% for script vectors
Language coverage increased by up to 19,015 languages (1,007%)
Imputation quality improved by up to 35%
Abstract
The URIEL+ linguistic knowledge base supports multilingual research by encoding languages through geographic, genetic, and typological vectors. However, data sparsity (e.g. missing feature types, incomplete language entries, and limited genealogical coverage) remains prevalent. This limits the usefulness of URIEL+ in cross-lingual transfer, particularly for supporting low-resource languages. To address this sparsity, we extend URIEL+ by introducing script vectors to represent writing system properties for 7,488 languages, integrating Glottolog to add 18,710 additional languages, and expanding lineage imputation for 26,449 languages by propagating typological and script features across genealogies. These improvements reduce feature sparsity by 14% for script vectors, increase language coverage by up to 19,015 languages (1,007%), and boost imputation quality metrics by up to 35%. Our…
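To make the idea of lineage imputation concrete, the following is a minimal sketch of propagating feature values across a genealogy. It is not the URIEL+ implementation: the function names (`impute_from_lineage`, `sparsity`), the toy language names, the feature values, and the choice to fill a missing value from the nearest ancestor that has one are all assumptions made for illustration.

```python
from typing import Dict, List, Optional

# None marks a missing feature value (hypothetical toy representation).
FeatureVector = List[Optional[float]]


def impute_from_lineage(
    features: Dict[str, FeatureVector],
    parent: Dict[str, str],
) -> Dict[str, FeatureVector]:
    """Fill each missing value from the nearest ancestor that has one.

    Assumption: values flow from ancestor to descendant only, and the
    original (un-imputed) ancestor vectors are used as the source.
    """
    imputed: Dict[str, FeatureVector] = {}
    for lang, vec in features.items():
        filled = list(vec)
        seen = {lang}  # guards against accidental cycles in `parent`
        ancestor = parent.get(lang)
        while (
            ancestor is not None
            and ancestor not in seen
            and any(v is None for v in filled)
        ):
            seen.add(ancestor)
            anc_vec = features.get(ancestor)
            if anc_vec is not None:
                # Keep known values; borrow the ancestor's value otherwise.
                filled = [a if f is None else f for f, a in zip(filled, anc_vec)]
            ancestor = parent.get(ancestor)
        imputed[lang] = filled
    return imputed


def sparsity(features: Dict[str, FeatureVector]) -> float:
    """Fraction of feature cells that are missing."""
    total = sum(len(vec) for vec in features.values())
    missing = sum(v is None for vec in features.values() for v in vec)
    return missing / total if total else 0.0


# Hypothetical three-level genealogy: family -> branch -> {lang_a, lang_b}.
toy_features = {
    "family": [1.0, 0.0, None],
    "branch": [None, 1.0, 1.0],
    "lang_a": [None, None, 0.0],
    "lang_b": [None, None, None],
}
toy_parent = {"branch": "family", "lang_a": "branch", "lang_b": "branch"}

toy_imputed = impute_from_lineage(toy_features, toy_parent)
print(sparsity(toy_features), sparsity(toy_imputed))
```

In this toy setup, a language with no features of its own (`lang_b`) inherits everything from its branch and family, while a language with partial data (`lang_a`) keeps its observed values and borrows only what is missing. The root keeps its gap because it has no ancestor to borrow from.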
Taxonomy
Topics: Language and cultural evolution · Natural Language Processing Techniques · Multilingual Education and Policy
