data2lang2vec: Data Driven Typological Features Completion

Hamidreza Amirzadeh; Sadegh Jafari; Anika Harju; Rob van der Goot

arXiv:2409.17373·cs.CL·September 27, 2024

TL;DR

This paper presents a data-driven approach to complete linguistic typological features using textual data, significantly improving coverage and accuracy over previous methods in multilingual NLP applications.

Contribution

It introduces a multilingual POS tagger and a novel evaluation setup to predict missing typological features more accurately using textual data.

Findings

01 Achieved over 70% accuracy in POS tagging across 1,749 languages.

02 Outperformed previous feature prediction methods in coverage and accuracy.

03 Demonstrated the effectiveness of external statistical features and machine learning algorithms.

Abstract

Language typology databases enhance multilingual Natural Language Processing (NLP) by improving model adaptability to diverse linguistic structures. The widely used lang2vec toolkit integrates several such databases, but its coverage remains limited at 28.9%. Previous work on automatically increasing coverage either predicts missing values based on features from other languages or focuses on single features; we propose to use textual data for better-informed feature prediction. To this end, we introduce a multilingual Part-of-Speech (POS) tagger, achieving over 70% accuracy across 1,749 languages, and experiment with external statistical features and a variety of machine learning algorithms. We also introduce a more realistic evaluation setup, focusing on typology features that are likely to be missing, and show that our approach outperforms previous work in both setups.
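
To make the two central notions in the abstract concrete, the following is a minimal sketch of (a) coverage as the share of known cells in a language-by-feature matrix and (b) filling a missing cell from text-derived statistics. It is not the authors' implementation: the toy matrix, the `TEXT_STATS` vectors, and the function names are all hypothetical, and a plain 1-nearest-neighbour lookup stands in for the machine learning algorithms the abstract mentions.

```python
from typing import Optional

# Toy typology matrix: rows are languages, columns are binary features.
# None marks a missing value. All values are invented for illustration.
TYPOLOGY: dict[str, list[Optional[int]]] = {
    "lang_a": [1, 0, None, 1],
    "lang_b": [None, None, 0, 0],
    "lang_c": [1, None, None, None],
}

# Hypothetical text-derived statistics per language (e.g. POS-tag ratios).
TEXT_STATS: dict[str, list[float]] = {
    "lang_a": [0.30, 0.10],
    "lang_b": [0.05, 0.40],
    "lang_c": [0.28, 0.12],
}


def coverage(matrix: dict[str, list[Optional[int]]]) -> float:
    """Fraction of (language, feature) cells that have a known value."""
    cells = [value for row in matrix.values() for value in row]
    return sum(value is not None for value in cells) / len(cells)


def predict_missing(
    lang: str,
    feature_idx: int,
    matrix: dict[str, list[Optional[int]]],
    stats: dict[str, list[float]],
) -> Optional[int]:
    """Copy the feature value from the language whose text statistics are
    closest (squared Euclidean distance) among those that have the feature."""
    candidates = [
        other
        for other, row in matrix.items()
        if other != lang and row[feature_idx] is not None
    ]
    if not candidates:
        return None  # no language provides this feature, so no guess is made

    def distance(other: str) -> float:
        return sum((a - b) ** 2 for a, b in zip(stats[lang], stats[other]))

    nearest = min(candidates, key=distance)
    return matrix[nearest][feature_idx]
```

In this toy setting, `coverage(TYPOLOGY)` reports how sparse the matrix is, and `predict_missing("lang_c", 3, TYPOLOGY, TEXT_STATS)` borrows the value from `lang_a`, whose statistics are closest to those of `lang_c`. A realistic system would train a classifier per feature and evaluate it on held-out values rather than copy from a single neighbour.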

Code & Models

Repositories

hamid-amir/data_lang2vec (Official)

Taxonomy

Topics: Natural Language Processing Techniques