From Phonology to Syntax: Unsupervised Linguistic Typology at Different Levels with Language Embeddings

Johannes Bjerva; Isabelle Augenstein

arXiv:1802.09375·cs.CL·February 27, 2018

TL;DR

This paper develops unsupervised language embeddings across phonology, morphology, and syntax for over 800 languages, enabling typological predictions and revealing how linguistic similarities are encoded in NLP models.

Contribution

The paper introduces a multilingual, unsupervised approach to learning language representations at multiple typological levels, which supports typological classification and analysis.

Findings

01

Language embeddings encode typological similarities and differences.

02

WALS features are predicted with high accuracy, even for unseen language families.

03

Distinct embeddings reflect phonological, morphological, and syntactic properties.

Abstract

A core part of linguistic typology is the classification of languages according to linguistic properties, such as those detailed in the World Atlas of Language Structures (WALS). Doing this manually is prohibitively time-consuming, which is in part evidenced by the fact that only 100 out of over 7,000 languages spoken in the world are fully covered in WALS. We learn distributed language representations, which can be used to predict typological properties on a massively multilingual scale. Additionally, quantitative and qualitative analyses of these language embeddings can tell us how language similarities are encoded in NLP models for tasks at different typological levels. The representations are learned in an unsupervised manner alongside tasks at three typological levels: phonology (grapheme-to-phoneme prediction, and phoneme reconstruction), morphology (morphological inflection),…
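To make the idea of predicting typological properties from language embeddings concrete, here is a minimal sketch. It assumes a k-nearest-neighbour vote with cosine similarity, which is a choice made for this illustration and is not confirmed by the text above. The three-dimensional vectors, the word-order labels, and the helper names `cosine` and `predict_feature` are all invented for the example; they are not the paper's data or code.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def predict_feature(query, labelled, k=3):
    """Majority vote over the k labelled languages most similar to `query`.

    `labelled` is a list of (embedding, feature_value) pairs for languages
    whose typological feature value is already known.
    """
    ranked = sorted(labelled, key=lambda item: cosine(query, item[0]), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Toy embeddings and labels, invented purely for illustration.
labelled = [
    ([0.9, 0.1, 0.0], "SVO"),
    ([0.8, 0.2, 0.1], "SVO"),
    ([0.7, 0.3, 0.0], "SVO"),
    ([0.1, 0.9, 0.2], "SOV"),
    ([0.0, 0.8, 0.3], "SOV"),
    ([0.2, 0.7, 0.1], "SOV"),
]

# A language with no recorded feature value gets the label of its neighbours.
prediction = predict_feature([0.85, 0.15, 0.05], labelled)
```

In a setting like the one the abstract describes, the labelled pairs would come from languages whose feature values are recorded in WALS, and the query would be the learned embedding of a language whose value is missing. To approximate the unseen-family case, all languages from the query's family would be left out of `labelled`.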
