Anything Goes? A Crosslinguistic Study of (Im)possible Language Learning in LMs
Xiulin Yang, Tatsuya Aoyama, Yuekun Yao, Ethan Wilcox

TL;DR
This study tests whether language models can distinguish natural languages from impossible ones across multiple language families. The models show some human-like biases, but they separate the two classes less precisely than human learners do.
Contribution
The paper introduces a crosslinguistic experimental framework, built on two newly constructed parallel corpora, for testing LMs on impossible languages. It extends earlier English-only work to 12 languages from 4 language families.
Findings
GPT-2 small distinguishes most attested from impossible languages.
The model does not perfectly separate all attested and impossible languages.
Perplexity scores do not distinguish attested from unattested word orders.
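The findings above rest on comparing perplexity between language variants. As a rough illustration only, and not the authors' evaluation code, the sketch below computes perplexity as the exponential of the mean negative log-likelihood per token. The function name and the probability values are hypothetical, chosen so the arithmetic is easy to follow.

```python
import math

def perplexity(token_logprobs):
    """Perplexity = exp(mean negative log-likelihood), using natural logs."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

# Hypothetical per-token probabilities, NOT results from the paper.
attested_logprobs = [math.log(0.5)] * 4     # model assigns p = 0.5 to each token
impossible_logprobs = [math.log(0.25)] * 4  # model assigns p = 0.25 to each token

ppl_attested = perplexity(attested_logprobs)
ppl_impossible = perplexity(impossible_logprobs)

# A lower perplexity means the model found that variant easier to predict.
print(ppl_attested < ppl_impossible)
```

Under this reading, a model "distinguishes" two variants when their perplexities differ reliably. Similar perplexities, as reported for attested versus unattested word orders, mean the model found both about equally easy to predict.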
Abstract
Do language models (LMs) offer insights into human language learning? A common argument against this idea is that because their architecture and training paradigm are so vastly different from humans, LMs can learn arbitrary inputs as easily as natural languages. We test this claim by training LMs to model impossible and typologically unattested languages. Unlike previous work, which has focused exclusively on English, we conduct experiments on 12 languages from 4 language families with two newly constructed parallel corpora. Our results show that while GPT-2 small can largely distinguish attested languages from their impossible counterparts, it does not achieve perfect separation between all the attested languages and all the impossible ones. We further test whether GPT-2 small distinguishes typologically attested from unattested languages with different NP orders by manipulating word…
Taxonomy
Topics: Neurobiology of Language and Bilingualism · Language and cultural evolution · Topic Modeling
