Comparing Fifty Natural Languages and Twelve Genetic Languages Using Word Embedding Language Divergence (WELD) as a Quantitative Measure of Language Distance
Ehsaneddin Asgari, Mohammad R.K. Mofrad

TL;DR
This paper introduces WELD, a quantitative distance measure built on word embeddings, and uses it to compare natural languages and genetic languages. Languages from the same family tend to cluster together, and plant genomes separate from human and animal genomes.
Contribution
The paper defines WELD, a divergence measure between languages computed from word embeddings, and applies it to fifty natural languages (sentence-aligned Bible translations) and to the coding regions of twelve genomes.
Findings
Languages within the same family tend to cluster together.
Significant differences are observed between human/animal and plant genetic languages.
The same measure, WELD, yields both results: family-level clustering for natural languages and group-level separation for genomes.
Abstract
We introduce a new measure of distance between languages based on word embedding, called word embedding language divergence (WELD). WELD is defined as the divergence between the unified similarity distributions of words between languages. Using this measure, we perform language comparison for fifty natural languages and twelve genetic languages. Our natural language dataset is a collection of sentence-aligned parallel corpora from Bible translations for fifty languages spanning a variety of language families. Although we use parallel corpora, which guarantees having the same content in all languages, interestingly in many cases languages within the same family cluster together. In addition to natural languages, we perform language comparison for the coding regions in the genomes of twelve different organisms (four plants, six animals, and two human subjects). Our result confirms a significant…
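The abstract defines WELD only in words, so the following is a minimal sketch of one way such a divergence could be computed, not the authors' implementation. It assumes that the two embedding matrices are row-aligned (row i in each language is a translation pair), that cosine similarity is the word-similarity function, and that Jensen-Shannon divergence is the divergence. All function names here (`similarity_distribution`, `js_divergence`, `weld_sketch`) are hypothetical.

```python
import numpy as np

def similarity_distribution(emb):
    """Turn pairwise word similarities into a probability distribution.

    emb: array of shape (n_words, dim); rows are assumed to be aligned
    across languages (an assumption of this sketch).
    """
    unit = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    sim = unit @ unit.T                          # cosine similarity matrix
    upper = np.triu_indices(len(emb), k=1)       # each unordered word pair once
    vals = (sim[upper] + 1.0) / 2.0 + 1e-12      # map [-1, 1] to (0, 1]
    return vals / vals.sum()                     # normalize to sum to 1

def js_divergence(p, q):
    """Jensen-Shannon divergence in bits (bounded between 0 and 1)."""
    m = 0.5 * (p + q)

    def kl(a, b):
        mask = a > 0
        return np.sum(a[mask] * np.log2(a[mask] / b[mask]))

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def weld_sketch(emb_a, emb_b):
    """Divergence between the similarity distributions of two languages."""
    return js_divergence(similarity_distribution(emb_a),
                         similarity_distribution(emb_b))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    lang_a = rng.normal(size=(20, 8))   # toy embeddings, not real data
    lang_b = rng.normal(size=(20, 8))
    print(weld_sketch(lang_a, lang_a))  # identical languages give zero
    print(weld_sketch(lang_a, lang_b))
```

Because only pairwise similarities within each language enter the computation, the two embedding spaces never need to be mapped onto each other; a matrix of such pairwise divergences could then be fed to any standard clustering routine.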
