A Corpus for Large-Scale Phonetic Typology
Elizabeth Salesky, Eleanor Chodroff, Tiago Pimentel, Matthew Wiesner, Ryan Cotterell, Alan W Black, Jason Eisner

TL;DR
VoxClamantis v1.0 is a large-scale, publicly available phonetic corpus covering 635 languages, enabling extensive phonetic typology research through aligned segments, phoneme labels, and acoustic measures.
Contribution
This work introduces the first large-scale phonetic typology corpus with aligned segments and phoneme labels across hundreds of languages, facilitating cross-linguistic phonetic analysis.
Findings
The corpus covers 690 readings spanning 635 languages.
It provides aligned segments and estimated phoneme-level labels, along with acoustic-phonetic measures of vowels and sibilants.
A series of case studies illustrates possible research directions.
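To make "aligned segments with phoneme-level labels" concrete, here is a minimal sketch of how one such alignment might be held in memory and summarized. The `Segment` class, its field names, and the toy timings are illustrative assumptions, not the corpus's actual file format or values.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Segment:
    """One time-aligned segment (hypothetical layout, not the corpus schema)."""
    label: str    # estimated phoneme label
    start: float  # start time in seconds
    end: float    # end time in seconds

    @property
    def duration(self) -> float:
        return self.end - self.start


def mean_duration(segments: Iterable[Segment], label: str) -> Optional[float]:
    """Average duration of all segments carrying `label`, or None if absent."""
    durations = [s.duration for s in segments if s.label == label]
    return sum(durations) / len(durations) if durations else None


# Made-up toy alignment, not taken from the corpus.
utterance = [
    Segment("s", 0.00, 0.12),
    Segment("a", 0.12, 0.25),
    Segment("s", 0.25, 0.35),
]

print(mean_duration(utterance, "s"))
```

Per-segment summaries of this kind (durations here, or acoustic measures of vowels and sibilants) are what a cross-linguistic comparison would aggregate by language.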
Abstract
A major hurdle in data-driven research on typology is having sufficient data in many languages to draw meaningful conclusions. We present VoxClamantis v1.0, the first large-scale corpus for phonetic typology, with aligned segments and estimated phoneme-level labels in 690 readings spanning 635 languages, along with acoustic-phonetic measures of vowels and sibilants. Access to such data can greatly facilitate investigation of phonetic typology at a large scale and across many languages. However, it is non-trivial and computationally intensive to obtain such alignments for hundreds of languages, many of which have few to no resources presently available. We describe the methodology to create our corpus, discuss caveats with current methods and their impact on the utility of this data, and illustrate possible research directions through a series of case studies on the 48 highest-quality…
