Generating Feature Vectors from Phonetic Transcriptions in Cross-Linguistic Data Formats
Arne Rubehn, Jessica Nieder, Robert Forkel, Johann-Mattis List

TL;DR
This paper introduces a dynamic method to generate binary feature vectors for speech sounds from the International Phonetic Alphabet, enabling large-scale cross-linguistic comparisons and machine learning applications.
Contribution
It presents a novel approach to create comprehensive feature vectors for all sounds in the CLTS standard, addressing missing data issues in cross-linguistic phonetic analysis.
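To make the idea of generating vectors dynamically more concrete, here is a minimal sketch. It is not the authors' implementation. It assumes that each sound comes with a space-separated descriptive name in the style of CLTS (for example "voiced bilabial stop consonant"), and it uses a tiny, hypothetical feature inventory (`FEATURES`) and term mapping (`TERM_TO_FEATURES`) invented for illustration. Values are simplified to 0/1.

```python
from typing import Dict, List

# Hypothetical, deliberately tiny feature inventory (not the paper's feature set).
FEATURES: List[str] = [
    "consonantal", "voiced", "nasal", "labial", "coronal", "dorsal", "continuant",
]

# Hypothetical mapping from descriptive terms to the feature values they set.
TERM_TO_FEATURES: Dict[str, Dict[str, int]] = {
    "consonant": {"consonantal": 1},
    "voiced": {"voiced": 1},
    "voiceless": {"voiced": 0},
    "nasal": {"nasal": 1, "continuant": 0},
    "bilabial": {"labial": 1},
    "alveolar": {"coronal": 1},
    "velar": {"dorsal": 1},
    "stop": {"continuant": 0},
    "fricative": {"continuant": 1},
}


def feature_vector(description: str) -> List[int]:
    """Build a 0/1 vector from a space-separated sound description."""
    # Start with every feature switched off, then let each term set its values.
    values = {name: 0 for name in FEATURES}
    for term in description.split():
        if term not in TERM_TO_FEATURES:
            raise KeyError(f"unknown descriptive term: {term!r}")
        values.update(TERM_TO_FEATURES[term])
    return [values[name] for name in FEATURES]


if __name__ == "__main__":
    for name in ("voiced bilabial stop consonant",
                 "voiceless velar fricative consonant"):
        print(name, feature_vector(name))
```

Because the vector is derived from the description rather than looked up in a fixed table, any sound whose description uses known terms gets a vector, which is the property the summary above refers to as avoiding missing data.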
Findings
The feature vectors allow effective comparison of speech sound similarities across languages.
The approach supports large multilingual datasets with over 2,000 language varieties.
It has the potential to enhance cross-linguistic machine learning applications.
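As an illustration of how binary feature vectors can be used to compare sounds, the following sketch computes the share of matching positions between two vectors. This is a generic measure chosen for simplicity, not necessarily the one used in the paper. The three vectors are hypothetical, hand-written examples over an assumed feature order (consonantal, voiced, nasal, labial, coronal, dorsal, continuant).

```python
from typing import Sequence


def hamming_similarity(a: Sequence[int], b: Sequence[int]) -> float:
    """Return the fraction of positions at which two equal-length vectors agree."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    if not a:
        raise ValueError("vectors must not be empty")
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return matches / len(a)


# Hypothetical vectors, invented for illustration only.
SOUND_B = [1, 1, 0, 1, 0, 0, 0]  # stands in for a voiced bilabial stop
SOUND_P = [1, 0, 0, 1, 0, 0, 0]  # stands in for a voiceless bilabial stop
SOUND_X = [1, 0, 0, 0, 0, 1, 1]  # stands in for a voiceless velar fricative

if __name__ == "__main__":
    print(hamming_similarity(SOUND_B, SOUND_P))
    print(hamming_similarity(SOUND_B, SOUND_X))
```

Under these assumed vectors, the two bilabial stops differ in a single feature and therefore score higher than the stop and fricative pair, which differ in four.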
Abstract
When comparing speech sounds across languages, scholars often make use of feature representations of individual sounds in order to determine fine-grained sound similarities. Although binary feature systems for large numbers of speech sounds have been proposed, large-scale computational applications often face the challenge that the proposed feature systems, even if they list features for several thousand sounds, only cover a smaller part of the numerous speech sounds reflected in actual cross-linguistic data. In order to address the problem of missing data for attested speech sounds, we propose a new approach that can create binary feature vectors dynamically for all sounds that can be represented in the standardized version of the International Phonetic Alphabet proposed by the Cross-Linguistic Transcription Systems (CLTS) reference catalog. Since CLTS is actively used in…
Taxonomy
Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques
