findsylls: A Language-Agnostic Toolkit for Syllable-Level Speech Tokenization and Embedding
Héctor Javier Vázquez Martínez

TL;DR
findsylls is a toolkit that unifies syllable segmentation and embedding methods across languages under a common interface, supporting reproducible research in speech modeling.
Contribution
It introduces a modular, language-agnostic toolkit that consolidates classical and end-to-end syllabification methods for consistent evaluation and comparison.
Findings
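The toolkit is demonstrated on English and Spanish corpora and on new hand-annotated data from Kono, an underdocumented Central Mande language.
Standardized implementations enable controlled comparisons of representations, algorithms, and token rates.
A single framework supports both high-resource and under-resourced languages.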
Abstract
Syllable-level units offer compact and linguistically meaningful representations for spoken language modeling and unsupervised word discovery, but research on syllabification remains fragmented across disparate implementations, datasets, and evaluation protocols. We introduce findsylls, a modular, language-agnostic toolkit that unifies classical syllable detectors and end-to-end syllabifiers under a common interface for syllable segmentation, embedding extraction, and multi-granular evaluation. The toolkit implements and standardizes widely used methods (e.g., Sylber, VG-HuBERT) and allows their components to be recombined, enabling controlled comparisons of representations, algorithms, and token rates. We demonstrate findsylls on English and Spanish corpora and on new hand-annotated data from Kono, an underdocumented Central Mande language, illustrating how a single framework can…
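To make the idea of recombinable components and a shared evaluation concrete, here is a minimal Python sketch. It does not use or reproduce the findsylls API: every name in it (`compose`, `rectified_envelope`, `local_minima_segmenter`, `boundary_f1`) is hypothetical, the envelope and minima picker are toy stand-ins for real detectors, and the greedy tolerance-based boundary matching is an assumed, simplified scoring scheme rather than the toolkit's evaluation protocol.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical component types (assumptions, not the findsylls interface).
Featurizer = Callable[[Sequence[float]], List[float]]
Segmenter = Callable[[Sequence[float], float], List[float]]


def compose(featurizer: Featurizer, segmenter: Segmenter):
    """Pair any featurizer with any segmenter behind one call signature."""
    def pipeline(signal: Sequence[float], frame_rate: float) -> List[float]:
        return segmenter(featurizer(signal), frame_rate)
    return pipeline


def rectified_envelope(signal: Sequence[float]) -> List[float]:
    """Toy feature: absolute value of each frame."""
    return [abs(x) for x in signal]


def local_minima_segmenter(envelope: Sequence[float], frame_rate: float) -> List[float]:
    """Toy segmenter: place a boundary (in seconds) at each interior local minimum."""
    return [
        i / frame_rate
        for i in range(1, len(envelope) - 1)
        if envelope[i] < envelope[i - 1] and envelope[i] <= envelope[i + 1]
    ]


def boundary_f1(
    predicted: Sequence[float],
    reference: Sequence[float],
    tolerance: float = 0.05,
) -> Tuple[float, float, float]:
    """Precision, recall, and F1 with greedy one-to-one matching within a tolerance."""
    used = set()
    hits = 0
    for p in predicted:
        for i, r in enumerate(reference):
            if i not in used and abs(p - r) <= tolerance:
                used.add(i)
                hits += 1
                break
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(reference) if reference else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    pipeline = compose(rectified_envelope, local_minima_segmenter)
    predicted = pipeline([0, 1, -0.2, 1, 0.1, -1, 0], 10)
    print(predicted, boundary_f1(predicted, [0.21, 0.6]))
```

Because every pipeline returns boundary times in the same form, swapping the featurizer or the segmenter leaves the scoring call unchanged, which is the kind of controlled comparison the abstract describes.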
