findsylls: A Language-Agnostic Toolkit for Syllable-Level Speech Tokenization and Embedding

Héctor Javier Vázquez Martínez

arXiv:2603.26292 · cs.CL · March 30, 2026

TL;DR

findsylls is a toolkit that unifies syllable segmentation and embedding methods under a common interface across languages, supporting reproducible research in speech modeling.

Contribution

The paper introduces a modular, language-agnostic toolkit that consolidates classical syllable detectors and end-to-end syllabifiers so they can be evaluated and compared consistently.

Findings

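01

The toolkit is demonstrated on English and Spanish corpora and on new hand-annotated data from Kono, an underdocumented Central Mande language.

02

Standardized, recombinable implementations enable controlled comparisons of representations, algorithms, and token rates.

03

A single framework covers both high-resource languages (English, Spanish) and an under-resourced one (Kono).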

Abstract

Syllable-level units offer compact and linguistically meaningful representations for spoken language modeling and unsupervised word discovery, but research on syllabification remains fragmented across disparate implementations, datasets, and evaluation protocols. We introduce findsylls, a modular, language-agnostic toolkit that unifies classical syllable detectors and end-to-end syllabifiers under a common interface for syllable segmentation, embedding extraction, and multi-granular evaluation. The toolkit implements and standardizes widely used methods (e.g., Sylber, VG-HuBERT) and allows their components to be recombined, enabling controlled comparisons of representations, algorithms, and token rates. We demonstrate findsylls on English and Spanish corpora and on new hand-annotated data from Kono, an underdocumented Central Mande language, illustrating how a single framework can…
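To make the idea of recombinable components concrete, here is a minimal Python sketch of a pipeline in which a segmenter and a pooling function are swapped independently and the resulting token rate is reported. It is not the findsylls API: every name below (`energy_valley_segmenter`, `mean_pool`, `SyllablePipeline`, `frame_rate_hz`) is hypothetical, the energy-threshold segmenter is a toy stand-in for a real syllable detector, and the 50 Hz frame rate is an assumed default.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

# Hypothetical types for illustration; not taken from findsylls.
Segment = Tuple[int, int]            # [start, end) in frame indices
Frames = Sequence[Sequence[float]]   # one feature vector per frame


def energy_valley_segmenter(frames: Frames, threshold: float = 0.5) -> List[Segment]:
    """Toy segmenter: a segment is a maximal run of frames whose mean
    squared amplitude is at or above `threshold`."""
    segments: List[Segment] = []
    start = None
    for i, frame in enumerate(frames):
        energy = sum(x * x for x in frame) / len(frame)
        if energy >= threshold and start is None:
            start = i                      # segment opens
        elif energy < threshold and start is not None:
            segments.append((start, i))    # segment closes at a valley
            start = None
    if start is not None:
        segments.append((start, len(frames)))
    return segments


def mean_pool(frames: Frames, segment: Segment) -> List[float]:
    """Average the frame vectors inside one segment into a single embedding."""
    start, end = segment
    n = end - start
    dim = len(frames[0])
    return [sum(frames[t][d] for t in range(start, end)) / n for d in range(dim)]


@dataclass
class SyllablePipeline:
    """Pairs any segmenter with any pooling function (assumed interface)."""
    segmenter: Callable[[Frames], List[Segment]]
    pooler: Callable[[Frames, Segment], List[float]]
    frame_rate_hz: float = 50.0  # assumed frame rate of the input features

    def tokenize(self, frames: Frames):
        segments = self.segmenter(frames)
        embeddings = [self.pooler(frames, s) for s in segments]
        duration_s = len(frames) / self.frame_rate_hz
        token_rate = len(segments) / duration_s if duration_s > 0 else 0.0
        return segments, embeddings, token_rate


if __name__ == "__main__":
    toy_frames = [[1.0, 1.0], [1.0, 3.0], [0.0, 0.0], [2.0, 2.0]]
    pipeline = SyllablePipeline(energy_valley_segmenter, mean_pool)
    print(pipeline.tokenize(toy_frames))
```

Because the segmenter and pooler are plain callables, replacing either one leaves the rest of the pipeline unchanged, which is the kind of controlled comparison the abstract describes.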
