The Influence of Dataset Partitioning on Dysfluency Detection Systems

Sebastian P. Bayerl, Dominik Wagner, Elmar Nöth, Tobias Bocklet, and Korbinian Riedhammer

arXiv:2206.03400 · eess.AS · October 31, 2022

TL;DR

This study examines how different dataset partitioning strategies affect the performance evaluation of dysfluency detection systems, highlighting dataset biases and proposing new splits for more reliable assessment.

Contribution

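The paper introduces SEP-28k-Extended (SEP-28k-E), which adds semi-automatically generated speaker and gender information to the SEP-28k corpus, and proposes new data splits that reduce the effect of dataset bias on the evaluation of dysfluency detection methods.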

Findings

1. Performance varies significantly across data splits.

2. The original dataset is dominated by a few speakers, which makes evaluation difficult.

3. The proposed new splits enable a more robust and fair assessment.

Abstract

This paper empirically investigates the influence of different data splits and splitting strategies on the performance of dysfluency detection systems. To this end, we perform experiments using wav2vec 2.0 models with a classification head, as well as support vector machines (SVMs) trained on features extracted from the wav2vec 2.0 model, to detect dysfluencies. We train and evaluate the systems with different non-speaker-exclusive and speaker-exclusive splits of the Stuttering Events in Podcasts (SEP-28k) dataset to shed light on the variability of results with respect to the partitioning method used. Furthermore, we show that the SEP-28k dataset is dominated by only a few speakers, making it difficult to evaluate. To remedy this problem, we created SEP-28k-Extended (SEP-28k-E), containing semi-automatically generated speaker and gender information for the SEP-28k corpus, and…
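To make the distinction between non-speaker-exclusive and speaker-exclusive splits concrete, the following is a minimal sketch in plain Python. It is not the authors' partitioning code: the function names `speaker_exclusive_split` and `speaker_shares`, the `test_fraction` parameter, and the toy speaker labels are all assumptions made for illustration. The idea is that whole speakers, rather than individual clips, are assigned to a partition, so no speaker is heard in both training and test data.

```python
import random
from collections import Counter, defaultdict


def speaker_exclusive_split(speaker_ids, test_fraction=0.2, seed=0):
    """Split clip indices so that no speaker occurs in both partitions.

    speaker_ids[i] is the (hypothetical) speaker label of clip i.
    Returns two lists of clip indices: (train, test).
    """
    # Group clip indices by speaker.
    by_speaker = defaultdict(list)
    for idx, spk in enumerate(speaker_ids):
        by_speaker[spk].append(idx)

    # Shuffle speakers (not clips) reproducibly.
    speakers = sorted(by_speaker)
    random.Random(seed).shuffle(speakers)

    # Move whole speakers into the test set until it holds roughly
    # test_fraction of all clips; everyone else goes to training.
    target = test_fraction * len(speaker_ids)
    train, test = [], []
    for spk in speakers:
        if len(test) < target:
            test.extend(by_speaker[spk])
        else:
            train.extend(by_speaker[spk])
    return train, test


def speaker_shares(speaker_ids):
    """Fraction of all clips contributed by each speaker, largest first."""
    counts = Counter(speaker_ids)
    total = len(speaker_ids)
    return {spk: n / total for spk, n in counts.most_common()}


if __name__ == "__main__":
    # Toy data: one speaker contributes most of the clips.
    ids = ["a"] * 6 + ["b"] * 2 + ["c", "d"]
    train, test = speaker_exclusive_split(ids, test_fraction=0.2, seed=1)
    print("train speakers:", sorted({ids[i] for i in train}))
    print("test speakers:", sorted({ids[i] for i in test}))
    print("shares:", speaker_shares(ids))
```

With a dominant speaker, the realized test size can land far from the requested fraction depending on which partition that speaker falls into. This is one way to see why a corpus dominated by a few speakers is hard to partition and evaluate fairly.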

Code & Models

Repositories

th-nuernberg/ml-stuttering-events-dataset-extended (official)
