Filler Word Detection and Classification: A Dataset and Benchmark
Ge Zhu, Juan-Pablo Caceres, Justin Salamon

TL;DR
This paper introduces PodcastFillers, a large annotated dataset for filler word detection in podcasts, and proposes a pipeline combining voice activity detection (VAD) and automatic speech recognition (ASR) that achieves state-of-the-art results, facilitating future research in this area.
Contribution
The paper provides the first large-scale annotated dataset for filler words and presents a novel detection pipeline that outperforms keyword spotting methods.
Findings
The proposed pipeline achieves state-of-the-art filler word detection results on PodcastFillers.
Leveraging ASR significantly improves filler word classification.
The dataset and benchmark facilitate future research in filler word detection.
Abstract
Filler words such as "uh" or "um" are sounds or words people use to signal they are pausing to think. Finding and removing filler words from recordings is a common and tedious task in media editing. Automatically detecting and classifying filler words could greatly aid in this task, but few studies have been published on this problem to date. A key reason is the absence of a dataset with annotated filler words for model training and evaluation. In this work, we present a novel speech dataset, PodcastFillers, with 35K annotated filler words and 50K annotations of other sounds that commonly occur in podcasts such as breaths, laughter, and word repetitions. We propose a pipeline that leverages VAD and ASR to detect filler candidates and a classifier to distinguish between filler word types. We evaluate our proposed pipeline on PodcastFillers, compare to several baselines, and present a…
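The abstract does not spell out how VAD and ASR are combined to find filler candidates. One plausible reading, offered here only as an illustrative sketch and not as the authors' implementation, is that a candidate is any stretch the VAD marks as speech but the ASR does not cover with a transcribed word. The function name `filler_candidates`, the interval representation, and the `min_dur` threshold are all hypothetical.

```python
from typing import List, Tuple

Interval = Tuple[float, float]  # (start_sec, end_sec)


def filler_candidates(
    speech: List[Interval],
    words: List[Interval],
    min_dur: float = 0.1,  # hypothetical threshold, not taken from the paper
) -> List[Interval]:
    """Return the parts of VAD speech regions not covered by ASR word intervals.

    Assumption: `speech` comes from some VAD and `words` from some ASR system
    with word-level timestamps. Neither is tied to a specific library here.
    """
    words = sorted(words)
    gaps: List[Interval] = []
    for s_start, s_end in sorted(speech):
        cursor = s_start  # left edge of the not-yet-covered part of this region
        for w_start, w_end in words:
            if w_end <= cursor:
                continue  # word ends before the uncovered part begins
            if w_start >= s_end:
                break  # words are sorted, so no later word can overlap
            if w_start > cursor:
                gaps.append((cursor, w_start))  # speech with no transcribed word
            cursor = max(cursor, w_end)
        if cursor < s_end:
            gaps.append((cursor, s_end))  # trailing uncovered speech
    # Drop gaps too short to plausibly contain a filler word.
    return [(a, b) for a, b in gaps if b - a >= min_dur]


if __name__ == "__main__":
    vad_regions = [(0.0, 3.0)]
    asr_words = [(0.5, 1.0), (1.5, 3.0)]
    print(filler_candidates(vad_regions, asr_words))
```

In a pipeline of the kind the abstract describes, each returned interval would then be passed to a classifier that decides whether it is a filler word (and which type) or another sound such as a breath or laughter.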
Taxonomy
Topics: Music and Audio Processing · Radio, Podcasts, and Digital Media · Speech Recognition and Synthesis
