Artificial Disfluency Detection, Uh No, Disfluency Generation for the Masses
T. Passali, T. Mavropoulos, G. Tsoumakas, G. Meditskos, S. Vrochidis

TL;DR
This paper introduces LARD, a method for generating realistic artificial disfluencies from fluent text, which augments training data for disfluency detection models, especially when annotated data is scarce.
Contribution
LARD is the first approach to automatically generate all types of disfluencies using contextual embeddings, enabling effective training without annotated disfluent datasets.
Findings
LARD produces realistic disfluencies that improve detection accuracy.
Training on LARD-generated data improves model performance when real annotated data is limited.
LARD simulates the different disfluency types (repetitions, replacements and restarts) in a context-aware manner.
Abstract
Existing approaches for disfluency detection typically require the existence of large annotated datasets. However, current datasets for this task are limited, suffer from class imbalance, and lack some types of disfluencies that can be encountered in real-world scenarios. This work proposes LARD, a method for automatically generating artificial disfluencies from fluent text. LARD can simulate all the different types of disfluencies (repetitions, replacements and restarts) based on the reparandum/interregnum annotation scheme. In addition, it incorporates contextual embeddings into the disfluency generation to produce realistic context-aware artificial disfluencies. Since the proposed method requires only fluent text, it can be used directly for training, bypassing the requirement of annotated disfluent data. Our empirical evaluation demonstrates that LARD can indeed be effectively used…
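To make the reparandum/interregnum idea concrete, here is a minimal sketch of how a repetition and a replacement could be produced from a fluent sentence. It is not LARD's implementation: the function names, the `F`/`D` (fluent/disfluent) token labels, and the hand-supplied `wrong_word` are all assumptions made for illustration. In particular, the abstract says LARD uses contextual embeddings during generation, whereas this sketch simply takes the substituted word as an argument.

```python
from typing import List, Sequence, Tuple

Labeled = Tuple[List[str], List[str]]


def make_repetition(tokens: List[str], start: int, degree: int) -> Labeled:
    """Repeat tokens[start:start+degree] once.

    The first copy is the reparandum (labeled "D"); the second copy is the
    repair and stays fluent ("F"). Labels are a hypothetical scheme.
    """
    if degree < 1 or start < 0 or start + degree > len(tokens):
        raise ValueError("span out of range")
    span = tokens[start:start + degree]
    disfluent = tokens[:start] + span + tokens[start:]
    labels = ["F"] * start + ["D"] * degree + ["F"] * (len(tokens) - start)
    return disfluent, labels


def make_replacement(
    tokens: List[str],
    index: int,
    wrong_word: str,
    interregnum: Sequence[str] = ("uh", "no"),
) -> Labeled:
    """Insert a wrong word (reparandum) and an interregnum before tokens[index].

    The original token at `index` acts as the repair. `wrong_word` is given
    by hand here; choosing it automatically is outside this sketch.
    """
    if not 0 <= index < len(tokens):
        raise ValueError("index out of range")
    inserted = [wrong_word, *interregnum]
    disfluent = tokens[:index] + inserted + tokens[index:]
    labels = ["F"] * index + ["D"] * len(inserted) + ["F"] * (len(tokens) - index)
    return disfluent, labels


if __name__ == "__main__":
    fluent = "i want a flight to boston".split()
    for words, tags in (
        make_repetition(fluent, start=1, degree=2),
        make_replacement(fluent, index=5, wrong_word="denver"),
    ):
        print(" ".join(f"{w}/{t}" for w, t in zip(words, tags)))
```

Because each output token carries a label, pairs like these could in principle serve as token-level training examples for a detector without any hand-annotated disfluent data, which is the use case the abstract describes.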
Taxonomy
Topics: Hate Speech and Cyberbullying Detection · Sentiment Analysis and Opinion Mining · Natural Language Processing Techniques
