Artificial Disfluency Detection, Uh No, Disfluency Generation for the Masses

T. Passali; T. Mavropoulos; G. Tsoumakas; G. Meditskos; S. Vrochidis

arXiv:2211.09235·cs.CL·November 18, 2022

Open Access

TL;DR

This paper introduces LARD, a method for generating realistic artificial disfluencies from fluent text. The generated data can be used to train disfluency detection models, which is especially useful when annotated data is scarce.

Contribution

LARD is the first approach to automatically generate all types of disfluencies (repetitions, replacements and restarts) using contextual embeddings, enabling effective training without annotated disfluent datasets.

Findings

01

LARD produces realistic disfluencies that improve detection accuracy.

02

Using LARD-generated data enhances model performance with limited real data.

03

LARD effectively simulates various disfluency types in a context-aware manner.

Abstract

Existing approaches for disfluency detection typically require the existence of large annotated datasets. However, current datasets for this task are limited, suffer from class imbalance, and lack some types of disfluencies that can be encountered in real-world scenarios. This work proposes LARD, a method for automatically generating artificial disfluencies from fluent text. LARD can simulate all the different types of disfluencies (repetitions, replacements and restarts) based on the reparandum/interregnum annotation scheme. In addition, it incorporates contextual embeddings into the disfluency generation to produce realistic context-aware artificial disfluencies. Since the proposed method requires only fluent text, it can be used directly for training, bypassing the requirement of annotated disfluent data. Our empirical evaluation demonstrates that LARD can indeed be effectively used…
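To make the idea of generating disfluencies from fluent text concrete, here is a minimal sketch of the simplest case, a repetition, with token-level labels in the spirit of the reparandum/interregnum scheme. It is not the authors' implementation: the function names, the `F`/`RM`/`IM` label tags, and the random span selection are all assumptions made for illustration, and the contextual-embedding step that LARD uses is not modelled here.

```python
import random


def add_repetition(tokens, start, length, interregnum=None):
    """Repeat tokens[start:start+length] to simulate a repetition disfluency.

    Returns (disfluent_tokens, labels). Labels are hypothetical tags:
    'F' = fluent, 'RM' = reparandum, 'IM' = interregnum.
    """
    if length < 1 or start < 0 or start + length > len(tokens):
        raise ValueError("span out of range")
    span = tokens[start:start + length]
    filler = list(interregnum or [])
    # The first copy of the span is the reparandum; the original span that
    # follows acts as the repair and stays labelled as fluent text.
    disfluent = tokens[:start] + span + filler + tokens[start:]
    labels = (
        ["F"] * start
        + ["RM"] * length
        + ["IM"] * len(filler)
        + ["F"] * (len(tokens) - start)
    )
    return disfluent, labels


def random_repetition(sentence, max_len=3, seed=None):
    """Pick a random span of a whitespace-tokenised sentence and repeat it."""
    tokens = sentence.split()
    if not tokens:
        raise ValueError("sentence has no tokens")
    rng = random.Random(seed)
    length = rng.randint(1, min(max_len, len(tokens)))
    start = rng.randint(0, len(tokens) - length)
    return add_repetition(tokens, start, length)


if __name__ == "__main__":
    words = "i want a flight to boston".split()
    out, tags = add_repetition(words, start=1, length=2, interregnum=["uh"])
    print(" ".join(out))
    print(" ".join(tags))
```

Because the labels are produced together with the disfluent sentence, each generated pair can serve directly as a training example for a token-level detector, which is the property the abstract highlights when it says only fluent text is required.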

Taxonomy

Topics: Hate Speech and Cyberbullying Detection · Sentiment Analysis and Opinion Mining · Natural Language Processing Techniques