Distinguishing Repetition Disfluency from Morphological Reduplication in Bangla ASR Transcripts: A Novel Corpus and Benchmarking Analysis

Zaara Zabeen Arpa; Sadnam Sakib Apurbo; Nazia Karim Khan Oishee; Ajwad Abrar

arXiv:2511.13159·cs.CL·November 18, 2025

Open Access · 1 dataset

TL;DR

This paper introduces a new annotated Bangla corpus to distinguish between disfluency and morphological reduplication in ASR transcripts, benchmarking models to improve linguistic accuracy in low-resource language processing.

Contribution

The study presents the first publicly available Bangla corpus for disfluency and reduplication, along with benchmarking of LLMs and fine-tuned models for this task.

Findings

01

Fine-tuned BanglaBERT achieves 84.78% accuracy.

02

LLMs reach up to 82.68% accuracy with few-shot prompting.

03

The corpus enables better linguistic analysis of ASR errors in Bangla.

Abstract

Automatic Speech Recognition (ASR) transcripts, especially in low-resource languages like Bangla, contain a critical ambiguity: word-word repetitions can be either Repetition Disfluency (unintentional ASR error/hesitation) or Morphological Reduplication (a deliberate grammatical construct). Standard disfluency correction fails by erroneously deleting valid linguistic information. To solve this, we introduce the first publicly available, 20,000-row Bangla corpus, manually annotated to explicitly distinguish between these two phenomena in noisy ASR transcripts. We benchmark this novel resource using two paradigms: state-of-the-art multilingual Large Language Models (LLMs) and task-specific fine-tuning of encoder models. LLMs achieve competitive performance (up to 82.68% accuracy) with few-shot prompting. However, fine-tuning proves superior, with the language-specific BanglaBERT model…
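To make the ambiguity concrete, here is a minimal Python sketch that flags adjacent word-word repetitions in a transcript. It is not the authors' method: the function name `find_adjacent_repetitions` and both example sentences are assumptions chosen for illustration, and whitespace tokenization is a simplification. The sketch only surfaces candidates; deciding whether each one is a disfluency or a reduplication is the classification task the paper benchmarks.

```python
from typing import List, Tuple


def find_adjacent_repetitions(text: str) -> List[Tuple[int, str]]:
    """Return (token_index, token) for each token immediately followed by an identical token.

    Hypothetical helper for illustration; uses plain whitespace tokenization.
    """
    tokens = text.split()
    return [
        (i, tok)
        for i, (tok, nxt) in enumerate(zip(tokens, tokens[1:]))
        if tok == nxt
    ]


# Illustrative sentences (assumptions, not drawn from the corpus):
# the first repeats a pronoun, which looks like a hesitation;
# the second uses "ধীরে ধীরে" ("slowly"), a common Bangla reduplication.
examples = [
    "আমি আমি বাড়ি যাব",
    "সে ধীরে ধীরে হাঁটছে",
]

for sentence in examples:
    print(sentence, "->", find_adjacent_repetitions(sentence))
```

Both sentences yield exactly one candidate, and the surface pattern is identical in each case. This is why a rule that deletes every repeated word would also remove valid reduplications.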

Code & Models

Datasets

ajwad-abrar/BanglaReDup
dataset · 5 dl

Taxonomy

Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Topic Modeling