Topic Model Robustness to Automatic Speech Recognition Errors in Podcast Transcripts

Raluca Alexandra Fetic; Mikkel Jordahn; Lucas Chaves Lima; Rasmus Arpe Fogh Egebæk; Martin Carsten Nielsen; Benjamin Biering; Lars Kai Hansen

arXiv:2109.12306·cs.IR·September 28, 2021

TL;DR

This paper investigates how robust Latent Dirichlet Allocation topic models are when applied to automatic speech recognition transcripts of Danish podcasts, showing that high-quality topics can still be extracted despite transcription errors.

Contribution

It demonstrates the resilience of LDA topic modeling to ASR errors in low-resource language transcripts, providing insights for improved content relevance in multilingual podcast services.

Findings

01

Topic embeddings remain high quality despite increasing transcription noise.

02

Cosine similarity scores decrease with more ASR errors but stay informative.

03

Robustness of LDA supports use of automatic transcripts for content analysis.

Abstract

For a multilingual podcast streaming service, it is critical to be able to deliver relevant content to all users independent of language. Podcast content relevance is conventionally determined using various metadata sources. However, with the increasing quality of speech recognition in many languages, utilizing automatic transcriptions to provide better content recommendations becomes possible. In this work, we explore the robustness of a Latent Dirichlet Allocation topic model when applied to transcripts created by an automatic speech recognition engine. Specifically, we explore how increasing transcription noise influences topics obtained from transcriptions in Danish, a low-resource language. First, we observe a baseline of cosine similarity scores between topic embeddings from automatic transcriptions and the descriptions of the podcasts written by the podcast creators. We then…
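
The comparison described in the abstract can be illustrated with a small, hedged sketch in Python. It is not the authors' pipeline: the two-topic word table, the `topic_embedding` helper (a toy stand-in for LDA inference), and the substitution-only noise model in `corrupt_tokens` are all assumptions made for illustration. The sketch only shows how cosine similarity between a description's topic vector and a transcript's topic vector can be tracked as simulated transcription noise increases.

```python
import math
import random


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def corrupt_tokens(tokens, error_rate, vocabulary, seed=0):
    """Simulate transcription noise (assumption: substitution errors only).

    Each token is replaced by a random vocabulary word with probability
    error_rate. Real ASR output also contains insertions and deletions.
    """
    rng = random.Random(seed)
    return [
        rng.choice(vocabulary) if rng.random() < error_rate else token
        for token in tokens
    ]


def topic_embedding(tokens, word_topic_weights, num_topics):
    """Toy stand-in for LDA inference (hypothetical helper).

    Sums the per-topic weights of every known word and normalizes the result
    so that it sums to 1, like a document-topic distribution.
    """
    totals = [0.0] * num_topics
    for token in tokens:
        weights = word_topic_weights.get(token, [0.0] * num_topics)
        for k, weight in enumerate(weights):
            totals[k] += weight
    total = sum(totals)
    return [t / total for t in totals] if total > 0 else totals


# Hypothetical two-topic table: [sports, politics]. Not taken from the paper.
WORD_TOPIC_WEIGHTS = {
    "goal": [0.9, 0.1], "match": [0.8, 0.2], "team": [0.7, 0.3],
    "vote": [0.1, 0.9], "law": [0.2, 0.8], "minister": [0.1, 0.9],
}
VOCABULARY = list(WORD_TOPIC_WEIGHTS)

description = "team match goal".split()            # creator-written description
transcript = "team match goal goal match".split()  # clean transcript

description_vec = topic_embedding(description, WORD_TOPIC_WEIGHTS, 2)
for rate in (0.0, 0.3, 0.6):
    noisy = corrupt_tokens(transcript, rate, VOCABULARY, seed=1)
    noisy_vec = topic_embedding(noisy, WORD_TOPIC_WEIGHTS, 2)
    score = cosine_similarity(description_vec, noisy_vec)
    print(f"error rate {rate:.1f}: cosine similarity {score:.3f}")
```

In a real setting the topic vectors would come from a trained LDA model, and the noise would come from actual ASR output rather than random substitutions.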

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Music and Audio Processing
