On Barriers to Archival Audio Processing
Peter Sullivan, Muhammad Abdul-Mageed

TL;DR
This paper evaluates the robustness of modern off-the-shelf speech processing tools on archival mid-20th century radio recordings. Language identification holds up well, but speaker recognition is vulnerable to channel, age, and language biases in speaker embeddings.
Contribution
It provides an empirical assessment of current language identification (LID) and speaker recognition (SR) methods on historical recordings, revealing their capabilities and limitations in archival contexts.
Findings
LID systems like Whisper handle multilingual and accented speech well
Speaker embeddings are sensitive to channel, age, and language biases
SR methods need improvement before archives can rely on them for speaker indexing
Abstract
In this study, we leverage a unique UNESCO collection of mid-20th century radio recordings to probe the robustness of modern off-the-shelf language identification (LID) and speaker recognition (SR) methods, especially with respect to the impact of multilingual speakers and cross-age recordings. Our findings suggest that LID systems, such as Whisper, are increasingly adept at handling second-language and accented speech. However, speaker embeddings remain a fragile component of speech processing pipelines, prone to biases related to channel, age, and language. These issues will need to be overcome if archives aim to employ SR methods for speaker indexing.
