Joint vs Sequential Speaker-Role Detection and Automatic Speech Recognition for Air-traffic Control

Alexander Blatt; Aravind Krishnan; Dietrich Klakow

arXiv:2406.13842·cs.CL·June 21, 2024

Open Access

TL;DR

This paper introduces a transformer-based joint system for automatic speech recognition and speaker role detection in air-traffic control, outperforming traditional cascaded methods in certain scenarios.

Contribution

The paper presents a joint ASR-SRD transformer architecture that handles both tasks in a single model built on a standard ASR architecture, and identifies the cases in which it outperforms two cascaded approaches.

Findings

01 Joint system outperforms cascaded approaches in specific cases.

02 Acoustic and lexical differences impact architecture performance.

03 Strategies to mitigate these differences are proposed.

Abstract

Utilizing air-traffic control (ATC) data for downstream natural-language processing tasks requires preprocessing steps. Key steps are the transcription of the data via automatic speech recognition (ASR) and speaker diarization or, alternatively, speaker role detection (SRD) to divide the transcripts into pilot and air-traffic controller (ATCO) transcripts. While traditional approaches handle these tasks separately, we propose a transformer-based joint ASR-SRD system that solves both tasks at once while relying on a standard ASR architecture. We compare this joint system against two cascaded approaches for ASR and SRD on multiple ATC datasets. Our study shows in which cases our joint system can outperform the two traditional approaches and in which cases the other architectures are preferable. We additionally evaluate how acoustic and lexical differences influence all architectures and show…

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing