Safety-Oriented Evaluation of Language Understanding Systems for Air Traffic Control
Yujing Chang, Yash Guleria, Duc-Thinh Pham, Nhut-Huy Pham, Ningli Wang, Vu N. Duong, Sameer Alam

TL;DR
This paper introduces a safety-oriented evaluation framework for language models in air traffic control, revealing that current models have limited operational reliability despite high aggregate accuracy.
Contribution
It proposes a consequence-aware evaluation method tailored to ATC, highlighting the gap between aggregate metrics and real-world safety performance of LLMs.
Findings
The best-performing LLM reaches a Risk Score of only 0.69 on clean transcripts.
Most models have a Risk Score below 0.6 despite high macro-F1.
Errors are concentrated in high-impact entities, indicating grounding issues.
Abstract
Air Traffic Control (ATC) is a safety-critical domain in which incorrect interpretation of instructions may lead to severe operational consequences. While large language models (LLMs) demonstrate strong general performance, their reliability in operational ATC environments remains unclear. Existing evaluation approaches, largely based on aggregate metrics such as F1 or macro accuracy, treat all errors uniformly and fail to account for the asymmetric consequences of high-risk semantic mistakes (e.g., incorrect runway identifiers or movement constraints). To address this gap, we propose a safety-oriented, consequence-aware evaluation framework tailored to ATC operations. Our results reveal that while current LLMs achieve reasonable aggregate accuracy, their operational reliability is severely limited. Evaluated on clean transcripts, the peak Risk Score reaches only 0.69, with most models…
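The contrast the abstract draws between uniform aggregate metrics and consequence-aware scoring can be illustrated with a toy example. The sketch below is not the paper's Risk Score, whose definition is not given in this summary. The entity types, the severity weights, and the function names `uniform_accuracy` and `severity_weighted_score` are all hypothetical, chosen only to show how a single high-impact error can drag down a weighted score while plain accuracy stays high.

```python
from typing import Dict, List, Tuple

# Hypothetical severity weights per entity type. These values are
# illustrative assumptions, not taken from the paper.
SEVERITY: Dict[str, float] = {
    "runway": 4.0,
    "altitude": 2.0,
    "heading": 2.0,
    "callsign": 1.0,
    "greeting": 1.0,
}

# Each result is (entity_type, was_extracted_correctly).
Result = Tuple[str, bool]


def uniform_accuracy(results: List[Result]) -> float:
    """Fraction of entities extracted correctly, all errors counted equally."""
    return sum(1 for _, ok in results if ok) / len(results)


def severity_weighted_score(results: List[Result],
                            weights: Dict[str, float]) -> float:
    """Share of total severity weight carried by correctly extracted entities."""
    total = sum(weights[entity] for entity, _ in results)
    correct = sum(weights[entity] for entity, ok in results if ok)
    return correct / total


# Toy transcript: everything is right except the runway identifier.
results: List[Result] = [
    ("greeting", True),
    ("callsign", True),
    ("altitude", True),
    ("heading", True),
    ("runway", False),
]

accuracy = uniform_accuracy(results)
weighted = severity_weighted_score(results, SEVERITY)
print(f"uniform accuracy: {accuracy:.2f}")
print(f"severity-weighted score: {weighted:.2f}")
```

With these assumed weights, four of five entities are correct, so uniform accuracy is 0.80, but the missed runway carries 4 of the 10 total weight units, so the weighted score falls to 0.60. The specific numbers depend entirely on the made-up weights; only the direction of the gap is the point.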
