Safety-Oriented Evaluation of Language Understanding Systems for Air Traffic Control

Yujing Chang; Yash Guleria; Duc-Thinh Pham; Nhut-Huy Pham; Ningli Wang; Vu N. Duong; Sameer Alam

arXiv:2605.11769 · cs.CL · May 13, 2026

TL;DR

This paper introduces a safety-oriented evaluation framework for language models in air traffic control (ATC) and shows that current models have limited operational reliability despite reasonable aggregate accuracy.

Contribution

The paper proposes a consequence-aware evaluation method tailored to ATC, exposing the gap between aggregate metrics such as F1 and the safety-relevant performance of LLMs.

Findings

01. On clean transcripts, the best-performing LLM reaches a peak Risk Score of only 0.69.

02. Most models have a Risk Score below 0.6 despite high macro-F1.

03. Errors are concentrated in high-impact entities, indicating grounding issues.

Abstract

Air Traffic Control (ATC) is a safety-critical domain in which incorrect interpretation of instructions may lead to severe operational consequences. While large language models (LLMs) demonstrate strong general performance, their reliability in operational ATC environments remains unclear. Existing evaluation approaches, largely based on aggregate metrics such as F1 or macro accuracy, treat all errors uniformly and fail to account for the asymmetric consequences of high-risk semantic mistakes (e.g., incorrect runway identifiers or movement constraints). To address this gap, we propose a safety-oriented, consequence-aware evaluation framework tailored to ATC operations. Our results reveal that while current LLMs achieve reasonable aggregate accuracy, their operational reliability is severely limited. Evaluated on clean transcripts, the peak Risk Score reaches only 0.69, with most models…
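
The contrast the abstract draws between uniform aggregate metrics and consequence-aware scoring can be illustrated with a minimal sketch. This is not the paper's Risk Score, whose definition is not given here. The entity types, the severity weights in `SEVERITY`, and both function names are hypothetical, chosen only to show how a single high-impact error (a wrong runway identifier) barely moves a uniform metric but dominates a severity-weighted one.

```python
from typing import Dict, List, Tuple

# (entity_type, gold_value, predicted_value)
Entity = Tuple[str, str, str]

# Hypothetical severity weights per entity type; illustrative only,
# not taken from the paper.
SEVERITY: Dict[str, float] = {
    "callsign": 1.0,
    "runway": 3.0,
    "taxiway": 1.0,
    "greeting": 0.1,
}


def uniform_accuracy(entities: List[Entity]) -> float:
    """Fraction of entities predicted correctly; every error counts the same."""
    if not entities:
        raise ValueError("entities must not be empty")
    correct = sum(1 for _, gold, pred in entities if gold == pred)
    return correct / len(entities)


def severity_weighted_score(entities: List[Entity],
                            weights: Dict[str, float]) -> float:
    """Share of total severity weight carried by correctly predicted entities."""
    if not entities:
        raise ValueError("entities must not be empty")
    total = sum(weights[etype] for etype, _, _ in entities)
    earned = sum(weights[etype] for etype, gold, pred in entities if gold == pred)
    return earned / total


# One instruction with four extracted entities; only the runway is wrong.
example: List[Entity] = [
    ("callsign", "SIA123", "SIA123"),
    ("runway", "02L", "02R"),
    ("taxiway", "A7", "A7"),
    ("greeting", "good day", "good day"),
]

uniform = uniform_accuracy(example)
weighted = severity_weighted_score(example, SEVERITY)
```

With these assumed weights, the uniform accuracy is 3 of 4 entities, while the weighted score falls well below it because the one error sits on the heaviest entity type. Any real evaluation would need weights justified by operational consequences rather than picked by hand.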
