PilotBench: A Benchmark for General Aviation Agents with Safety Constraints

Yalun Wu, Haotian Liu, Zhoujun Li, and Boyang Wang

arXiv:2604.08987·cs.AI·April 13, 2026

TL;DR

PilotBench is a new benchmark for evaluating large language models on safety-critical general aviation tasks, highlighting a trade-off between semantic reasoning and numerical precision.

Contribution

The paper introduces PilotBench and Pilot-Score, providing a systematic framework for evaluating LLMs in physics-based, safety-critical aviation scenarios.

Findings

01. Traditional forecasters have lower MAE but lack semantic reasoning.

02. LLMs follow instructions well but have higher MAE in physics prediction.

03. Performance drops significantly in high-workload flight phases.
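
The findings compare models by mean absolute error (MAE) and note that performance varies by flight phase. As a minimal sketch of how such a per-phase comparison might be computed, the snippet below groups absolute prediction errors by phase. The function name `mae_by_phase`, the record layout, the phase labels, and the sample values are all illustrative assumptions, not taken from the paper or its data.

```python
from collections import defaultdict


def mae_by_phase(records):
    """Return {phase: MAE} for an iterable of (phase, predicted, actual) records.

    Hypothetical helper: the record layout is an assumption for illustration.
    """
    errors = defaultdict(list)
    for phase, predicted, actual in records:
        errors[phase].append(abs(predicted - actual))
    # Average the absolute errors collected for each phase.
    return {phase: sum(errs) / len(errs) for phase, errs in errors.items()}


# Made-up altitude predictions in feet; these are not PilotBench values.
sample = [
    ("cruise", 5000.0, 5002.0),
    ("cruise", 4998.0, 4997.0),
    ("landing", 110.0, 100.0),
]
per_phase = mae_by_phase(sample)
```

Reporting MAE per phase rather than as a single aggregate is what makes a phase-specific drop visible in the first place.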

Abstract

As Large Language Models (LLMs) advance toward embodied AI agents operating in physical environments, a fundamental question emerges: can models trained on text corpora reliably reason about complex physics while adhering to safety constraints? We address this through PilotBench, a benchmark evaluating LLMs on safety-critical flight trajectory and attitude prediction. Built from 708 real-world general aviation trajectories spanning nine operationally distinct flight phases with synchronized 34-channel telemetry, PilotBench systematically probes the intersection of semantic understanding and physics-governed prediction through comparative analysis of LLMs and traditional forecasters. We introduce Pilot-Score, a composite metric balancing 60% regression accuracy with 40% instruction adherence and safety compliance. Comparative evaluation across 41 models uncovers a…
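
The abstract states Pilot-Score's weighting (60% regression accuracy, 40% instruction adherence and safety compliance), but this excerpt does not give the formula. The sketch below shows one way a 60/40 weighted composite could be computed, assuming both components have already been normalized to [0, 1]. The function name `pilot_score_sketch`, the normalization, and the simple linear combination are assumptions, not the paper's definition.

```python
def pilot_score_sketch(regression_accuracy, adherence_safety,
                       w_regression=0.6, w_adherence=0.4):
    """Weighted composite of two scores, each assumed to lie in [0, 1].

    Assumption: a plain linear combination using the 60/40 weights
    quoted in the abstract; the paper's actual formula may differ.
    """
    for name, value in (("regression_accuracy", regression_accuracy),
                        ("adherence_safety", adherence_safety)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be in [0, 1], got {value}")
    return w_regression * regression_accuracy + w_adherence * adherence_safety


# Hypothetical inputs: middling regression accuracy, perfect adherence.
score = pilot_score_sketch(0.5, 1.0)
```

How an unbounded error such as MAE would be mapped to a bounded accuracy term is left open here, since the excerpt does not specify it.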
