PilotBench: A Benchmark for General Aviation Agents with Safety Constraints
Yalun Wu, Haotian Liu, Zhoujun Li, and Boyang Wang

TL;DR
PilotBench is a new benchmark for evaluating large language models on safety-critical general aviation tasks, highlighting a trade-off between semantic reasoning and numerical precision.
Contribution
The paper introduces PilotBench and Pilot-Score, which together provide a systematic evaluation framework for LLMs in physics-based, safety-critical aviation scenarios.
Findings
Traditional forecasters achieve lower mean absolute error (MAE) but lack semantic reasoning.
LLMs follow instructions well but have higher MAE in physics prediction.
Performance drops significantly in high-workload flight phases.
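As a rough illustration of how the per-phase comparison in these findings could be computed, the sketch below groups prediction errors by flight phase and reports the mean absolute error for each group. The record layout, the function name `mae_by_phase`, and the phase labels are assumptions made for this example; they are not taken from the benchmark's code or data.

```python
from collections import defaultdict


def mae_by_phase(records):
    """Return the mean absolute error per flight phase.

    `records` is an iterable of (phase, predicted, actual) tuples.
    This layout is hypothetical, chosen only for illustration.
    """
    totals = defaultdict(float)  # sum of absolute errors per phase
    counts = defaultdict(int)    # number of samples per phase
    for phase, predicted, actual in records:
        totals[phase] += abs(predicted - actual)
        counts[phase] += 1
    return {phase: totals[phase] / counts[phase] for phase in totals}


# Made-up values; the phase names are placeholders, not the paper's nine phases.
sample = [
    ("cruise", 100.0, 101.0),
    ("cruise", 100.0, 99.0),
    ("approach", 50.0, 54.0),
]
per_phase = mae_by_phase(sample)
```

Breaking the error down this way makes a phase-specific degradation visible that a single aggregate MAE would average away.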
Abstract
As Large Language Models (LLMs) advance toward embodied AI agents operating in physical environments, a fundamental question emerges: can models trained on text corpora reliably reason about complex physics while adhering to safety constraints? We address this through PilotBench, a benchmark evaluating LLMs on safety-critical flight trajectory and attitude prediction. Built from 708 real-world general aviation trajectories spanning nine operationally distinct flight phases with synchronized 34-channel telemetry, PilotBench systematically probes the intersection of semantic understanding and physics-governed prediction through comparative analysis of LLMs and traditional forecasters. We introduce Pilot-Score, a composite metric balancing 60% regression accuracy with 40% instruction adherence and safety compliance. Comparative evaluation across 41 models uncovers a…
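The 60/40 weighting behind Pilot-Score can be sketched as a simple weighted sum. This is a minimal sketch, assuming both components have already been normalised to the range [0, 1]; the excerpt does not say how regression error is turned into an accuracy score or how instruction adherence and safety compliance are combined, so the function name `pilot_score` and both input names are hypothetical.

```python
def pilot_score(regression_accuracy, compliance):
    """Combine two normalised scores with the 60/40 weighting from the abstract.

    Assumption: both inputs lie in [0, 1]. `compliance` stands in for the
    combined instruction-adherence and safety-compliance component; how the
    paper derives either input is not specified in this excerpt.
    """
    for name, value in (("regression_accuracy", regression_accuracy),
                        ("compliance", compliance)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be in [0, 1], got {value}")
    return 0.6 * regression_accuracy + 0.4 * compliance


# A model with perfect regression accuracy but only half-compliant outputs.
example = pilot_score(1.0, 0.5)
```

Under this weighting, a model that predicts trajectories perfectly but ignores output and safety constraints can score at most 0.6, which is consistent with the trade-off the benchmark is designed to expose.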
