Using large language models for embodied planning introduces systematic safety risks
Tao Zhang, Kaixian Qu, Zhibin Li, Jiajun Wu, Marco Hutter, Manling Li, Fan Shi

TL;DR
This paper evaluates the safety of large language models used as planners in robotics and finds that stronger planning ability does not by itself bring better safety awareness, which remains a key challenge for deployment.
Contribution
Introduces DESPITE, a comprehensive benchmark for safety in language-model planning, and analyzes how safety awareness and planning ability scale across models.
Findings
Planning ability improves significantly with scale.
Safety awareness remains relatively flat across models.
Larger models complete more tasks safely mainly through better planning.
Abstract
Large language models are increasingly used as planners for robotic systems, yet how safely they plan remains an open question. To evaluate safe planning systematically, we introduce DESPITE, a benchmark of 12,279 tasks spanning physical and normative dangers with fully deterministic validation. Across 23 models, even near-perfect planning ability does not ensure safety: the best-planning model fails to produce a valid plan on only 0.4% of tasks but produces dangerous plans on 28.3%. Among 18 open-source models from 3B to 671B parameters, planning ability improves substantially with scale (0.4-99.3%) while safety awareness remains relatively flat (38-57%). We identify a multiplicative relationship between these two capacities, showing that larger models complete more tasks safely primarily through improved planning, not through better danger avoidance. Three proprietary reasoning models…
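The multiplicative relationship mentioned in the abstract can be illustrated with a small sketch. It is not the authors' code. It assumes that planning ability is the fraction of tasks with a valid plan, that safety awareness is the fraction of valid plans that are not dangerous, and that the 0.4% and 28.3% figures are both shares of all tasks. The function names are hypothetical.

```python
def safe_completion_rate(planning_ability: float, safety_awareness: float) -> float:
    """Assumed model: safe completion = planning ability * safety awareness."""
    for name, value in (("planning_ability", planning_ability),
                        ("safety_awareness", safety_awareness)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be in [0, 1], got {value}")
    return planning_ability * safety_awareness


def decompose(invalid_rate: float, dangerous_rate: float) -> tuple[float, float, float]:
    """Split task-level outcome rates into the two assumed factors.

    Assumes every task ends in exactly one of three outcomes:
    no valid plan, a valid but dangerous plan, or a valid and safe plan.
    """
    planning = 1.0 - invalid_rate          # share of tasks with a valid plan
    safe = planning - dangerous_rate       # share of tasks completed safely
    awareness = safe / planning if planning > 0 else 0.0
    return planning, awareness, safe


# Best-planning model from the abstract: 0.4% invalid, 28.3% dangerous.
planning, awareness, safe = decompose(invalid_rate=0.004, dangerous_rate=0.283)
print(f"planning={planning:.3f} awareness={awareness:.3f} safe={safe:.3f}")

# Spread of each factor across the open-source models, using the abstract's
# reported ranges. Pairing the range endpoints is only an illustration.
planning_gain = 0.993 / 0.004
awareness_gain = 0.57 / 0.38
print(f"planning spans {planning_gain:.0f}x, awareness spans {awareness_gain:.1f}x")
```

Under these assumptions the best-planning model completes about 71.3% of tasks safely, and the planning factor varies across models by roughly two orders of magnitude more than the safety factor, which is consistent with the claim that gains in safe completion come mainly from planning.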
