Distill to Think, Foresee to Act: Cognitive-Physical Reinforcement Learning for Autonomous Driving

Yang Wu; Qiang Meng; Zhaojiang Liu; Youquan Liu; Jian Yang; Jin Xie

arXiv:2605.21139·cs.CV·May 21, 2026

TL;DR

This paper introduces CoPhy, a reinforcement learning framework for autonomous driving that combines cognitive understanding and foresight of physical consequences to improve safety and interpretability.

Contribution

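It supplies the two pieces of infrastructure that reinforcement learning for driving needs: a cognitive module distilled from a VLM into the BEV encoder, and an auto-regressive BEV world model that predicts future semantic maps, together enabling safer and more flexible driving policies.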

Findings

01. Achieves state-of-the-art results on NAVSIM benchmarks.

02. Enables safer driving through cognitive scene compliance.

03. Supports flexible intent control via language commands.

Abstract

Current end-to-end autonomous driving models are fundamentally constrained by the behavioral cloning ceiling of imitation learning. While reinforcement learning offers a path to smarter autonomy, it demands two missing pieces of infrastructure: (1) a cognitive foundation that understands traffic semantics and driving intent, and (2) a foresighted physical environment that can anticipate the consequences of candidate actions. To this end, we propose CoPhy, a Cognitive-Physical reinforcement learning framework for autonomous driving. To distill to think, we distill VLM knowledge into the BEV encoder and then discard the VLM entirely, retaining cognitive ability at zero inference cost while releasing the cognitive channel as a pluggable interface for optional human language commands. To foresee to act, we build an auto-regressive BEV world model that explicitly predicts future semantic maps…
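The two mechanisms named in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the abstract does not state which distillation loss or world-model architecture CoPhy uses, so the cosine-distance loss, the function names `cosine_distillation_loss` and `rollout`, and the scalar toy state are all assumptions made purely for illustration.

```python
import math
from typing import Callable, List, Sequence, TypeVar

State = TypeVar("State")
Action = TypeVar("Action")


def cosine_distillation_loss(student: Sequence[float], teacher: Sequence[float]) -> float:
    """Return 1 - cosine similarity between a student and a teacher feature.

    Assumption: a cosine-distance objective stands in for whatever loss
    the paper actually uses to align BEV features with VLM features.
    """
    if len(student) != len(teacher):
        raise ValueError("student and teacher features must have the same length")
    dot = sum(s * t for s, t in zip(student, teacher))
    norm_s = math.sqrt(sum(s * s for s in student))
    norm_t = math.sqrt(sum(t * t for t in teacher))
    if norm_s == 0.0 or norm_t == 0.0:
        raise ValueError("features must be non-zero")
    return 1.0 - dot / (norm_s * norm_t)


def rollout(
    step: Callable[[State, Action], State],
    state: State,
    actions: Sequence[Action],
) -> List[State]:
    """Auto-regressive rollout: each predicted state is fed back as the next input.

    `step` is a hypothetical one-step predictor; in a real system it would
    be a learned model over BEV semantic maps rather than a toy function.
    """
    states: List[State] = []
    for action in actions:
        state = step(state, action)
        states.append(state)
    return states


# Toy usage: the "state" is a 1-D position and the "action" a displacement.
loss = cosine_distillation_loss([1.0, 0.0], [2.0, 0.0])
future = rollout(lambda position, delta: position + delta, 0.0, [1.0, 2.0, 3.0])
```

Once training ends, the teacher features are no longer needed, which matches the abstract's point that the VLM is discarded at inference. The rollout shows why an auto-regressive predictor can evaluate a whole candidate action sequence before any action is executed.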
