BEVLM: Distilling Semantic Knowledge from LLMs into Bird's-Eye View Representations
Thomas Monninger, Shaoyuan Xie, Qi Alfred Chen, Sihao Ding

TL;DR
BEVLM introduces a novel framework that combines spatially consistent Bird's-Eye View representations with Large Language Models, enhancing semantic reasoning and driving performance in autonomous systems.
Contribution
This work bridges the gap between BEV spatial representations and LLMs by distilling semantic knowledge, enabling more effective multi-view reasoning in autonomous driving.
Findings
Feeding BEV features to LLMs as input improves reasoning accuracy by 46%.
Distilling LLM knowledge into the BEV representation improves safety in critical scenarios by 29%.
BEVLM achieves better spatial and semantic integration for autonomous driving.
Abstract
The integration of Large Language Models (LLMs) into autonomous driving has attracted growing interest for their strong reasoning and semantic understanding abilities, which are essential for handling complex decision-making and long-tail scenarios. However, existing methods typically feed LLMs with tokens from multi-view and multi-frame images independently, leading to redundant computation and limited spatial consistency. This separation in visual processing hinders accurate 3D spatial reasoning and fails to maintain geometric coherence across views. On the other hand, Bird's-Eye View (BEV) representations learned from geometrically annotated tasks (e.g., object detection) provide spatial structure but lack the semantic richness of foundation vision encoders. To bridge this gap, we propose BEVLM, a framework that connects a spatially consistent and semantically distilled BEV…
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning
