OccLLaMA: An Occupancy-Language-Action Generative World Model for Autonomous Driving

Julong Wei; Shanshuai Yuan; Pengfei Li; Qingda Hu; Zhongxue Gan; Wenchao Ding

arXiv:2409.03272·cs.CV·September 6, 2024·2 cites

Open Access

TL;DR

OccLLaMA introduces a unified generative world model for autonomous driving that leverages semantic occupancy, vision, language, and action modalities to improve scene understanding, prediction, and planning.

Contribution

It proposes a novel occupancy-language-action generative model with a scene tokenizer and a unified multi-modal vocabulary, enabling multi-task autonomous driving capabilities.

Findings

01 Achieves competitive performance in 4D occupancy forecasting

02 Effective in motion planning tasks

03 Demonstrates versatility in visual question answering

Abstract

The rise of multi-modal large language models (MLLMs) has spurred their application in autonomous driving. Recent MLLM-based methods perform actions by learning a direct mapping from perception to action, neglecting the dynamics of the world and the relations between action and world dynamics. In contrast, human beings possess a world model that enables them to simulate future states based on a 3D internal visual representation and plan actions accordingly. To this end, we propose OccLLaMA, an occupancy-language-action generative world model, which uses semantic occupancy as a general visual representation and unifies vision-language-action (VLA) modalities through an autoregressive model. Specifically, we introduce a novel VQVAE-like scene tokenizer to efficiently discretize and reconstruct semantic occupancy scenes, considering their sparsity and class imbalance. Then, we build a…
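The discretization step that a VQVAE-style tokenizer relies on can be sketched as a nearest-neighbor lookup into a codebook. This is only a minimal illustration of generic vector quantization, not the authors' tokenizer: the codebook values, the two-dimensional features, and the function names `quantize` and `dequantize` are all hypothetical, and the sketch ignores the sparsity and class-imbalance handling that the abstract mentions.

```python
from typing import List, Sequence


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(features: Sequence[Sequence[float]],
             codebook: Sequence[Sequence[float]]) -> List[int]:
    """Map each feature vector to the index of its nearest codebook entry."""
    tokens = []
    for feature in features:
        distances = [squared_distance(feature, code) for code in codebook]
        tokens.append(distances.index(min(distances)))
    return tokens


def dequantize(tokens: Sequence[int],
               codebook: Sequence[Sequence[float]]) -> List[List[float]]:
    """Replace each token index with its codebook vector (lossy reconstruction)."""
    return [list(codebook[t]) for t in tokens]


# Hypothetical toy codebook with four 2-D entries (assumption, for illustration).
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# Hypothetical encoder outputs, standing in for features of occupancy patches.
features = [[0.1, -0.2], [0.9, 0.8], [0.2, 0.7]]

tokens = quantize(features, codebook)
reconstructed = dequantize(tokens, codebook)
print(tokens)         # discrete scene tokens
print(reconstructed)  # codebook vectors that a decoder would consume
```

Once a scene is expressed as integer tokens like these, it can be placed in the same sequence as language and action tokens, which is the kind of shared vocabulary that an autoregressive model operates on.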

Taxonomy

Topics: Semantic Web and Ontologies · Data Management and Algorithms · Human Motion and Animation

Methods: LLaMA