SimpleLLM4AD: An End-to-End Vision-Language Model with Graph Visual Question Answering for Autonomous Driving

Peiru Zheng; Yun Zhao; Zhan Gong; Hong Zhu; Shaohua Wu

arXiv:2407.21293·cs.CV·August 1, 2024

Open Access

TL;DR

This paper introduces SimpleLLM4AD, an end-to-end vision-language model for autonomous driving that uses graph-structured visual question answering to integrate perception, prediction, planning, and behavior stages.

Contribution

It presents a novel graph-based VQA framework that enables language-based reasoning across all autonomous driving stages, integrating vision transformers and large language models.

Findings

01. Achieves competitive performance in complex driving scenarios

02. Effectively integrates multi-stage reasoning with VQA and graph structures

03. Demonstrates the potential of language models in autonomous driving

Abstract

Many fields could benefit from the rapid development of large language models (LLMs). End-to-end autonomous driving (e2eAD) is one such field, facing new opportunities as LLMs support more and more modalities. Here, using a vision-language model (VLM), we propose an e2eAD method called SimpleLLM4AD. In our method, the e2eAD task is divided into four stages: perception, prediction, planning, and behavior. Each stage consists of several visual question answering (VQA) pairs, and the VQA pairs interconnect with each other to form a graph called Graph VQA (GVQA). By reasoning over each VQA pair in the GVQA with the VLM, stage by stage, our method can achieve e2e driving with language. Vision transformer (ViT) models are employed to process nuScenes visual data, while the VLM is used to interpret and reason about the information…
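
The stage-by-stage reasoning over a graph of VQA pairs described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the `QANode` class, the `run_gvqa` function, and the `stub_vlm` stand-in are hypothetical names introduced here, and the choice to pass each parent's question and answer as context to its child is an assumption about how the graph edges might be used. A real system would replace `stub_vlm` with a call to a vision-language model that also receives image features.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

# The four stages named in the abstract, in reasoning order.
STAGES = ["perception", "prediction", "planning", "behavior"]


@dataclass
class QANode:
    """One question in the graph (hypothetical structure, not from the paper)."""
    node_id: str
    stage: str
    question: str
    parents: List[str] = field(default_factory=list)  # ids of upstream nodes


Context = List[Tuple[str, str]]  # (parent question, parent answer) pairs


def run_gvqa(nodes: List[QANode],
             answer_fn: Callable[[str, Context], str]) -> Dict[str, str]:
    """Answer every node in stage order, feeding parent QA pairs as context."""
    by_id = {n.node_id: n for n in nodes}
    answers: Dict[str, str] = {}
    # sorted() is stable, so nodes within one stage keep their input order.
    for node in sorted(nodes, key=lambda n: STAGES.index(n.stage)):
        missing = [p for p in node.parents if p not in answers]
        if missing:
            raise ValueError(f"{node.node_id} depends on unanswered {missing}")
        context = [(by_id[p].question, answers[p]) for p in node.parents]
        answers[node.node_id] = answer_fn(node.question, context)
    return answers


def stub_vlm(question: str, context: Context) -> str:
    """Placeholder for a VLM call; only reports how much context it received."""
    return f"{question} [ctx={len(context)}]"
```

In this sketch a planning question can depend on a prediction question, which in turn depends on a perception question, so later stages see the answers produced by earlier ones. A parent placed in a later stage than its child raises a `ValueError` rather than being silently skipped.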

Taxonomy

Topics: Multimodal Machine Learning Applications · Topic Modeling · Natural Language Processing Techniques