VLA-MP: A Vision-Language-Action Framework for Multimodal Perception and Physics-Constrained Action Generation in Autonomous Driving

Maoning Ge; Kento Ohtani; Yingjie Niu; Yuxiao Zhang; Kazuya Takeda

PMC · DOI:10.3390/s25196163 · October 5, 2025

Open Access

TL;DR

VLA-MP is a vision-language-action framework for autonomous driving that combines multimodal Bird’s-Eye View perception, vision-language alignment, and a physics-informed adapter to generate physically feasible driving actions.

Contribution

Introduces VLA-MP, a unified framework with physics-informed action generation and language-conditioned perception for autonomous driving.

Findings

01 VLA-MP outperforms recent methods on the LangAuto benchmark with high driving scores and low infraction rates.

02 The framework successfully follows complex language instructions and adapts to dynamic environments.

03 Combining multimodal perception and physics-aware adapters improves safety and interpretability in autonomous driving.

Abstract

Autonomous driving in complex real-world environments requires robust perception, reasoning, and physically feasible planning, all of which remain challenging for current end-to-end approaches. This paper introduces VLA-MP, a unified vision-language-action framework that integrates multimodal Bird’s-Eye View (BEV) perception, vision-language alignment, and a GRU-bicycle dynamics cascade adapter for physics-informed action generation. The system constructs structured environmental representations from RGB images and LiDAR, aligns scene features with natural language instructions through a cross-modal projector and a large language model, and converts the language model's high-level semantic hidden states into executable, physically consistent trajectories. Experiments on the LMDrive dataset and the CARLA simulator demonstrate that VLA-MP achieves high performance across the LangAuto benchmark series, with best…
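To make the idea of a bicycle-dynamics stage more concrete, the following is a minimal sketch of a standard kinematic bicycle model rolled out over a sequence of controls. It is not the paper's adapter: in VLA-MP the controls would come from a GRU driven by the language model's hidden states, whereas here they are supplied by hand. The names `State`, `bicycle_step`, and `rollout`, as well as the wheelbase, time step, and steering limit, are all illustrative assumptions.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass
class State:
    x: float    # position along the world x-axis (m)
    y: float    # position along the world y-axis (m)
    yaw: float  # heading angle (rad)
    v: float    # forward speed (m/s)


def bicycle_step(s: State, accel: float, steer: float,
                 wheelbase: float = 2.7, dt: float = 0.1,
                 max_steer: float = 0.6) -> State:
    """One forward-Euler step of the kinematic bicycle model.

    wheelbase, dt, and max_steer are assumed values, not taken from the paper.
    """
    # Clamp the steering angle so the command stays within a feasible range.
    steer = max(-max_steer, min(max_steer, steer))
    x = s.x + s.v * math.cos(s.yaw) * dt
    y = s.y + s.v * math.sin(s.yaw) * dt
    yaw = s.yaw + s.v / wheelbase * math.tan(steer) * dt
    v = max(0.0, s.v + accel * dt)  # no reversing in this sketch
    return State(x, y, yaw, v)


def rollout(s0: State, controls: Sequence[Tuple[float, float]],
            **kwargs) -> List[Tuple[float, float]]:
    """Integrate (accel, steer) pairs into a list of (x, y) waypoints."""
    waypoints = []
    s = s0
    for accel, steer in controls:
        s = bicycle_step(s, accel, steer, **kwargs)
        waypoints.append((s.x, s.y))
    return waypoints


if __name__ == "__main__":
    start = State(x=0.0, y=0.0, yaw=0.0, v=5.0)
    # Hand-written controls: constant speed, gentle left turn.
    for wp in rollout(start, [(0.0, 0.1)] * 5):
        print(wp)
```

Because every waypoint is obtained by integrating the vehicle model, the resulting trajectory cannot contain sideways jumps or turns tighter than the steering limit allows, which is the general motivation for placing a dynamics model after a learned control predictor.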

Taxonomy

Topics: Advanced Neural Network Applications · Autonomous Vehicle Technology and Safety · Multimodal Machine Learning Applications