# VLA-MP: A Vision-Language-Action Framework for Multimodal Perception and Physics-Constrained Action Generation in Autonomous Driving

**Authors:** Maoning Ge, Kento Ohtani, Yingjie Niu, Yuxiao Zhang, Kazuya Takeda

PMC · DOI: 10.3390/s25196163 · 2025-10-05

## TL;DR

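VLA-MP is a vision-language-action framework for autonomous driving that couples multimodal Bird's-Eye View (BEV) perception from RGB images and LiDAR with language-instruction alignment and a physics-constrained GRU-bicycle adapter for action generation.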

## Contribution

Introduces VLA-MP, a unified framework with physics-informed action generation and language-conditioned perception for autonomous driving.

## Key findings

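- VLA-MP outperforms recent VLA methods such as LMDrive and AD-H across the LangAuto benchmark series, with best driving scores of 44.3 (LangAuto), 63.5 (LangAuto-Short), and 78.4 (LangAuto-Tiny) and infraction scores of 0.89–0.95.
- Visualization and video results show the framework following complex language-conditioned instructions, adapting to dynamic environments, and prioritizing safety.
- The results point to the potential of combining multimodal perception, language reasoning, and physics-aware adapters for robust and interpretable autonomous driving.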
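
The summary does not define its metrics. As a rough illustration, assuming the CARLA Leaderboard-style convention (per-route driving score = route completion × infraction score, averaged over routes, where an infraction score closer to 1.0 means fewer infractions), the aggregation could be sketched as follows. The function name and the sample values are hypothetical and are not taken from the paper.

```python
def driving_score(routes):
    """Average per-route driving score.

    `routes` is a list of (route_completion_percent, infraction_score) pairs.
    Assumption: each route's score is completion * infraction score, and the
    benchmark score is the mean over routes (CARLA Leaderboard convention).
    """
    if not routes:
        raise ValueError("at least one route is required")
    return sum(rc * inf for rc, inf in routes) / len(routes)


# Illustrative values only: one clean full run, one half-completed run
# whose infraction score was halved by penalties.
example_routes = [(100.0, 1.0), (50.0, 0.5)]
score = driving_score(example_routes)
```

Under this convention a high infraction score and a high driving score are both desirable, which is why the two are reported together.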

## Abstract

Autonomous driving in complex real-world environments requires robust perception, reasoning, and physically feasible planning, which remain challenging for current end-to-end approaches. This paper introduces VLA-MP, a unified vision-language-action framework that integrates multimodal Bird’s-Eye View (BEV) perception, vision-language alignment, and a GRU-bicycle dynamics cascade adapter for physics-informed action generation. The system constructs structured environmental representations from RGB images and LiDAR, aligns scene features with natural language instructions through a cross-modal projector and large language model, and converts the model's high-level semantic hidden states into executable and physically consistent trajectories. Experiments on the LMDrive dataset and CARLA simulator demonstrate that VLA-MP achieves high performance across the LangAuto benchmark series, with best driving scores of 44.3, 63.5, and 78.4 on LangAuto, LangAuto-Short, and LangAuto-Tiny, respectively, while maintaining high infraction scores of 0.89–0.95, outperforming recent VLA methods such as LMDrive and AD-H. Visualization and video results further validate the framework’s ability to follow complex language-conditioned instructions, adapt to dynamic environments, and prioritize safety. These findings highlight the potential of combining multimodal perception, language reasoning, and physics-aware adapters for robust and interpretable autonomous driving.
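
The abstract names a bicycle dynamics stage but does not give its equations. As a minimal sketch of how such a stage can keep trajectories physically consistent, the following assumes the standard rear-axle kinematic bicycle model with forward-Euler integration. The wheelbase, time step, steering limit, and all names here are hypothetical illustration choices, not the paper's implementation; in the described cascade, the control inputs would presumably come from the GRU stage.

```python
import math
from dataclasses import dataclass


@dataclass
class State:
    x: float    # position along the world x-axis (m)
    y: float    # position along the world y-axis (m)
    yaw: float  # heading angle (rad)
    v: float    # forward speed (m/s)


def bicycle_step(s, accel, steer, wheelbase=2.9, dt=0.1,
                 max_steer=math.radians(35)):
    """Advance one step of a kinematic bicycle model (assumed form)."""
    # Clip steering to a feasible range so the path curvature stays bounded.
    steer = max(-max_steer, min(max_steer, steer))
    x = s.x + s.v * math.cos(s.yaw) * dt
    y = s.y + s.v * math.sin(s.yaw) * dt
    yaw = s.yaw + s.v / wheelbase * math.tan(steer) * dt
    v = max(0.0, s.v + accel * dt)  # no reversing in this sketch
    return State(x, y, yaw, v)


def rollout(s0, controls, **kwargs):
    """Integrate a sequence of (accel, steer) controls into a trajectory."""
    traj = [s0]
    for accel, steer in controls:
        traj.append(bicycle_step(traj[-1], accel, steer, **kwargs))
    return traj


# Usage: ten steps of constant speed with a gentle left steer.
trajectory = rollout(State(0.0, 0.0, 0.0, 10.0), [(0.0, 0.1)] * 10)
```

Because every waypoint is produced by integrating bounded controls through the vehicle model, the resulting trajectory cannot contain jumps or turns the vehicle could not execute, which is the general motivation for placing a dynamics model after a learned predictor.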

## Figures

7 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12526522/full.md

---
Source: https://tomesphere.com/paper/PMC12526522