A Vision-Language-Action Model with Visual Prompt for OFF-Road Autonomous Driving

Liangdong Zhang; Yiming Nie; Haoyang Li; Fanjie Kong; Baobao Zhang; Shunxin Huang; Kai Fu; Chen Min; Liang Xiao

arXiv:2601.03519·cs.RO·January 13, 2026

Open Access

TL;DR

This paper introduces OFF-EMMA, an end-to-end multimodal framework with visual prompts and a chain-of-thought reasoning strategy, significantly improving off-road autonomous vehicle trajectory planning accuracy and robustness.

Contribution

The paper proposes a novel visual prompt block and COT-SC reasoning strategy to enhance spatial perception and reasoning in off-road autonomous driving models.

Findings

01 Outperforms existing methods on the RELLIS-3D dataset

02 Reduces average L2 error by 13.3%

03 Decreases failure rate from 16.52% to 6.56%
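
To make the reported figures concrete, here is a minimal sketch of how an average L2 error and a relative reduction are commonly computed for planned trajectories. It assumes trajectories are lists of (x, y) waypoints compared at matching time steps; the function names and the sample waypoints are hypothetical and are not taken from the paper, whose exact evaluation protocol is not given on this page.

```python
import math


def average_l2_error(predicted, ground_truth):
    """Mean Euclidean distance between matching (x, y) waypoints.

    Assumption: both trajectories are sampled at the same time steps.
    """
    if not predicted or len(predicted) != len(ground_truth):
        raise ValueError("trajectories must be non-empty and of equal length")
    distances = [math.dist(p, g) for p, g in zip(predicted, ground_truth)]
    return sum(distances) / len(distances)


def relative_reduction(before, after):
    """Fractional decrease from `before` to `after`."""
    return (before - after) / before


# Hypothetical waypoints, for illustration only.
predicted = [(1.0, 0.0), (2.0, 0.5), (3.0, 1.0)]
ground_truth = [(1.0, 0.0), (2.0, 0.0), (3.0, 0.0)]

print(average_l2_error(predicted, ground_truth))

# The failure rates quoted above, expressed as a relative reduction.
print(round(relative_reduction(16.52, 6.56), 3))
```

Read this way, the drop in failure rate from 16.52% to 6.56% is a relative reduction of roughly 60%.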

Abstract

Efficient trajectory planning in off-road terrains presents a formidable challenge for autonomous vehicles, often necessitating complex multi-step pipelines. However, traditional approaches exhibit limited adaptability in dynamic environments. To address these limitations, this paper proposes OFF-EMMA, a novel end-to-end multimodal framework designed to overcome the deficiencies of insufficient spatial perception and unstable reasoning in visual-language-action (VLA) models for off-road autonomous driving scenarios. The framework explicitly annotates input images through the design of a visual prompt block and introduces a chain-of-thought with self-consistency (COT-SC) reasoning strategy to enhance the accuracy and robustness of trajectory planning. The visual prompt block utilizes semantic segmentation masks as visual prompts, enhancing the spatial understanding ability of pre-trained…
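
The self-consistency idea mentioned in the abstract can be sketched as follows: sample several candidate trajectories from the model, then keep the one that agrees best with the others. The aggregation rule used here (choosing the candidate closest to the per-waypoint median) is an assumption made for illustration; the abstract does not state how OFF-EMMA combines its reasoning samples, and `select_consistent_trajectory` is a hypothetical name.

```python
import math
from statistics import median


def select_consistent_trajectory(candidates):
    """Return the candidate closest to the per-waypoint median.

    Assumption: each candidate is a list of (x, y) waypoints with the
    same horizon, e.g. one per sampled chain-of-thought run.
    """
    if not candidates:
        raise ValueError("need at least one candidate trajectory")
    horizon = len(candidates[0])
    if any(len(c) != horizon for c in candidates):
        raise ValueError("all candidates must have the same length")

    # Consensus trajectory: median x and median y at each time step.
    consensus = [
        (
            median(c[t][0] for c in candidates),
            median(c[t][1] for c in candidates),
        )
        for t in range(horizon)
    ]

    def distance_to_consensus(trajectory):
        return sum(math.dist(p, q) for p, q in zip(trajectory, consensus))

    return min(candidates, key=distance_to_consensus)


# Hypothetical samples: two agree closely, one is an outlier.
samples = [
    [(1.0, 0.0), (2.0, 0.0)],
    [(1.0, 0.1), (2.0, 0.2)],
    [(1.0, 3.0), (2.0, 6.0)],
]
chosen = select_consistent_trajectory(samples)
print(chosen)
```

Using the median rather than the mean keeps a single divergent sample from pulling the consensus away from the majority, which is the usual motivation for self-consistency voting.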

Taxonomy

Topics: Multimodal Machine Learning Applications · Autonomous Vehicle Technology and Safety · Robotic Path Planning Algorithms