WorldVLM: Combining World Model Forecasting and Vision-Language Reasoning

Stefan Englmeier; Katharina Winter; Fabian B. Flohr

arXiv:2603.14497·cs.CV·March 18, 2026

Open Access

TL;DR

WorldVLM is a hybrid model that combines vision-language reasoning with environmental dynamics prediction to improve autonomous driving decision-making and scene understanding.

Contribution

It introduces a novel hybrid architecture unifying vision-language models with world models for enhanced autonomous driving capabilities.

Findings

1. VLMs provide strong contextual reasoning but limited spatial understanding.

2. WMs effectively predict environmental dynamics and scene evolution.

3. The hybrid approach enables interpretable, context-aware driving actions.

Abstract

Autonomous driving systems depend on models that can reason about high-level scene contexts and accurately predict the dynamics of their surrounding environment. Vision-Language Models (VLMs) have recently emerged as promising tools for decision-making and scene understanding, offering strong capabilities in contextual reasoning. However, their limited spatial comprehension constrains their effectiveness as end-to-end driving models. World Models (WMs) internalize environmental dynamics to predict future scene evolution. Recently explored as ego-motion predictors and foundation models for autonomous driving, they represent a promising direction for addressing key challenges in the field, particularly enhancing generalization while maintaining dynamic prediction. To leverage the complementary strengths of context-based decision making and prediction, we propose WorldVLM: A hybrid…

Taxonomy

Topics: Multimodal Machine Learning Applications · Autonomous Vehicle Technology and Safety · Generative Adversarial Networks and Image Synthesis