Grounding Driving VLA via Inverse Kinematics

Junsung Park; Hyunjung Shim

arXiv:2605.21061 · cs.CV · May 21, 2026

TL;DR

This paper recasts trajectory prediction in driving vision-language-action models (VLAs) as an inverse kinematics problem to improve visual grounding, allowing a 0.5B model to perform comparably to much larger ones.

Contribution

The paper redesigns Driving VLAs around two components: a next visual state prediction objective, and a separate Inverse Kinematics Network that takes only the current and future visual states as input. Together these improve visual grounding and trajectory planning.

Findings

01  A 0.5B model matches the performance of larger models on benchmarks.

02  Visual grounding improves notably in dynamic driving scenarios.

03  The approach suppresses reliance on ego status and textual shortcuts.

Abstract

Existing Driving VLAs predict trajectories while largely ignoring their visual tokens -- a phenomenon we trace not to insufficient training but to a structurally ill-posed task formulation. We show that trajectory recovery, when viewed through the lens of inverse kinematics, requires both a current and a future visual state as boundary conditions; existing VLAs supply only the former, which encourages the model to shortcut through ego status and text commands alone. To address this, we re-design Driving VLA in the style of an inverse kinematics solver. First, a next visual state prediction objective that requires the LLM to predict the future visual scene provides dense visual supervision and suppresses shortcut paths. Second, a separate Inverse Kinematics Network (a cross-attention-based conditional diffusion model) that takes only the current and future visual states as input is…
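
The boundary-condition argument in the abstract can be illustrated with a toy geometric analogy. The sketch below is not the paper's Inverse Kinematics Network (which the abstract describes as a cross-attention-based conditional diffusion model). It assumes a hypothetical 2D setting in which a "visual state" is just a list of static landmark positions in the ego frame, and it recovers the ego motion between two such states with a closed-form rigid alignment. The function names `observe_future` and `recover_motion` are made up for illustration.

```python
import math

def observe_future(current, dx, dy, dtheta):
    """Toy 'future visual state': the same static landmarks, re-expressed
    in the ego frame after the ego moves by (dx, dy) and turns by dtheta."""
    c, s = math.cos(dtheta), math.sin(dtheta)
    future = []
    for x, y in current:
        tx, ty = x - dx, y - dy              # shift to the new ego origin
        future.append((c * tx + s * ty,      # rotate by -dtheta
                       -s * tx + c * ty))
    return future

def recover_motion(current, future):
    """Recover (dx, dy, dtheta) from paired landmark observations by
    fitting the rigid transform current = R(dtheta) @ future + d."""
    n = len(current)
    pcx = sum(x for x, _ in current) / n
    pcy = sum(y for _, y in current) / n
    qcx = sum(x for x, _ in future) / n
    qcy = sum(y for _, y in future) / n
    dot = cross = 0.0
    for (px, py), (qx, qy) in zip(current, future):
        ax, ay = qx - qcx, qy - qcy          # centered future point
        bx, by = px - pcx, py - pcy          # centered current point
        dot += ax * bx + ay * by
        cross += ax * by - ay * bx
    dtheta = math.atan2(cross, dot)
    c, s = math.cos(dtheta), math.sin(dtheta)
    dx = pcx - (c * qcx - s * qcy)
    dy = pcy - (s * qcx + c * qcy)
    return dx, dy, dtheta

# Hypothetical landmarks and ego motion, chosen only for the demo.
landmarks_now = [(5.0, 1.0), (10.0, -2.0), (7.0, 4.0)]
true_motion = (2.0, 0.3, 0.1)
landmarks_next = observe_future(landmarks_now, *true_motion)
estimated_motion = recover_motion(landmarks_now, landmarks_next)
```

In this toy setting, `landmarks_now` alone is consistent with any ego motion, while the pair (`landmarks_now`, `landmarks_next`) pins the motion down. That is the sense in which both a current and a future state act as boundary conditions; the paper's actual model operates on learned visual tokens rather than explicit landmarks.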
