Grounding Driving VLA via Inverse Kinematics
Junsung Park, Hyunjung Shim

TL;DR
This paper introduces an inverse kinematics-inspired approach to improve visual grounding in driving VLA (vision-language-action) models, enabling smaller models to perform comparably to much larger ones.
Contribution
The paper re-designs driving VLAs around a next visual state prediction objective and a separate Inverse Kinematics Network, improving both visual grounding and trajectory planning performance.
Findings
A 0.5B model matches larger models' performance on benchmarks.
Visual grounding improves notably in dynamic driving scenarios.
The approach suppresses reliance on ego status and textual shortcuts.
Abstract
Existing Driving VLAs predict trajectories while largely ignoring their visual tokens -- a phenomenon we trace not to insufficient training but to a structurally ill-posed task formulation. We show that trajectory recovery, when viewed through the lens of inverse kinematics, requires both a current and a future visual state as boundary conditions; existing VLAs supply only the former, which encourages the model to shortcut through ego status and text commands alone. To address this, we re-design Driving VLA in the style of an inverse kinematics solver. First, a next visual state prediction objective that requires the LLM to predict the future visual scene provides dense visual supervision and suppresses shortcut paths. Second, a separate Inverse Kinematics Network (a cross-attention-based conditional diffusion model) that takes only the current and future visual states as input is…
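The boundary-condition argument in the abstract can be illustrated with a toy geometric analogy. The sketch below is not the paper's Inverse Kinematics Network and uses no visual tokens. It assumes a hypothetical helper, `arc_waypoints`, that places the ego vehicle at the origin heading along +x and connects it to a future position with a constant-curvature arc. The point is only that the same current state paired with different future states yields different trajectories, so the current state alone does not determine the path.

```python
import math


def arc_waypoints(target_x, target_y, n_points=5):
    """Toy inverse kinematics (hypothetical helper, not from the paper).

    The current state is fixed: position (0, 0), heading along +x.
    The future state is the position (target_x, target_y), assumed to lie
    ahead of the vehicle. Returns n_points waypoints on the circular arc
    that is tangent to +x at the origin and passes through the target.
    """
    d2 = target_x ** 2 + target_y ** 2
    if d2 == 0.0:
        return [(0.0, 0.0)] * n_points

    # Curvature of the circle tangent to +x at the origin through the target.
    kappa = 2.0 * target_y / d2
    # Total heading change accumulated along the arc.
    theta = 2.0 * math.atan2(target_y, target_x)

    waypoints = []
    for i in range(1, n_points + 1):
        frac = i / n_points
        if abs(kappa) < 1e-9:
            # Degenerate case: the target is straight ahead.
            waypoints.append((target_x * frac, 0.0))
        else:
            a = theta * frac
            waypoints.append((math.sin(a) / kappa, (1.0 - math.cos(a)) / kappa))
    return waypoints


# Same current state, two different future states.
left = arc_waypoints(10.0, 5.0)
right = arc_waypoints(10.0, -5.0)
```

Here `left` and `right` start from an identical current state yet curve in opposite directions. A predictor that sees only the current state cannot tell them apart without an extra cue, which in this analogy plays the role of the ego status and text commands that the abstract identifies as shortcuts.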
