Cross-Stage Coherence in Hierarchical Driving VQA: Explicit Baselines and Learned Gated Context Projectors
Gautam Kumar Jain, Carsten Markgraf, Julian Stähler

TL;DR
This paper studies cross-stage coherence in hierarchical driving VQA, comparing explicit prompt-based conditioning with implicit gated context projectors as ways to keep a model's planning answers consistent with its own perception.
Contribution
It introduces and evaluates two complementary mechanisms for improving cross-stage reasoning consistency in driving VQA: prompt-based conditioning, which needs no additional training, and gated context projectors, which are trained jointly with stage-specific QLoRA adapters.
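The summary does not spell out the three prompt-based conditioning strategies, so the following is only a minimal sketch of one plausible form: earlier-stage question/answer pairs are prepended as text to the later-stage question. The function name `build_conditioned_prompt`, the prompt wording, and the example questions are all hypothetical and are not taken from the paper.

```python
def build_conditioned_prompt(question, prior_qa, stage):
    """Prepend earlier-stage Q/A pairs to a later-stage question.

    question: the question for the current stage (e.g. Planning).
    prior_qa: list of (stage_name, question, answer) tuples from earlier stages.
    stage:    name of the current stage.
    The template below is an illustrative assumption, not the paper's prompt.
    """
    current = f"[{stage}] Question: {question}"
    if not prior_qa:
        # No earlier context: fall back to the unconditioned prompt.
        return current
    lines = ["Context from earlier stages:"]
    for prev_stage, prev_q, prev_a in prior_qa:
        lines.append(f"- ({prev_stage}) Q: {prev_q} A: {prev_a}")
    lines.append(current)
    return "\n".join(lines)


# Hypothetical usage: condition a planning question on a perception answer.
prompt = build_conditioned_prompt(
    question="What should the ego vehicle do next?",
    prior_qa=[("Perception", "What is ahead of the ego vehicle?",
               "A pedestrian is crossing.")],
    stage="Planning",
)
```

Because the context is passed as plain text, this kind of conditioning works with a frozen model, which matches the "no additional training" property of the explicit variant.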
Findings
Explicit prompt conditioning reduces NLI contradiction by up to 42.6% without additional training.
Implicit gated projectors achieve a 34% reduction in planning-stage NLI contradiction.
Planning language quality improves with gated projectors, but lexical and structural consistency degrade.
Abstract
Graph Visual Question Answering (GVQA) for autonomous driving organizes reasoning into ordered stages, namely Perception, Prediction, and Planning, where planning decisions should remain consistent with the model's own perception. We present a comparative study of cross-stage context passing on DriveLM-nuScenes using two complementary mechanisms. The explicit variant evaluates three prompt-based conditioning strategies on a domain-adapted 4B VLM (Mini-InternVL2-4B-DA-DriveLM) without additional training, reducing NLI contradiction by up to 42.6% and establishing a strong zero-training baseline. The implicit variant introduces gated context projectors, which extract a hidden-state vector from one stage and inject a normalized, gated projection into the next stage's input embeddings. These projectors are jointly trained with stage-specific QLoRA adapters on a general-purpose 8B VLM…
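The gated context projector described above can be illustrated with a small, dependency-free sketch. It assumes a linear projection followed by layer normalization, a single scalar sigmoid gate, and additive injection into every token embedding of the next stage; the abstract only states that the projection is "normalized" and "gated", so each of these choices, along with the class name `GatedContextProjector` and its parameters, is an assumption rather than the authors' design.

```python
import math
import random


def layer_norm(v, eps=1e-5):
    """Normalize a vector to zero mean and unit variance (no learned scale)."""
    mean = sum(v) / len(v)
    var = sum((x - mean) ** 2 for x in v) / len(v)
    return [(x - mean) / math.sqrt(var + eps) for x in v]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


class GatedContextProjector:
    """Hypothetical projector: hidden state of stage k -> embeddings of stage k+1."""

    def __init__(self, d_ctx, d_model, gate_bias=-2.0, seed=0):
        rng = random.Random(seed)
        scale = 1.0 / math.sqrt(d_ctx)
        # Projection matrix W with shape (d_model, d_ctx); randomly initialized here,
        # it would be learned in practice.
        self.W = [[rng.uniform(-scale, scale) for _ in range(d_ctx)]
                  for _ in range(d_model)]
        # Gate parameters; a negative bias keeps the gate mostly closed at init
        # (an assumed choice so the injection starts small).
        self.w_gate = [0.0] * d_ctx
        self.gate_bias = gate_bias

    def gate(self, h):
        """Scalar gate in (0, 1) computed from the context vector."""
        return sigmoid(sum(w * x for w, x in zip(self.w_gate, h)) + self.gate_bias)

    def inject(self, h, embeddings):
        """Add the gated, normalized projection of h to every token embedding."""
        projected = [sum(w * x for w, x in zip(row, h)) for row in self.W]
        context = layer_norm(projected)
        g = self.gate(h)
        return [[e + g * c for e, c in zip(token, context)] for token in embeddings]


# Hypothetical usage: a 4-dim perception hidden state and three 6-dim planning tokens.
h_perception = [0.5, -1.0, 2.0, 0.1]
planning_embeddings = [[0.0] * 6 for _ in range(3)]
projector = GatedContextProjector(d_ctx=4, d_model=6)
conditioned = projector.inject(h_perception, planning_embeddings)
```

With the gate near zero the next stage sees essentially its original embeddings, so the projector can only change behavior to the extent that training opens the gate.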
