Step-Level Visual Grounding Faithfulness Predicts Out-of-Distribution Generalization in Long-Horizon Vision-Language Models
Md Ashikur Rahman, Md Arifur Rahman, Niamul Hassan Samin, Abdullah Ibne Hanif Arean, Juena Ahmed Noshin

TL;DR
This paper shows that models maintaining temporally grounded beliefs during long-horizon vision-language tasks generalize better out-of-distribution, with step-level grounding quality serving as a key predictor of robustness.
Contribution
It introduces the Step Grounding Rate (SGR) as a measurable indicator of behavioral faithfulness and demonstrates its strong correlation with out-of-distribution robustness across multiple models and benchmarks.
Findings
SGR predicts out-of-distribution retention with r=0.83
Grounding quality varies significantly within parameter-matched models
Counterfactual and cross-architecture tests confirm the importance of visual reliance
Abstract
We uncover a behavioral law of long-horizon vision-language models: models that maintain temporally grounded beliefs generalize better. Standard benchmarks measure only final-answer accuracy, which obscures how models use visual information; a model can guess correctly while its step-by-step reasoning is entirely unanchored to the visual input. We formalize this as behavioral faithfulness over long horizons, an empirically measurable property that quantifies whether a model's intermediate reasoning remains consistent with the evolving visual state. Across eight models on three long-horizon benchmarks, we demonstrate that temporal grounding quality is a leading indicator of robustness: the Step Grounding Rate (SGR) predicts out-of-distribution retention with r = 0.83 (permutation test), a relationship that holds within capacity-matched models and cannot be explained by…
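The abstract reports a correlation between SGR and out-of-distribution retention, checked with a permutation test. As a rough illustration of how such a check could be run, the sketch below computes a Pearson r and a two-sided permutation p-value in plain Python. The per-model numbers are made-up placeholders, not the paper's data, and the function names `pearson_r` and `permutation_p_value` are hypothetical; the paper's exact SGR definition and test procedure are not given in this excerpt.

```python
import random
from statistics import mean


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def permutation_p_value(xs, ys, n_permutations=10_000, seed=0):
    """Two-sided permutation p-value for the Pearson correlation.

    Shuffles ys to break any pairing with xs, then counts how often the
    shuffled |r| is at least as large as the observed |r|.
    """
    rng = random.Random(seed)
    observed = abs(pearson_r(xs, ys))
    shuffled = list(ys)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)
        if abs(pearson_r(xs, shuffled)) >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_permutations + 1)


# Placeholder values for eight hypothetical models (NOT from the paper).
sgr = [0.31, 0.38, 0.45, 0.52, 0.58, 0.66, 0.71, 0.80]
ood_retention = [0.42, 0.40, 0.55, 0.50, 0.63, 0.61, 0.74, 0.79]

r_obs = pearson_r(sgr, ood_retention)
p_val = permutation_p_value(sgr, ood_retention)
print(f"r = {r_obs:.2f}, permutation p = {p_val:.4f}")
```

With only eight models, a permutation test is a reasonable choice because it makes no normality assumption about either variable; the p-value resolution is limited by the number of permutations drawn.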
Taxonomy
Topics: Multimodal Machine Learning Applications · Explainable Artificial Intelligence (XAI) · Domain Adaptation and Few-Shot Learning
