Step-Level Visual Grounding Faithfulness Predicts Out-of-Distribution Generalization in Long-Horizon Vision-Language Models

Md Ashikur Rahman; Md Arifur Rahman; Niamul Hassan Samin; Abdullah Ibne Hanif Arean; Juena Ahmed Noshin

arXiv:2603.06828·cs.CV·March 10, 2026

Open Access

TL;DR

This paper shows that models maintaining temporally grounded beliefs during long-horizon vision-language tasks generalize better out-of-distribution, with step-level grounding quality serving as a key predictor of robustness.

Contribution

The paper introduces the Step Grounding Rate (SGR) as a measurable indicator of behavioral faithfulness and demonstrates that it correlates strongly with out-of-distribution robustness across multiple models and benchmarks.

Findings

01. SGR predicts out-of-distribution retention with r = 0.83

02. Grounding quality varies significantly within parameter-matched models

03. Counterfactual and cross-architecture tests confirm the importance of visual reliance

Abstract

We uncover a behavioral law of long-horizon vision-language models: models that maintain temporally grounded beliefs generalize better. Standard benchmarks measure only final-answer accuracy, which obscures how models use visual information; a model can guess correctly while its step-by-step reasoning is entirely unanchored to the visual input. We formalize this as behavioral faithfulness over long horizons, an empirically measurable property that quantifies whether a model's intermediate reasoning remains consistent with the evolving visual state. Across eight models on three long-horizon benchmarks, we demonstrate that temporal grounding quality is a leading indicator of robustness: the Step Grounding Rate (SGR) predicts out-of-distribution retention with $r = 0.83$ (permutation test $p = 0.003$), a relationship that holds within capacity-matched models and cannot be explained by…
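The statistical claim in the abstract (a Pearson correlation between per-model SGR and out-of-distribution retention, checked with a permutation test) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the abstract does not define how SGR is computed, so `step_grounding_rate` simply takes the fraction of reasoning steps flagged as consistent with the visual state, and the function names, the two-sided test, and the placeholder numbers are hypothetical rather than the authors' procedure or data.

```python
import random
from statistics import mean


def step_grounding_rate(step_flags):
    """Assumed definition: fraction of reasoning steps flagged as grounded.

    step_flags is a sequence of booleans, one per intermediate step,
    where True means the step was judged consistent with the visual state.
    """
    if not step_flags:
        raise ValueError("need at least one step")
    return sum(1 for flag in step_flags if flag) / len(step_flags)


def pearson_r(xs, ys):
    """Plain Pearson correlation coefficient between two equal-length lists."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def permutation_p_value(xs, ys, n_perm=10_000, seed=0):
    """Two-sided permutation test: shuffle ys and count |r| >= observed |r|."""
    rng = random.Random(seed)
    observed = abs(pearson_r(xs, ys))
    shuffled = list(ys)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if abs(pearson_r(xs, shuffled)) >= observed:
            hits += 1
    # Add-one correction so the estimated p-value is never exactly zero.
    return (hits + 1) / (n_perm + 1)


if __name__ == "__main__":
    # Placeholder values for eight hypothetical models, NOT the paper's data.
    sgr = [0.41, 0.48, 0.52, 0.57, 0.63, 0.66, 0.72, 0.80]
    ood_retention = [0.55, 0.50, 0.61, 0.64, 0.62, 0.73, 0.70, 0.82]
    print("r =", round(pearson_r(sgr, ood_retention), 3))
    print("p =", permutation_p_value(sgr, ood_retention))
```

With only eight models, a permutation test is a reasonable choice because it does not rely on the normality assumptions behind the usual parametric p-value for a correlation coefficient.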

Taxonomy

Topics: Multimodal Machine Learning Applications · Explainable Artificial Intelligence (XAI) · Domain Adaptation and Few-Shot Learning