A Progressive Training Strategy for Vision-Language Models to Counteract Spatio-Temporal Hallucinations in Embodied Reasoning
Xiaoda Yang, Shuai Yang, Can Wang, Jingyang Xue, Menglan Tang, Checheng Yu, Xunzhe Zhou, Sashuai Zhou, Tao Jin, Lixin Yang, Xiangyu Yue, Zhou Zhao

TL;DR
This paper introduces a progressive training method for vision-language models that significantly reduces spatio-temporal reasoning hallucinations and improves dynamic reasoning capabilities.
Contribution
It develops a new Chain-of-Thought dataset and a progressive training framework combining supervised and weakly-labeled data to enhance reasoning accuracy.
Findings
Reduces forward-backward performance gap from over 70% to 6.53%.
Improves backbone accuracy in spatiotemporal reasoning tasks.
Develops models with more authentic dynamic reasoning abilities.
Abstract
Vision-Language Models (VLMs) have made significant strides in static image understanding but continue to face critical hurdles in spatiotemporal reasoning. A major bottleneck is "multi-image reasoning hallucination", where a massive performance drop between forward and reverse temporal queries reveals a dependence on superficial shortcuts instead of genuine causal understanding. To mitigate this, we first develop a new Chain-of-Thought (CoT) dataset that decomposes intricate reasoning into detailed spatiotemporal steps and definitive judgments. Building on this, we present a progressive training framework: it initiates with supervised pre-training on our CoT dataset to instill logical structures, followed by fine-tuning with scalable weakly-labeled data for broader generalization. Our experiments demonstrate that this approach not only improves backbone accuracy but also slashes the…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
