# Beyond Pixels: Introducing Geometric-Semantic World Priors for Video-based Embodied Models via Spatio-temporal Alignment

**Authors:** Jinzhou Tang, Jusheng Zhang, Sidi Liu, Waikit Xiu, Qinhan Lv, Xiying Li

arXiv: 2509.00210 · 2025-09-03

## TL;DR

This paper introduces VEME, a cross-modal alignment framework that enhances embodied AI models' spatio-temporal reasoning and generalization in dynamic environments through geometry-aware world modeling.

## Contribution

The paper proposes VEME, which combines three components to improve reasoning and planning in embodied models: a cross-modal alignment framework, a dynamic implicit cognitive map, and an instruction-based navigation and reasoning framework.

## Key findings

- Reports a 1%-3% improvement in accuracy and exploration efficiency on VSI-Bench and VLN-CE compared to traditional approaches.
- Enhances generalization in unseen scenes and dynamic environments.
- Improves exploration efficiency and long-term planning capabilities.

## Abstract

Achieving human-like reasoning in deep learning models for complex tasks in unknown environments remains a critical challenge in embodied intelligence. While advanced vision-language models (VLMs) excel in static scene understanding, their limitations in spatio-temporal reasoning and adaptation to dynamic, open-set tasks like task-oriented navigation and embodied question answering (EQA) persist due to inadequate modeling of fine-grained spatio-temporal cues and physical world comprehension. To address this, we propose VEME, a novel cross-modal alignment method that enhances generalization in unseen scenes by learning an ego-centric, experience-centered world model. Our framework integrates three key components: (1) a cross-modal alignment framework bridging objects, spatial representations, and visual semantics with spatio-temporal cues to enhance VLM in-context learning; (2) a dynamic, implicit cognitive map activated by world embedding to enable task-relevant geometric-semantic memory recall; and (3) an instruction-based navigation and reasoning framework leveraging embodied priors for long-term planning and efficient exploration. By embedding geometry-aware spatio-temporal episodic experiences, our method significantly improves reasoning and planning in dynamic environments. Experimental results on VSI-Bench and VLN-CE demonstrate a 1%-3% improvement in accuracy and exploration efficiency compared to traditional approaches.

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/2509.00210/full.md

## Figures

6 figures with captions in the complete paper: https://tomesphere.com/paper/2509.00210/full.md

## References

55 references — full list in the complete paper: https://tomesphere.com/paper/2509.00210/full.md

---
Source: https://tomesphere.com/paper/2509.00210