From Spatial to Actions: Grounding Vision-Language-Action Model in Spatial Foundation Priors
Zhengshen Zhang, Hao Li, Yalun Dai, Zhengbang Zhu, Lei Zhou, Chenchen Liu, Dong Wang, Francis E. H. Tay, Sijin Chen, Ziwei Liu, Yuxiao Liu, Xinghang Li, Pan Zhou

TL;DR
FALCON introduces a novel approach that incorporates rich 3D spatial priors into vision-language-action models, significantly improving spatial reasoning, transferability, and alignment for real-world tasks using RGB data alone.
Contribution
The paper presents FALCON, a paradigm that injects 3D spatial tokens into the action head, drawing geometric priors from spatial foundation models using RGB alone. Depth or pose can optionally be fused when available, without retraining or architectural changes.
Findings
Achieves state-of-the-art performance across benchmarks.
Robust under clutter and spatial prompts.
Outperforms competitive baselines in real-world tasks.
Abstract
Existing vision-language-action (VLA) models act in the 3D real world but are typically built on 2D encoders, leaving a spatial reasoning gap that limits generalization and adaptability. Recent 3D integration techniques for VLAs either require specialized sensors and transfer poorly across modalities, or inject weak cues that lack geometry and degrade vision-language alignment. In this work, we introduce FALCON (From Spatial to Action), a novel paradigm that injects rich 3D spatial tokens into the action head. FALCON leverages spatial foundation models to deliver strong geometric priors from RGB alone, and includes an Embodied Spatial Model that can optionally fuse depth or pose for higher fidelity when available, without retraining or architectural changes. To preserve language reasoning, spatial tokens are consumed by a Spatial-Enhanced Action Head rather than being concatenated into the…
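The data flow described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the token width `D`, the random projection matrices, the 7-dimensional pose and action vectors, the mean pooling, the concatenation-based fusion, and the `p_keep` modality sampling are all assumptions chosen only to show how spatial tokens built from RGB features might optionally absorb depth or pose and then feed an action head instead of the language backbone.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 16  # hypothetical shared token width


def linear(in_dim, out_dim):
    """Random projection standing in for a learned layer (assumption)."""
    return rng.normal(scale=0.1, size=(in_dim, out_dim))


W_depth = linear(1, D)    # per-patch depth scalar -> token width
W_pose = linear(7, D)     # hypothetical 7-D camera pose -> token width
W_act = linear(2 * D, 7)  # hypothetical 7-DoF action output


def spatial_tokens(rgb_tokens, depth=None, pose=None):
    """Build spatial tokens from RGB features, adding depth/pose when given."""
    tokens = rgb_tokens.copy()
    if depth is not None:
        tokens = tokens + depth[:, None] @ W_depth  # (N, 1) @ (1, D) -> (N, D)
    if pose is not None:
        tokens = tokens + pose @ W_pose             # (D,), broadcast over patches
    return tokens


def sample_modalities(depth, pose, p_keep=0.5):
    """Training-time sketch: keep each optional input with probability p_keep."""
    keep_depth = rng.random() < p_keep
    keep_pose = rng.random() < p_keep
    return (depth if keep_depth else None, pose if keep_pose else None)


def action_head(vlm_feature, tokens):
    """Fuse pooled spatial tokens with the VLM feature and predict an action."""
    pooled = tokens.mean(axis=0)                    # (D,)
    fused = np.concatenate([vlm_feature, pooled])   # (2 * D,)
    return fused @ W_act                            # (7,)


n_patches = 4
rgb_tokens = rng.normal(size=(n_patches, D))
vlm_feature = rng.normal(size=D)
depth = rng.uniform(0.5, 2.0, size=n_patches)
pose = rng.normal(size=7)

# The same head runs with RGB only or with the extra modalities.
action_rgb_only = action_head(vlm_feature, spatial_tokens(rgb_tokens))
action_full = action_head(vlm_feature, spatial_tokens(rgb_tokens, depth, pose))
```

In this sketch the VLM feature is never modified by the spatial tokens; they only meet inside `action_head`, which mirrors the stated goal of keeping spatial cues out of the language pathway.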
Peer Reviews
Decision: ICLR 2026 Poster
- The paper is very well-written and easy to follow.
- The model design and ideas are clearly novel and overall make sense.
- The problem setting is interesting and important: utilizing 3D information when available while remaining robust when it is not.
- The ablations are adequate and clearly show that the model is robust to various modality scenarios and can benefit from 3D data when available.
My main concern is that the evaluation benchmarks chosen, especially the simulation ones are weak and do not necessarily benefit from 3D understanding and information. For instance, the paper makes this statement: “FALCON surpasses previous methods that rely on ground-truth point clouds (e.g., 3DDP (Ze et al., 2024) and 3D Diffuser Actor (Ke et al., 2024)), improving the Avg. Len. by 4.13 and 1.05, respectively. This provides clear evidence of the effectiveness of our implicit spatial informat
1. This paper focuses on spatial reasoning in VLAs, which is a crucial research problem.
2. The experiments cover multiple benchmarks in both simulation and the real world.
3. Clear qualitative visualizations are provided to support the contribution.
4. The pipeline is complete: it combines a 2D VLM backbone, an embodied spatial model for 3D encoding, and a spatial-enhanced action head for fusion.
1. The core concept of this paper, “3D spatial tokens”, is ill-defined. This paper claims that 3D spatial tokens provide robust geometric priors. However, in line 132, the depth information is optional, and in line 224, the depth and/or pose is randomly injected. The resulting spatial tokens Tspl derive from DINO and an encoder with cross/self-attention. Thus, these spatial tokens are not truly 3D; they are 2D correlations.
2. In Eq. 4, they randomly inject depth and/or pose. The proposed stochas
### Overall
Overall I feel the paper makes a useful empirical contribution -- especially if weights and code are released.
* Strong empirical results
* Straightforward method (this is a good thing!) with clear use case
* Clear description of the approach
* Good ablations of different design decisions
* Clear figures

Please don't take the length of this section as an indicator. Mainly, these are the strengths that I'm seeing -- hopefully these are the same things the authors see too.
### Presentation
I found the paper a bit hard to read in its current form, and I feel the paper's impact may be limited by the writing and presentation. Many parts of the paper could be tightened by removing unnecessary adjectives and reorganizing sections.

**Organization:** In the current version, even basic information about the training approach, including the losses, learning algorithm, and datasets, is left entirely to the appendix. In contrast, the method section focuses almost exc
Taxonomy
Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Domain Adaptation and Few-Shot Learning
