GST-VLA: Structured Gaussian Spatial Tokens for 3D Depth-Aware Vision-Language-Action Models

Md Selim Sarowar, Omer Tariq, and Sungho Kim

arXiv:2603.09079·cs.CV·March 11, 2026

TL;DR

GST-VLA adds structured 3D Gaussian spatial tokens and depth-aware chain-of-thought reasoning to vision-language-action models, reporting state-of-the-art results of 96.4% on LIBERO and a 5.4% improvement on SimplerEnv.

Contribution

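The paper presents two components that add geometric understanding to vision-language-action models: a Gaussian Spatial Tokenizer (GST) that converts depth and semantic patch features into 3D Gaussian primitives, and 3D Depth-Aware Chain-of-Thought (DA-CoT) reasoning that supervises intermediate spatial thoughts.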

Findings

01 Achieves 96.4% on the LIBERO benchmark.

02 Improves performance on SimplerEnv by 5.4%.

03 Ablation studies confirm the effectiveness of each component.

Abstract

Vision-language-action (VLA) models encode visual observations as 2D patch tokens with no intrinsic geometric structure. We introduce GST-VLA with two contributions. First, the Gaussian Spatial Tokenizer (GST) converts frozen dense depth and frozen semantic patch features into $N_{g} = 128$ anisotropic 3D Gaussian primitives, each parameterized by a metric residual mean $\mu \in \mathbb{R}^{3}$, log-scale covariance $\log \sigma \in \mathbb{R}^{3}$, and learned opacity $\alpha \in (0, 1)$. The covariance eigenstructure encodes local surface orientation, and opacity provides per-primitive geometric confidence, both inaccessible from scalar depth. Spatial attention pooling with learned queries concentrates the fixed token budget on geometrically salient regions rather than distributing it uniformly. Second, 3D Depth-Aware Chain-of-Thought (DA-CoT) reasoning supervises four structured intermediate spatial thoughts, covering…
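
To make the tokenizer description concrete, the following is a minimal NumPy sketch of attention pooling from patch features into Gaussian primitives. It is an illustration under stated assumptions, not the authors' implementation: the function name `gaussian_tokens`, the linear heads `w_mu`, `w_logsig`, `w_alpha`, the feature width, and the patch count are all hypothetical. It also assumes that the "metric residual mean" is an offset added to an attention-weighted average of back-projected patch points, and that the covariance is diagonal, which the abstract does not confirm.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def gaussian_tokens(patch_feats, patch_xyz, queries, w_mu, w_logsig, w_alpha):
    """Pool P patches into N_g Gaussian primitives (hypothetical layout).

    patch_feats: (P, d) semantic patch features
    patch_xyz:   (P, 3) patch centers back-projected with metric depth
    queries:     (N_g, d) learned pooling queries
    """
    d = patch_feats.shape[1]
    # Each query attends over all patches, so the token budget can
    # concentrate on a few regions instead of a uniform grid.
    attn = softmax(queries @ patch_feats.T / np.sqrt(d), axis=-1)  # (N_g, P)
    pooled = attn @ patch_feats                                    # (N_g, d)
    anchor = attn @ patch_xyz                                      # (N_g, 3)
    mu = anchor + pooled @ w_mu            # assumed: residual on top of anchor
    log_sigma = pooled @ w_logsig          # per-axis log scale (assumed diagonal)
    alpha = 1.0 / (1.0 + np.exp(-(pooled @ w_alpha)))  # sigmoid keeps opacity in (0, 1)
    return mu, log_sigma, alpha[:, 0], attn

# Toy inputs; all sizes except N_g = 128 are assumptions.
rng = np.random.default_rng(0)
P, d, N_g = 256, 64, 128
patch_feats = rng.normal(size=(P, d))
patch_xyz = rng.uniform(-1.0, 1.0, size=(P, 3))
queries = rng.normal(size=(N_g, d))
w_mu = 0.01 * rng.normal(size=(d, 3))
w_logsig = 0.01 * rng.normal(size=(d, 3))
w_alpha = 0.01 * rng.normal(size=(d, 1))

mu, log_sigma, alpha, attn = gaussian_tokens(
    patch_feats, patch_xyz, queries, w_mu, w_logsig, w_alpha
)
print(mu.shape, log_sigma.shape, alpha.shape)
```

Predicting $\log \sigma$ rather than $\sigma$ keeps the scales positive after exponentiation without a constraint on the head, and the sigmoid does the same job for $\alpha \in (0, 1)$.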

Taxonomy

Topics: Multimodal Machine Learning Applications · Robot Manipulation and Learning · Human Pose and Action Recognition