GST-VLA: Structured Gaussian Spatial Tokens for 3D Depth-Aware Vision-Language-Action Models
Md Selim Sarowar, Omer Tariq, and Sungho Kim

TL;DR
GST-VLA adds structured 3D Gaussian spatial tokens and depth-aware chain-of-thought reasoning to vision-language-action models, reporting state-of-the-art results on the LIBERO and SimplerEnv benchmarks.
Contribution
The paper presents a novel Gaussian Spatial Tokenizer and depth-aware reasoning framework, enhancing geometric understanding in vision-language-action models.
Findings
Achieves 96.4% on the LIBERO benchmark.
Improves performance on SimplerEnv by 5.4%.
Ablation studies confirm the effectiveness of each component.
Abstract
VLA models encode visual observations as 2D patch tokens with no intrinsic geometric structure. We introduce GST-VLA with two contributions. First, the Gaussian Spatial Tokenizer (GST) converts frozen dense depth and frozen semantic patch features into anisotropic 3D Gaussian primitives, each parameterized by a metric residual mean, log-scale covariance, and learned opacity. The covariance eigenstructure encodes local surface orientation, and opacity provides per-primitive geometric confidence, both inaccessible from scalar depth. Spatial attention pooling with learned queries concentrates the fixed token budget on geometrically salient regions rather than distributing uniformly. Second, 3D Depth-Aware Chain-of-Thought (DA-CoT) reasoning supervises four structured intermediate spatial thoughts, covering…
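The tokenizer described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the linear heads (`w_mean`, `w_scale`, `w_alpha`), the random initialization, the pinhole intrinsics, the depth map sampled at patch resolution, and the choice to pool by plain softmax attention over patch features are all assumptions made for illustration. It shows only the data flow: each patch is lifted to a 3D anchor from depth, a residual mean, log-scale, and opacity are predicted from its features, and a fixed number of learned queries pool the primitives into tokens.

```python
import numpy as np

def backproject(depth, fx, fy, cx, cy):
    """Lift an (H, W) metric depth map to (H*W, 3) camera-frame points (pinhole model)."""
    h, w = depth.shape
    v, u = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    x = (u - cx) / fx * depth
    y = (v - cy) / fy * depth
    return np.stack([x, y, depth], axis=-1).reshape(-1, 3)

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def init_params(rng, feat_dim, num_queries):
    """Hypothetical parameters; in a real model these would be learned."""
    return {
        "w_mean": 0.1 * rng.normal(size=(feat_dim, 3)),    # residual-mean head
        "w_scale": 0.1 * rng.normal(size=(feat_dim, 3)),   # log-scale head
        "w_alpha": 0.1 * rng.normal(size=(feat_dim, 1)),   # opacity-logit head
        "queries": rng.normal(size=(num_queries, feat_dim)),
    }

def gaussian_spatial_tokens(depth, feats, params, intrinsics):
    """depth: (H, W) at patch resolution; feats: (H*W, D) patch features."""
    anchors = backproject(depth, *intrinsics)                 # (N, 3)
    mean = anchors + feats @ params["w_mean"]                 # residual around the depth anchor
    log_scale = feats @ params["w_scale"]                     # (N, 3), per-axis log-scale
    opacity = 1.0 / (1.0 + np.exp(-(feats @ params["w_alpha"])))  # (N, 1), in (0, 1)
    prims = np.concatenate([feats, mean, log_scale, opacity], axis=-1)  # (N, D + 7)
    # Learned queries attend over patches, so the token count is fixed at M.
    scores = params["queries"] @ feats.T / np.sqrt(feats.shape[1])      # (M, N)
    attn = softmax(scores, axis=-1)
    tokens = attn @ prims                                     # (M, D + 7)
    return tokens, attn, mean, np.exp(log_scale), opacity

rng = np.random.default_rng(0)
H, W, D, M = 4, 4, 8, 3
depth = rng.uniform(0.5, 2.0, size=(H, W))
feats = rng.normal(size=(H * W, D))
params = init_params(rng, D, M)
tokens, attn, mean, scale, opacity = gaussian_spatial_tokens(
    depth, feats, params, intrinsics=(50.0, 50.0, 2.0, 2.0)
)
print(tokens.shape)  # (3, 15): M tokens, each D + 7 values
```

Predicting the scale in log space keeps it positive after exponentiation, and the sigmoid keeps opacity in (0, 1). The sketch uses axis-aligned scales only; an anisotropic covariance with arbitrary orientation would additionally need a rotation per primitive, which is omitted here.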
Taxonomy
Topics: Multimodal Machine Learning Applications · Robot Manipulation and Learning · Human Pose and Action Recognition
