NS-VLA: Towards Neuro-Symbolic Vision-Language-Action Models
Ziyue Zhu, Shangyang Wu, Shuai Zhao, Zhiqiu Zhao, Shengjie Li, Yi Wang, Fang Li, Haoran Luo

TL;DR
This paper introduces NS-VLA, a neuro-symbolic framework for vision-language-action tasks in robotics, combining symbolic encoding and reinforcement learning to improve data efficiency, generalization, and exploration.
Contribution
It presents a novel neuro-symbolic approach with online RL for VLA models, enhancing data efficiency, generalization, and primitive reuse in robotic manipulation.
Findings
Outperforms previous methods in one-shot training
Demonstrates superior zero-shot generalization
Achieves high data efficiency and exploration expansion
Abstract
Vision-Language-Action (VLA) models are formulated to ground instructions in visual context and generate action sequences for robotic manipulation. Despite recent progress, VLA models still face challenges in learning related and reusable primitives, reducing reliance on large-scale data and complex architectures, and enabling exploration beyond demonstrations. To address these challenges, we propose a novel Neuro-Symbolic Vision-Language-Action (NS-VLA) framework via online reinforcement learning (RL). It introduces a symbolic encoder to embed vision and language features and extract structured primitives, utilizes a symbolic solver for data-efficient action sequencing, and leverages online RL to optimize generation via expansive exploration. Experiments on robotic manipulation benchmarks demonstrate that NS-VLA outperforms previous methods in both one-shot training and…
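The encoder-then-solver flow described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the names `Primitive`, `encode`, `solve`, `VERBS`, and `SKILLS` are hypothetical, keyword matching stands in for the learned symbolic encoder, a fixed lookup table stands in for the symbolic solver, and the online RL stage is omitted entirely.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Primitive:
    """A structured, reusable unit of behavior (hypothetical representation)."""
    name: str    # e.g. "pick" or "place"
    target: str  # object the primitive acts on


# Assumed toy vocabulary; a real encoder would be learned from vision and language.
VERBS = {"pick", "place"}

# Assumed mapping from each primitive to low-level action templates.
SKILLS: Dict[str, List[str]] = {
    "pick": ["move_to:{target}", "close_gripper", "lift"],
    "place": ["move_to:{target}", "open_gripper"],
}


def encode(instruction: str, visible_objects: List[str]) -> List[Primitive]:
    """Stand-in for the symbolic encoder: pair each verb with the next visible object."""
    primitives: List[Primitive] = []
    verb: Optional[str] = None
    for word in instruction.lower().split():
        if word in VERBS:
            verb = word
        elif verb is not None and word in visible_objects:
            primitives.append(Primitive(verb, word))
            verb = None
    return primitives


def solve(primitives: List[Primitive]) -> List[str]:
    """Stand-in for the symbolic solver: expand primitives into an action sequence."""
    actions: List[str] = []
    for p in primitives:
        for template in SKILLS[p.name]:
            actions.append(template.format(target=p.target))
    return actions


primitives = encode("pick the block and place it in the bowl", ["block", "bowl"])
plan = solve(primitives)
```

Because the same `pick` and `place` entries are reused for any target object, the sketch hints at why structured primitives could reduce the amount of demonstration data needed, though the paper's actual mechanism may differ.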
Taxonomy
Topics: Multimodal Machine Learning Applications · Robot Manipulation and Learning · Reinforcement Learning in Robotics
