QUAR-VLA: Vision-Language-Action Model for Quadruped Robots
Pengxiang Ding, Han Zhao, Wenjie Zhang, Wenxuan Song, Min Zhang, Siteng Huang, Ningxi Yang, Donglin Wang

TL;DR
This paper introduces QUAR-VLA, a vision-language-action framework for quadruped robots that integrates perception, planning, and decision-making to improve autonomous interaction and task execution.
Contribution
It presents a novel integrated paradigm and a transformer-based model, QUART, along with a large-scale dataset, QUARD, for training and evaluating vision-language-action tasks in quadruped robots.
Findings
Achieved high performance in diverse robotic tasks
Enabled emergent capabilities in quadruped robots
Validated approach with 4000 evaluation trials
Abstract
An important manifestation of robot intelligence is the ability to interact naturally and make decisions autonomously. Traditional approaches to robot control often compartmentalize perception, planning, and decision-making, simplifying system design but limiting the synergy between different information streams. This compartmentalization poses challenges in achieving seamless autonomous reasoning, decision-making, and action execution. To address these limitations, this paper introduces a novel paradigm, named Vision-Language-Action tasks for QUAdruped Robots (QUAR-VLA). This approach tightly integrates visual information and instructions to generate executable actions, effectively merging perception, planning, and decision-making. The central idea is to elevate the overall intelligence of the robot. Within this framework, a notable challenge lies in aligning fine-grained…
Taxonomy
Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Robotics and Sensor-Based Localization
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Dense Connections · Position-Wise Feed-Forward Layer · Absolute Position Encodings · Dropout · Layer Normalization · Byte Pair Encoding · Adam
