QUAR-VLA: Vision-Language-Action Model for Quadruped Robots

Pengxiang Ding; Han Zhao; Wenjie Zhang; Wenxuan Song; Min Zhang; Siteng Huang; Ningxi Yang; Donglin Wang

arXiv:2312.14457 · cs.RO · February 5, 2025 · 1 citation

Open Access

TL;DR

This paper introduces QUAR-VLA, a vision-language-action framework for quadruped robots that integrates perception, planning, and decision-making to improve autonomous interaction and task execution.

Contribution

It presents a novel integrated paradigm and a transformer-based model, QUART, along with a large-scale dataset, QUARD, for training and evaluating vision-language-action tasks in quadruped robots.

Findings

01. Achieved high performance in diverse robotic tasks

02. Enabled emergent capabilities in quadruped robots

03. Validated approach with 4000 evaluation trials

Abstract

An important manifestation of robot intelligence is the ability to interact naturally and make decisions autonomously. Traditional approaches to robot control often compartmentalize perception, planning, and decision-making, which simplifies system design but limits the synergy between different information streams. This compartmentalization makes it difficult to achieve seamless autonomous reasoning, decision-making, and action execution. To address these limitations, this paper introduces a novel paradigm named Vision-Language-Action tasks for QUAdruped Robots (QUAR-VLA). This approach tightly integrates visual information and instructions to generate executable actions, effectively merging perception, planning, and decision-making. The central idea is to elevate the overall intelligence of the robot. Within this framework, a notable challenge lies in aligning fine-grained…

Taxonomy

Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Robotics and Sensor-Based Localization

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Dense Connections · Position-Wise Feed-Forward Layer · Absolute Position Encodings · Dropout · Layer Normalization · Byte Pair Encoding · Adam