QUART-Online: Latency-Free Large Multimodal Language Model for Quadruped Robot Learning
Xinyang Tong, Pengxiang Ding, Yiguo Fan, Donglin Wang, Wenjie Zhang, Can Cui, Mingyang Sun, Han Zhao, Hongyin Zhang, Yonghao Dang, Siteng Huang, Shangke Lyu

TL;DR
This paper introduces QUART-Online, a latency-free multimodal language model for quadruped robots that achieves real-time inference and significantly improves task success rates by integrating vision, language, and compressed actions.
Contribution
The paper presents a latency-free MLLM that uses Action Chunk Discretization to enable real-time inference without degrading the language foundation model's performance during action instruction tuning.
Findings
Achieves real-time inference aligned with controller frequency.
Boosts task success rate by 65%.
Maintains language model performance despite latency reduction techniques.
Abstract
This paper addresses the inherent inference latency challenges associated with deploying multimodal large language models (MLLM) in quadruped vision-language-action (QUAR-VLA) tasks. Our investigation reveals that conventional parameter reduction techniques ultimately impair the performance of the language foundation model during the action instruction tuning phase, making them unsuitable for this purpose. We introduce a novel latency-free quadruped MLLM model, dubbed QUART-Online, designed to enhance inference efficiency without degrading the performance of the language foundation model. By incorporating Action Chunk Discretization (ACD), we compress the original action representation space, mapping continuous action values onto a smaller set of discrete representative vectors while preserving critical information. Subsequently, we fine-tune the MLLM to integrate vision, language, and…
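The abstract describes Action Chunk Discretization as mapping continuous action values onto a smaller set of discrete representative vectors. The paper's actual procedure is not given in this summary, so the sketch below only illustrates the general idea with plain nearest-neighbour vector quantization. The function names, the chunk length, and the toy codebook are all hypothetical and not taken from the paper.

```python
import math

def nearest_code(chunk, codebook):
    """Return the index of the codebook vector closest to `chunk` (squared Euclidean distance)."""
    best_idx, best_dist = 0, math.inf
    for idx, code in enumerate(codebook):
        dist = sum((a - b) ** 2 for a, b in zip(chunk, code))
        if dist < best_dist:
            best_idx, best_dist = idx, dist
    return best_idx

def discretize(actions, chunk_len, codebook):
    """Split a flat sequence of continuous actions into fixed-length chunks
    and replace each chunk with the index of its nearest codebook vector."""
    tokens = []
    for start in range(0, len(actions) - chunk_len + 1, chunk_len):
        chunk = actions[start:start + chunk_len]
        tokens.append(nearest_code(chunk, codebook))
    return tokens

def reconstruct(tokens, codebook):
    """Decode discrete tokens back into (approximate) continuous actions."""
    decoded = []
    for token in tokens:
        decoded.extend(codebook[token])
    return decoded

# Hypothetical toy codebook: three representative 2-step action chunks.
CODEBOOK = [[0.0, 0.0], [0.5, 0.5], [1.0, 1.0]]

# Hypothetical continuous action sequence (six control steps).
actions = [0.1, -0.1, 0.4, 0.6, 0.9, 1.0]

tokens = discretize(actions, chunk_len=2, codebook=CODEBOOK)
print(tokens)                         # [0, 1, 2]
print(reconstruct(tokens, CODEBOOK))  # [0.0, 0.0, 0.5, 0.5, 1.0, 1.0]
```

In this toy setup, six continuous values become three discrete tokens, which is the kind of compression that lets a language model emit fewer action tokens per control step. A learned codebook would replace the hand-written one in practice.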
Taxonomy
Topics: Natural Language Processing Techniques · Multimodal Machine Learning Applications · Topic Modeling
Methods: Sparse Evolutionary Training
