HiRT: Enhancing Robotic Control with Hierarchical Robot Transformers
Jianke Zhang, Yanjiang Guo, Xiaoyu Chen, Yen-Jen Wang, Yucheng Hu, Chengming Shi, Jianyu Chen

TL;DR
HiRT introduces a hierarchical transformer approach that balances computational efficiency and performance in robotic control, enabling real-time interaction and improved success rates in dynamic tasks.
Contribution
This work presents HiRT, a novel hierarchical transformer framework that reduces reliance on running VLMs at high frequency, improving dynamic task performance and computational efficiency in robotic control.
Findings
Doubles control frequency in static tasks while maintaining success rates.
Raises success rate from 48% to 75% in real-world dynamic manipulation tasks.
Shows significant improvements over baseline methods in both simulation and real-world experiments.
Abstract
Large Vision-Language-Action (VLA) models, leveraging powerful pre-trained Vision-Language Model (VLM) backends, have shown promise in robotic control due to their impressive generalization ability. However, the success comes at a cost. Their reliance on VLM backends with billions of parameters leads to high computational costs and inference latency, limiting the testing scenarios to mainly quasi-static tasks and hindering performance in dynamic tasks requiring rapid interactions. To address these limitations, this paper proposes HiRT, a Hierarchical Robot Transformer framework that enables a flexible trade-off between frequency and performance. HiRT keeps VLMs running at low frequencies to capture temporally invariant features while enabling real-time interaction through a high-frequency vision-based policy guided by the slowly updated features. Experiment results in both simulation and…
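The two-rate idea in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the names `run_hierarchical_loop`, `toy_slow_encoder`, `toy_fast_policy`, and the parameter `slow_every` are hypothetical, the encoder and policy are toy stand-ins for a VLM and a vision-based policy, and the loop is synchronous for simplicity. It only shows the control-flow pattern: an expensive encoder is refreshed every few steps, and a cheap policy acts at every step using the most recent cached feature.

```python
from typing import Callable, List, Optional, Sequence, Tuple

Vector = List[float]


def run_hierarchical_loop(
    observations: Sequence[Vector],
    slow_encoder: Callable[[Vector], Vector],
    fast_policy: Callable[[Vector, Vector], Vector],
    slow_every: int,
) -> Tuple[List[Vector], int]:
    """Run a two-rate control loop (illustrative sketch only).

    The slow encoder (a stand-in for a large VLM) is called once every
    `slow_every` steps; the fast policy is called at every step and is
    conditioned on the most recently cached slow feature.
    Returns the list of actions and the number of slow-encoder calls.
    """
    if slow_every < 1:
        raise ValueError("slow_every must be at least 1")

    latent: Optional[Vector] = None
    slow_calls = 0
    actions: List[Vector] = []

    for step, obs in enumerate(observations):
        # Refresh the cached feature only on every `slow_every`-th step.
        if step % slow_every == 0:
            latent = slow_encoder(obs)
            slow_calls += 1
        # The fast policy always sees the newest observation, but a
        # possibly stale (slowly updated) latent feature.
        assert latent is not None  # step 0 always triggers a refresh
        actions.append(fast_policy(obs, latent))

    return actions, slow_calls


# Hypothetical toy stand-ins, chosen only so the sketch is runnable.
def toy_slow_encoder(obs: Vector) -> Vector:
    """Summarize the observation by its mean (placeholder for a VLM)."""
    return [sum(obs) / len(obs)]


def toy_fast_policy(obs: Vector, latent: Vector) -> Vector:
    """Act on the offset between the current observation and the latent."""
    return [x - latent[0] for x in obs]


observations = [[float(i), float(i + 1)] for i in range(6)]
actions, slow_calls = run_hierarchical_loop(
    observations, toy_slow_encoder, toy_fast_policy, slow_every=3
)
print(len(actions), slow_calls)
```

With six observations and `slow_every=3`, the fast policy runs six times while the slow encoder runs only twice, which is the source of the compute saving the abstract describes. How stale the cached feature may become before performance degrades is the trade-off that the frequency setting controls.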
Taxonomy
Topics: Robotics and Automated Systems · Robot Manipulation and Learning · Robotic Path Planning Algorithms
Methods: Dense Connections · Residual Connection · Position-Wise Feed-Forward Layer · Adam · Attention Is All You Need · Linear Layer · Label Smoothing · Dropout · Byte Pair Encoding · Absolute Position Encodings
