DeeR-VLA: Dynamic Inference of Multimodal Large Language Models for Efficient Robot Execution
Yang Yue, Yulin Wang, Bingyi Kang, Yizeng Han, Shenzhi Wang, Shiji Song, Jiashi Feng, Gao Huang

TL;DR
DeeR-VLA introduces a dynamic early-exit framework for multimodal large language models in robotics. It adapts how much of the model is activated to the situation at hand, which reduces computation and memory use at inference time.
Contribution
The paper proposes a multi-exit architecture with novel early-termination algorithms for resource-efficient robotic vision-language models, maintaining performance while reducing hardware demands.
Findings
Achieves 5.2-6.5x reduction in computational costs
Reduces GPU memory usage by 2-6x
Maintains competitive performance on CALVIN robot benchmark
Abstract
Multimodal large language models (MLLMs) have demonstrated remarkable comprehension and reasoning capabilities with complex language and visual data. These advances have spurred the vision of establishing a generalist robotic MLLM proficient in understanding complex human instructions and accomplishing various embodied tasks. However, developing MLLMs for real-world robots is challenging due to the typically limited computation and memory capacities available on robotic platforms. In contrast, the inference of MLLMs involves storing billions of parameters and performing tremendous computation, imposing significant hardware demands. In this paper, we propose a Dynamic Early-Exit Framework for Robotic Vision-Language-Action Model (DeeR-VLA, or simply DeeR) that automatically adjusts the size of the activated MLLM based on each situation at hand. The approach leverages a multi-exit architecture in MLLMs, which allows the…
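The early-exit idea described in the abstract can be illustrated with a small, generic sketch. This is not the authors' implementation. The function name `early_exit_forward`, the toy blocks, the shared `action_head`, and the exit rule (stop once two consecutive exits predict nearly the same action) are all assumptions made for illustration; the abstract as excerpted here does not specify the termination criterion.

```python
from typing import Callable, List, Sequence, Tuple

Vector = List[float]


def l2_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def early_exit_forward(
    x: Vector,
    blocks: Sequence[Callable[[Vector], Vector]],
    action_head: Callable[[Vector], Vector],
    threshold: float,
) -> Tuple[Vector, int]:
    """Run blocks in order and stop as soon as the predicted action stabilizes.

    Hypothetical exit rule (an assumption, not taken from the paper excerpt):
    exit at block i if the action predicted after block i is within
    `threshold` (L2 distance) of the action predicted after block i - 1.
    Returns the chosen action and the number of blocks actually executed.
    """
    if not blocks:
        raise ValueError("at least one block is required")

    hidden = x
    prev_action = None
    for i, block in enumerate(blocks, start=1):
        hidden = block(hidden)          # compute the next intermediate feature
        action = action_head(hidden)    # predict an action at this exit point
        if prev_action is not None and l2_distance(action, prev_action) < threshold:
            return action, i            # predictions agree: skip remaining blocks
        prev_action = action
    return prev_action, len(blocks)     # no early exit: the full model was used


if __name__ == "__main__":
    # Toy "model": each block moves the hidden state halfway toward 1.0.
    toy_blocks = [lambda h: [0.5 * (v + 1.0) for v in h]] * 6
    identity_head = lambda h: list(h)

    action, used = early_exit_forward([0.0, 0.0], toy_blocks, identity_head, 0.1)
    print(f"exited after {used} of {len(toy_blocks)} blocks, action={action}")
```

In this toy setup a larger `threshold` exits earlier and saves more computation, while a threshold of zero always runs every block. This mirrors the trade-off between cost and performance that a dynamic early-exit framework has to balance.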
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · AI-based Problem Solving and Planning
