DeeR-VLA: Dynamic Inference of Multimodal Large Language Models for Efficient Robot Execution

Yang Yue; Yulin Wang; Bingyi Kang; Yizeng Han; Shenzhi Wang; Shiji Song; Jiashi Feng; Gao Huang

arXiv:2411.02359 · cs.RO · November 5, 2024 · 2 cites

TL;DR

DeeR-VLA introduces a dynamic early-exit framework for multimodal large language models in robotics, enabling resource-efficient inference by adaptively adjusting model size based on task demands, thus reducing computation and memory use.

Contribution

The paper proposes a multi-exit architecture with novel early-termination algorithms for resource-efficient robotic vision-language models, maintaining performance while reducing hardware demands.

Findings

01 Achieves 5.2-6.5x reduction in computational costs

02 Reduces GPU memory usage by 2-6x

03 Maintains competitive performance on CALVIN robot benchmark

Abstract

MLLMs have demonstrated remarkable comprehension and reasoning capabilities with complex language and visual data. These advances have spurred the vision of establishing a generalist robotic MLLM proficient in understanding complex human instructions and accomplishing various embodied tasks. However, developing MLLMs for real-world robots is challenging due to the typically limited computation and memory capacities available on robotic platforms. In contrast, the inference of MLLMs involves storing billions of parameters and performing tremendous computation, imposing significant hardware demands. In our paper, we propose a Dynamic Early-Exit Framework for Robotic Vision-Language-Action Model (DeeR-VLA, or simply DeeR) that automatically adjusts the size of the activated MLLM based on each situation at hand. The approach leverages a multi-exit architecture in MLLMs, which allows the…
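
The multi-exit idea in the abstract can be illustrated with a small sketch. The abstract is truncated before it states the termination rule, so the criterion used here (stop once the predictions of two consecutive exits differ by less than a threshold) is an assumption of this example and not necessarily the paper's method. The names `early_exit_forward`, `l2_distance`, `blocks`, `heads`, and `threshold` are hypothetical, and plain Python functions stand in for the MLLM layers and action heads.

```python
from typing import Callable, List, Sequence, Tuple

Vector = List[float]
Block = Callable[[Vector], Vector]  # stand-in for a group of MLLM layers
Head = Callable[[Vector], Vector]   # stand-in for an action head at one exit


def l2_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two equally sized vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def early_exit_forward(
    features: Sequence[float],
    blocks: Sequence[Block],
    heads: Sequence[Head],
    threshold: float,
) -> Tuple[Vector, int]:
    """Run blocks in order and stop at the first exit that satisfies the rule.

    Assumed rule (not taken from the source): exit when the action predicted
    at the current exit is within `threshold` of the previous exit's action.
    Returns the chosen action and the number of blocks that were executed.
    """
    if not blocks or len(blocks) != len(heads):
        raise ValueError("need one head per block and at least one block")

    hidden = list(features)
    action: Vector = []
    previous = None
    for used, (block, head) in enumerate(zip(blocks, heads), start=1):
        hidden = block(hidden)   # compute only as deep as needed
        action = head(hidden)    # prediction at this exit
        if previous is not None and l2_distance(action, previous) < threshold:
            return action, used  # later blocks are skipped entirely
        previous = action
    return action, len(blocks)   # no exit fired: the full model was used


if __name__ == "__main__":
    # Toy model: every block halves its input and every head is the identity.
    toy_blocks = [lambda h: [0.5 * x for x in h]] * 6
    toy_heads = [lambda h: list(h)] * 6
    action, used = early_exit_forward([8.0], toy_blocks, toy_heads, threshold=0.75)
    print(f"exited after {used} of {len(toy_blocks)} blocks with action {action}")
```

In this toy setup the successive predictions are 4.0, 2.0, 1.0, and 0.5, so with a threshold of 0.75 the loop stops after the fourth of six blocks. A larger threshold exits earlier and a threshold of 0 never exits, which is the compute-versus-accuracy trade-off that a dynamic early-exit scheme exposes.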

Code & Models

Repositories

yueyang130/deer-vla
PyTorch · Official

Models

🤗 Yang130/DeeR-VLA
model · ♡ 2

Videos

DeeR-VLA: Dynamic Inference of Multimodal Large Language Models for Efficient Robot Execution · SlidesLive

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · AI-based Problem Solving and Planning