LARM: Large Auto-Regressive Model for Long-Horizon Embodied Intelligence

Zhuoling Li; Xiaogang Xu; Zhenhua Xu; SerNam Lim; Hengshuang Zhao

arXiv:2405.17424 · cs.CV · February 6, 2025

TL;DR

This paper introduces LARM, a large auto-regressive model that combines the deployment efficiency of RL agents with the generalization of large language models, and that learns to complete long-horizon embodied tasks in Minecraft without human intervention.

Contribution

LARM is a lightweight auto-regressive model, built on an LLM with fewer than 5B parameters, that outputs actions directly rather than text. It is trained with referee RL, in which a giant LLM acts as a referee to counter reward vanishment in long-horizon tasks.

Findings

01 LARM successfully completes complex Minecraft tasks requiring long decision chains.

02 The model outperforms prior methods in long-horizon embodied exploration.

03 The approach effectively mitigates reward vanishment in reinforcement learning.

Abstract

Recent embodied agents are primarily built on reinforcement learning (RL) or large language models (LLMs). RL agents are efficient to deploy but can perform only a few tasks. By contrast, giant LLM agents (often more than 1000B parameters) generalize well but demand enormous computing resources. In this work, we combine the advantages of both while avoiding their drawbacks by applying our proposed referee RL to a large auto-regressive model (LARM) that we develop. Specifically, LARM is built upon a lightweight LLM (fewer than 5B parameters) and directly outputs the next action to execute rather than text. We show mathematically that classic RL feedback vanishes in long-horizon embodied exploration and introduce a giant-LLM-based referee to handle this reward vanishment while training LARM. In this way, LARM learns to complete diverse open-world tasks…
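The abstract's claim that feedback vanishes over long horizons can be illustrated with a generic toy example. The sketch below is not the paper's derivation, which is not reproduced on this page. It assumes a standard setting with a single reward paid at the final step and a discount factor gamma; the function name discounted_signal and all parameter values are hypothetical.

```python
def discounted_signal(horizon: int, gamma: float = 0.99,
                      terminal_reward: float = 1.0) -> float:
    """Value, as seen from step 0, of a reward paid only at step `horizon`.

    Assumption for illustration: sparse terminal reward with standard
    exponential discounting. This is not LARM's actual analysis.
    """
    return terminal_reward * gamma ** horizon


if __name__ == "__main__":
    # The learning signal reaching the first action shrinks
    # exponentially as the decision chain gets longer.
    for horizon in (10, 100, 1000):
        print(f"horizon={horizon:5d}  signal at step 0 = "
              f"{discounted_signal(horizon):.3e}")
```

Under these assumptions, the signal reaching early decisions decays exponentially with the length of the decision chain. This is one common way to see why sparse-reward RL struggles on long tasks, and why a denser source of feedback such as a referee could help.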

Videos

LARM: Large Auto-Regressive Model for Long-Horizon Embodied Intelligence · slideslive

Taxonomy

Topics: Neural Networks and Applications