DeepFleet: Multi-Agent Foundation Models for Mobile Robots

Ameya Agaskar; Sriram Siva; William Pickering; Kyle O'Brien; Charles Kekeh; Alexandre Ormiga Galvao Barbosa; Ang Li; Brianna Gallo Sarker; Alicia Chua; Mayur Nemade; Charun Thattai; Jiaming Di; Isaac Iyengar; Ramya Dharoor; Dino Kirouani; Jimmy Erskine; Tamir Hegazy; Scott Niekum; Usman A. Khan; Federico Pecora; Joseph W. Durham

arXiv:2508.08574·cs.RO·April 14, 2026

TL;DR

DeepFleet introduces multi-agent foundation models trained on extensive warehouse robot data, exploring various architectures for improved coordination and planning in large-scale robot fleets.

Contribution

The paper presents four novel multi-agent foundation model architectures tailored for robot fleet coordination, evaluating their design choices and scalability.

Findings

1. Robot-centric and graph-floor models perform best in prediction tasks.
2. Models benefit from larger datasets and scale effectively.
3. Localized interaction structures improve model performance.

Abstract

We introduce DeepFleet, a suite of foundation models designed to support coordination and planning for large-scale mobile robot fleets. These models are trained on fleet movement data, including robot positions, goals, and interactions, from hundreds of thousands of robots in Amazon warehouses worldwide. DeepFleet consists of four architectures that each embody a distinct inductive bias and collectively explore key points in the design space for multi-agent foundation models: the robot-centric (RC) model is an autoregressive decision transformer operating on neighborhoods of individual robots; the robot-floor (RF) model uses a transformer with cross-attention between robots and the warehouse floor; the image-floor (IF) model applies convolutional encoding to a multi-channel image representation of the full fleet; and the graph-floor (GF) model combines temporal attention with graph…
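To make two of the representations named in the abstract more concrete, here is a minimal sketch of how a fleet snapshot might be turned into (a) a multi-channel image of the whole floor, as the IF model consumes, and (b) a per-robot neighborhood, as the RC model operates on. Everything specific here is an assumption for illustration only: the grid-cell floor, the two channels (occupancy and goals), the Chebyshev-distance radius, and the helper names `rasterize_fleet` and `neighborhood` do not come from the paper, whose actual encodings the abstract does not spell out.

```python
from typing import List, Tuple

Cell = Tuple[int, int]  # (row, col) on a hypothetical grid floor


def rasterize_fleet(
    positions: List[Cell], goals: List[Cell], height: int, width: int
) -> List[List[List[int]]]:
    """Build a [2][height][width] grid.

    Channel 0 marks robot occupancy, channel 1 marks goal cells.
    The channel choice is an assumption, not the paper's layout.
    """
    image = [[[0] * width for _ in range(height)] for _ in range(2)]
    for row, col in positions:
        image[0][row][col] = 1
    for row, col in goals:
        image[1][row][col] = 1
    return image


def neighborhood(positions: List[Cell], ego: int, radius: int) -> List[int]:
    """Indices of other robots within Chebyshev distance `radius` of robot `ego`."""
    ego_row, ego_col = positions[ego]
    return [
        i
        for i, (row, col) in enumerate(positions)
        if i != ego and max(abs(row - ego_row), abs(col - ego_col)) <= radius
    ]


# Toy snapshot: three robots and their goals on a 5x5 floor.
positions = [(0, 0), (1, 1), (4, 4)]
goals = [(3, 0), (1, 3), (0, 4)]

image = rasterize_fleet(positions, goals, height=5, width=5)
near_robot_0 = neighborhood(positions, ego=0, radius=1)
```

The contrast between the two helpers mirrors the design question the abstract raises: the image view gives a model the full floor at once, while the neighborhood view restricts each robot's context to nearby robots.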
