VLA Foundry: A Unified Framework for Training Vision-Language-Action Models

Jean Mercat; Sedrick Keh; Kushal Arora; Isabella Huang; Paarth Shah; Haruki Nishimura; Shun Iwase; Katherine Liu

arXiv:2604.19728·cs.RO·April 22, 2026

1 Repo 8 Models

TL;DR

VLA Foundry is an open-source framework that unifies LLM, VLM, and VLA training in a single codebase. It supports both from-scratch training and pretrained backbones, and its released models show competitive performance on manipulation tasks.

Contribution

The paper introduces a shared training stack for VLA models that covers end-to-end training and evaluation, and it releases the code and models for community use.

Findings

01. The model trained from scratch matches the performance of prior closed-source work.

02. Using the Qwen3-VL backbone improves the multi-task manipulation policy.

03. The framework simplifies training and evaluation of VLA models.

Abstract

We present VLA Foundry, an open-source framework that unifies LLM, VLM, and VLA training in a single codebase. Most open-source VLA efforts specialize in the action training stage, often stitching together incompatible pretraining pipelines. VLA Foundry instead provides a shared training stack with end-to-end control, from language pretraining to action-expert fine-tuning. VLA Foundry supports both from-scratch training and pretrained backbones from Hugging Face. To demonstrate the utility of our framework, we train and release two types of models: the first trained fully from scratch through our LLM→VLM→VLA pipeline and the second built on the pretrained Qwen3-VL backbone. We evaluate closed-loop policy performance of both models on LBM Eval, an open-data, open-source simulator. We also contribute usability improvements to the simulator and the STEP analysis tools for easier public…

Code & Models

Repositories

TRI-ML/vla_foundry (GitHub)
