VLA Foundry: A Unified Framework for Training Vision-Language-Action Models

Jean Mercat*†, Sedrick Keh*†, Kushal Arora, Isabella Huang, Paarth Shah, Haruki Nishimura, Shun Iwase, Katherine Liu
*Co-first authors
†Core contributors

We present VLA Foundry, an open-source framework that unifies LLM, VLM, and VLA training in a single codebase. Most open-source VLA efforts specialize in the action-training stage, often stitching together incompatible pretraining pipelines. VLA Foundry instead provides a shared training stack with end-to-end control, from language pretraining to action-expert fine-tuning, and supports both from-scratch training and pretrained backbones from Hugging Face. To demonstrate the utility of the framework, we train and release two families of models: the first trained fully from scratch through our LLM→VLM→VLA pipeline, and the second built on the pretrained Qwen3-VL backbone. We evaluate the closed-loop policy performance of both models on LBM Eval, an open-data, open-source simulator, and we contribute usability improvements to the simulator and to the STEP analysis tools for easier public use. In the nominal evaluation setting, our fully open from-scratch model is on par with our prior closed-source work, and substituting in the Qwen3-VL backbone yields a strong multi-task tabletop manipulation policy that outperforms our baseline by a wide margin.

Use VLA Foundry to train an LLM, then a VLM, and finally a VLA—all in one place

Pipeline diagram part 1

Bootstrap off pre-trained models from Hugging Face

Pipeline diagram part 2

Discover what's included

🎯 Multiple Modalities

Train on text, image-caption, or robotics data. Go from LLM to VLM to VLA within the same framework.

⚡ Multi-Node Training

Built on FSDP2 with WebDataset streaming. Multi-GPU training works locally with torchrun and on large clusters with AWS SageMaker.

🔀 Dataset Mixing

Specify dataset sources and ratios at dataloader time for easy dataset mixing and batch balancing across modalities.
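As a rough illustration of ratio-based mixing at dataloader time, the sketch below samples from multiple streams according to per-source weights. The function and argument names are illustrative, not the actual VLA Foundry API.

```python
import itertools
import random


def mixed_loader(sources, ratios, seed=0):
    """Yield (source_name, sample) pairs, picking the source for each
    sample according to the given mixing ratios.

    `sources` maps a name to an (effectively infinite) iterator, e.g. a
    streaming shard reader; `ratios` maps the same names to weights.
    """
    rng = random.Random(seed)
    names = list(sources)
    weights = [ratios[n] for n in names]
    while True:
        # Choose which modality/source contributes the next sample.
        name = rng.choices(names, weights=weights, k=1)[0]
        yield name, next(sources[name])


# Example: mix a text stream and a robotics stream at a 3:1 ratio.
streams = {
    "text": itertools.cycle(["txt_sample"]),
    "robot": itertools.cycle(["robot_sample"]),
}
loader = mixed_loader(streams, {"text": 3, "robot": 1}, seed=0)
drawn = [next(loader)[0] for _ in range(1000)]
```

In a real streaming setup the per-source iterators would wrap sharded WebDataset readers, and the same mechanism balances batches across modalities.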

🔧 Modular Design

Pure PyTorch implementation with no heavy external libraries. Easy to modify, extend, and add new models or data pipelines.

🤗 Hugging Face Support

Load pretrained weights from Hugging Face for LLMs, VLMs, CLIP models, and more. Use native or HF-backed implementations.

📋 Registry System

Self-registering models and batch handlers via decorators. Add new architectures without touching core code.
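The decorator-based registry pattern described above can be sketched as follows; the names (`MODEL_REGISTRY`, `register_model`, `build_model`) are illustrative stand-ins, not the actual VLA Foundry identifiers.

```python
# Minimal self-registering registry: decorating a class records it
# under a string key, so new architectures can be added in their own
# module without editing core code.
MODEL_REGISTRY = {}


def register_model(name):
    """Class decorator that records `cls` in MODEL_REGISTRY."""
    def wrap(cls):
        if name in MODEL_REGISTRY:
            raise KeyError(f"duplicate model name: {name}")
        MODEL_REGISTRY[name] = cls
        return cls
    return wrap


def build_model(name, **kwargs):
    """Instantiate a registered model from its config-string name."""
    return MODEL_REGISTRY[name](**kwargs)


@register_model("tiny_llm")
class TinyLLM:
    def __init__(self, hidden_size=128):
        self.hidden_size = hidden_size


model = build_model("tiny_llm", hidden_size=256)
```

Registration happens at import time, so a config file can name an architecture and the trainer resolves it through the registry without a central if/else over model types.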

Download and use our open LLM, VLM, and VLA models

Pre-trained models

See VLA results in action

Example success and failure rollouts from our Foundry-QwenVLA-2.5B model evaluated in LBM Eval, a challenging open-source simulation benchmark.

More Resources

BibTeX Citation

Technical Report

@techreport{mercat2026vlafoundry,
  title       = {{VLA Foundry}: A Unified Framework for Training Vision-Language-Action Models},
  author      = {Mercat, Jean and Keh, Sedrick and Arora, Kushal and Huang, Isabella and Shah, Paarth and Nishimura, Haruki and Iwase, Shun and Liu, Katherine},
  year        = {2026},
  institution = {Toyota Research Institute},
  note        = {Jean Mercat and Sedrick Keh contributed equally}
}

Software

@software{mercat2026vlafoundry_code,
  title   = {{VLA Foundry}: A Unified Framework for Training Vision-Language-Action Models},
  author  = {Mercat, Jean and Keh, Sedrick and Arora, Kushal and Huang, Isabella and Shah, Paarth and Nishimura, Haruki and Iwase, Shun and Liu, Katherine},
  year    = {2026},
  url     = {https://github.com/TRI-ML/vla_foundry},
  version = {1.0.0}
}