VLA Foundry: A Unified Framework for Training Vision-Language-Action Models

Jean Mercat*†, Sedrick Keh*†, Kushal Arora, Isabella Huang, Paarth Shah, Haruki Nishimura, Shun Iwase, Katherine Liu
*Co-first authors
†Core contributors

We present VLA Foundry, an open-source framework that unifies LLM, VLM, and VLA training in a single codebase. Most open-source VLA efforts specialize in the action-training stage, often stitching together incompatible pretraining pipelines. VLA Foundry instead provides a shared training stack with end-to-end control, from language pretraining to action-expert fine-tuning, and supports both from-scratch training and pretrained backbones from Hugging Face. To demonstrate the utility of the framework, we train and release two families of models: the first trained fully from scratch through our LLM→VLM→VLA pipeline, and the second built on the pretrained Qwen3-VL backbone. We evaluate the closed-loop policy performance of both models on LBM Eval, an open-data, open-source simulator, and we contribute usability improvements to the simulator and to the STEP analysis tools for easier public use. In the nominal evaluation setting, our fully open from-scratch model is on par with our prior closed-source work, and substituting in the Qwen3-VL backbone yields a strong multi-task tabletop manipulation policy that outperforms our baseline by a wide margin.

Use VLA Foundry to train an LLM, then a VLM, and finally a VLA—all in one place

Pipeline diagram part 1

Bootstrap off pre-trained models from Hugging Face

Pipeline diagram part 2

Discover what's included

🎯 Multiple Modalities

Train on text, image-caption, or robotics data. Go from LLM to VLM to VLA within the same framework.

⚡ Multi-Node Training

Built on FSDP2 with WebDataset streaming. Multi-GPU training works locally with torchrun and on large clusters with AWS SageMaker.

🔀 Dataset Mixing

Specify dataset sources and ratios at dataloader time for easy dataset mixing and batch balancing across modalities.
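As a rough illustration of ratio-based mixing at dataloader time, the sketch below samples from multiple streams according to per-source weights. The function and argument names are illustrative, not the actual VLA Foundry API.

```python
import itertools
import random


def mixed_loader(sources, ratios, seed=0):
    """Yield (source_name, sample) pairs, picking the source for each
    sample according to the given mixing ratios.

    `sources` maps a name to an (effectively infinite) iterator, e.g. a
    streaming shard reader; `ratios` maps the same names to weights.
    """
    rng = random.Random(seed)
    names = list(sources)
    weights = [ratios[n] for n in names]
    while True:
        # Choose which modality/source contributes the next sample.
        name = rng.choices(names, weights=weights, k=1)[0]
        yield name, next(sources[name])


# Example: mix a text stream and a robotics stream at a 3:1 ratio.
streams = {
    "text": itertools.cycle(["txt_sample"]),
    "robot": itertools.cycle(["robot_sample"]),
}
loader = mixed_loader(streams, {"text": 3, "robot": 1}, seed=0)
drawn = [next(loader)[0] for _ in range(1000)]
```

In a real streaming setup the per-source iterators would wrap sharded WebDataset readers, and the same mechanism balances batches across modalities.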

🔧 Modular Design

Pure PyTorch implementation with no heavy external libraries. Easy to modify, extend, and add new models or data pipelines.

🤗 Hugging Face Support

Load pretrained weights from Hugging Face for LLMs, VLMs, CLIP models, and more. Use native or HF-backed implementations.

📋 Registry System

Self-registering models and batch handlers via decorators. Add new architectures without touching core code.
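The decorator-based registry pattern described above can be sketched as follows; the names (`MODEL_REGISTRY`, `register_model`, `build_model`) are illustrative stand-ins, not the actual VLA Foundry identifiers.

```python
# Minimal self-registering registry: decorating a class records it
# under a string key, so new architectures can be added in their own
# module without editing core code.
MODEL_REGISTRY = {}


def register_model(name):
    """Class decorator that records `cls` in MODEL_REGISTRY."""
    def wrap(cls):
        if name in MODEL_REGISTRY:
            raise KeyError(f"duplicate model name: {name}")
        MODEL_REGISTRY[name] = cls
        return cls
    return wrap


def build_model(name, **kwargs):
    """Instantiate a registered model from its config-string name."""
    return MODEL_REGISTRY[name](**kwargs)


@register_model("tiny_llm")
class TinyLLM:
    def __init__(self, hidden_size=128):
        self.hidden_size = hidden_size


model = build_model("tiny_llm", hidden_size=256)
```

Registration happens at import time, so a config file can name an architecture and the trainer resolves it through the registry without a central if/else over model types.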

Download and use our open LLM, VLM, and VLA models

Pre-trained models

See VLA results in action

Example success and failure rollouts from our Foundry-QwenVLA-2.5B model evaluated in LBM Eval, a challenging open-source simulation benchmark.

More Resources

BibTeX Citation

Technical Report

@techreport{mercat2026vlafoundry,
  title       = {{VLA Foundry}: A Unified Framework for Training Vision-Language-Action Models},
  author      = {Mercat, Jean and Keh, Sedrick and Arora, Kushal and Huang, Isabella and Shah, Paarth and Nishimura, Haruki and Iwase, Shun and Liu, Katherine},
  year        = {2026},
  institution = {Toyota Research Institute},
  note        = {Jean Mercat and Sedrick Keh contributed equally}
}

Software

@software{mercat2026vlafoundry_code,
  title   = {{VLA Foundry}: A Unified Framework for Training Vision-Language-Action Models},
  author  = {Mercat, Jean and Keh, Sedrick and Arora, Kushal and Huang, Isabella and Shah, Paarth and Nishimura, Haruki and Iwase, Shun and Liu, Katherine},
  year    = {2026},
  url     = {https://github.com/TRI-ML/vla_foundry},
  version = {1.0.0}
}