# Documentation Home
A framework for training Vision-Language-Action models. Train LLMs, VLMs, and VLAs — all in one place with pure PyTorch.
## Multiple Modalities
Train on text, image-caption pairs, or robotics data. Go from LLM to VLM to VLA using the same framework — no external dependencies.
## Multi-Node Training
Built on FSDP2 with WebDataset streaming. Multi-GPU training works locally with torchrun and on large clusters with AWS SageMaker.
## Dataset Mixing
Specify dataset sources and ratios at dataloader time for easy dataset mixing and batch balancing across modalities.
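To make the idea concrete, here is a minimal sketch of ratio-based mixing; the source names and function are hypothetical, not VLA Foundry's actual API. Each sample is drawn from one of the configured sources with probability proportional to its ratio:

```python
import random

# Hypothetical mixing spec: source name -> sampling ratio (sums to 1.0).
sources = {
    "text": 0.2,
    "image_caption": 0.5,
    "robotics": 0.3,
}

def sample_source(rng: random.Random) -> str:
    """Pick a dataset source according to the configured mixing ratios."""
    names = list(sources)
    weights = [sources[n] for n in names]
    return rng.choices(names, weights=weights, k=1)[0]

rng = random.Random(0)
draws = [sample_source(rng) for _ in range(10_000)]
# Over many draws, the empirical mix tracks the configured ratios.
fraction_image = draws.count("image_caption") / len(draws)
```

In practice the dataloader applies this kind of weighted choice per sample (or per batch slot), so changing the mix is a config edit rather than a data-preprocessing job.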
## Modular Design
Pure PyTorch implementation with no heavy external libraries. Easy to modify, extend, and add new models or data pipelines.
## Hugging Face Support
Load pretrained weights from Hugging Face for LLMs, VLMs, CLIP models, and more. Use native or HF-backed implementations.
## Registry System
Self-registering models and batch handlers via decorators. Add new architectures without touching core code.
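The self-registering pattern can be sketched as a plain decorator-based registry; the names below (`MODEL_REGISTRY`, `register_model`, `ToyVLM`) are illustrative, and the actual VLA Foundry registry may differ in detail:

```python
# Registry mapping a model name to its class, populated at import time.
MODEL_REGISTRY: dict[str, type] = {}

def register_model(name: str):
    """Class decorator: record the decorated class under `name`."""
    def wrap(cls: type) -> type:
        MODEL_REGISTRY[name] = cls
        return cls
    return wrap

@register_model("toy_vlm")
class ToyVLM:
    def __init__(self, hidden_dim: int = 16):
        self.hidden_dim = hidden_dim

def create_model(name: str, **kwargs):
    """Look up a registered class by name and instantiate it."""
    return MODEL_REGISTRY[name](**kwargs)

model = create_model("toy_vlm", hidden_dim=32)
```

Because registration happens as a side effect of the `@register_model` decorator, adding a new architecture only requires defining it in a module that gets imported — no edits to the factory or other core code.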
## Quick Overview
```bash
# Train a model
torchrun --nproc_per_node=8 vla_foundry/main.py \
    --model "include vla_foundry/config_presets/models/vlm_3b.yaml" \
    --data.type image_caption \
    --total_train_samples 14_000_000
```
```python
# Load and use a trained model
from vla_foundry.params.train_experiment_params import (
    ModelParams,  # import path for ModelParams assumed
    load_params_from_yaml,
)
from vla_foundry.models import create_model
from vla_foundry.utils import load_model_checkpoint

model_params = load_params_from_yaml(ModelParams, "path/to/config.yaml")
model = create_model(model_params)
load_model_checkpoint(model, "path/to/checkpoint.pt")
```
## What's Inside
| Section | Description |
|---|---|
| Getting Started | Install VLA Foundry and run your first training job |
| Concepts | Understand the architecture, config system, and data format |
| Guides | Step-by-step tutorials for common workflows |
| Reference | Detailed parameter and model API reference |
| Examples | Copy-paste-ready CLI scripts for training, preprocessing, and visualization |