Training Examples¶
This page walks through the annotated training scripts in the examples/training/ directory. Each example shows the full command with explanations for key flags.
Key Flags Reference¶
Before diving into examples, here is a summary of the most important training flags:
| Flag | Description |
|---|---|
| `--total_train_samples` | Total number of training samples to process across the entire run. Training ends when this count is reached. |
| `--num_checkpoints` | Number of evenly spaced checkpoints to save during training. |
| `--hparams.per_gpu_batch_size` | Micro-batch size per GPU. Determines memory usage per device. |
| `--hparams.global_batch_size` | Effective batch size across all GPUs and gradient-accumulation steps. Must be a multiple of `per_gpu_batch_size * nproc_per_node * nnodes`. |
| `--remote_sync` | S3 path where checkpoints and logs are uploaded. |
| `--config_path` | Path to a YAML config preset that sets all parameters for a training job. CLI flags override preset values. |
| `--model "include ..."` | Inline include of a YAML model config preset. |
| `--distributed.fsdp` | Enable Fully Sharded Data Parallel (FSDP2) training. |
| `--wandb` | Enable Weights & Biases logging. |
| `--resolve_configs` | Print the fully resolved config and optionally save it (useful for debugging). |
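The batch-size flags are linked by the number of gradient-accumulation steps. A minimal sketch of the arithmetic (plain Python, not project code; the function name is illustrative):

```python
def grad_accum_steps(global_batch_size, per_gpu_batch_size, nproc_per_node, nnodes=1):
    """Number of micro-batches accumulated before each optimizer step."""
    samples_per_step = per_gpu_batch_size * nproc_per_node * nnodes
    assert global_batch_size % samples_per_step == 0, (
        "global_batch_size must be a multiple of "
        "per_gpu_batch_size * nproc_per_node * nnodes"
    )
    return global_batch_size // samples_per_step

# llm_11m recipe below: 96 / (32 * 3 * 1) = 1, i.e. no gradient accumulation
print(grad_accum_steps(96, 32, 3))   # 1
# smolvlm recipe below: 512 / (8 * 8 * 1) = 8 micro-batches per optimizer step
print(grad_accum_steps(512, 8, 8))   # 8
```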
LLM Training (transformer_11m) — Minimal Example¶
The simplest training script. Trains an 11M-parameter transformer on text data using 3 GPUs.
Source: examples/training/llm_11m.sh
.venv/bin/torchrun --nproc_per_node=3 --nnodes=1 vla_foundry/main.py \
--model.type transformer \ # (1)!
--model "include vla_foundry/config_presets/models/transformer_11m.yaml" \ # (2)!
--model.cast_output_to_float32 True \ # (3)!
--distributed.fsdp True \ # (4)!
--data.type text \ # (5)!
--data.dataset_manifest ["s3://your-bucket/your-path/your_llm_pretraining_dataset/manifest.jsonl"] \ # (6)!
--data.dataset_modality ["text"] \
--data.dataset_weighting [1.0] \ # (7)!
--data.seq_len 2048 \ # (8)!
--total_train_samples 14_000_000 \ # (9)!
--num_checkpoints 5 \ # (10)!
--hparams.per_gpu_batch_size 32 \ # (11)!
--hparams.global_batch_size 96 \ # (12)!
--remote_sync s3://your-bucket/your-path/vla_foundry_scratch/models/llm_11m \ # (13)!
--resolve_configs True \ # (14)!
--resolve_configs_path ./ \
"$@"
1. Model architecture type — `transformer` for a native LLM.
2. Include a YAML preset that defines the 11M-parameter transformer config (layers, hidden dim, heads, etc.).
3. Cast the output logits to float32 for numerically stable loss computation.
4. Enable FSDP2 for distributed training across GPUs.
5. Data modality — `text` for pre-tokenized text data.
6. S3 path to the WebDataset manifest file listing all tar shards.
7. Dataset weight — used for mixing multiple datasets; `1.0` means this dataset gets 100% of samples.
8. Sequence length in tokens per sample.
9. Stop training after processing 14 million samples total.
10. Save 5 evenly spaced checkpoints over the course of training.
11. Each GPU processes 32 samples per micro-batch.
12. Effective batch size is 96 (= 32 per GPU × 3 GPUs, no gradient accumulation here).
13. Upload checkpoints and logs to this S3 path.
14. Print the fully resolved config for debugging; save it to `./resolved_config.yaml`.
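The sample budget and checkpoint count determine where checkpoints land. A sketch of the spacing, assuming checkpoints are taken at even sample intervals ending at the final sample (the exact placement logic is an assumption, not taken from the codebase):

```python
def checkpoint_samples(total_train_samples, num_checkpoints):
    """Sample counts at which evenly spaced checkpoints would be saved."""
    interval = total_train_samples // num_checkpoints
    return [interval * (i + 1) for i in range(num_checkpoints)]

# The llm_11m recipe: 14M samples, 5 checkpoints -> one every 2.8M samples
print(checkpoint_samples(14_000_000, 5))
# [2800000, 5600000, 8400000, 11200000, 14000000]
```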
VLM Training with PaliGemma¶
Train a 3B Vision-Language Model using a PaliGemma processor and image-caption data.
Source: examples/training/vlm_paligemma3b.sh
export TORCHDYNAMO_CAPTURE_SCALAR_OUTPUTS=1
.venv/bin/torchrun --nproc_per_node=3 --nnodes=1 vla_foundry/main.py \
--model "include vla_foundry/config_presets/models/vlm_3b.yaml" \ # (1)!
--distributed.fsdp True \
--data.type image_caption \ # (2)!
--data.processor google/paligemma-3b-pt-224 \ # (3)!
--data.dataset_manifest ["s3://your-bucket/your-path/datasets/datacompdr_1b/manifest.jsonl"] \
--data.dataset_modality ["image_caption"] \
--data.dataset_weighting [1.0] \
--data.seq_len 2048 \
--data.img_num_tokens 256 \ # (4)!
--total_train_samples 14_000_000 \
--num_checkpoints 5 \
--hparams.per_gpu_batch_size 1 \ # (5)!
--hparams.global_batch_size 3 \
--remote_sync s3://your-bucket/your-path/vla_foundry_scratch/models/vlm_paligemma_3b \
"$@"
1. The `vlm_3b.yaml` preset defines the full VLM architecture, including the vision encoder (ViT) and language model.
2. The `image_caption` data type expects WebDataset shards with image-text pairs.
3. The processor handles image tokenization and text tokenization using the PaliGemma tokenizer.
4. Number of image tokens per image — PaliGemma uses 256 tokens for 224×224 images.
5. With a 3B model, the per-GPU batch size is typically small to fit in memory.
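The `--data.dataset_weighting` flag only matters once several manifests are mixed. A hypothetical sketch of proportional sampling (plain Python; whether the real pipeline normalizes weights exactly this way is an assumption):

```python
import random

def sample_dataset(weights, rng):
    """Pick a dataset index in proportion to its weight (weights need not sum to 1)."""
    total = sum(weights)
    r = rng.uniform(0, total)
    for i, w in enumerate(weights):
        r -= w
        if r <= 0:
            return i
    return len(weights) - 1  # guard against floating-point edge cases

# Two datasets mixed 3:1 -- roughly 75% of samples come from dataset 0
rng = random.Random(0)
counts = [0, 0]
for _ in range(10_000):
    counts[sample_dataset([3.0, 1.0], rng)] += 1
print(counts)
```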
VLM Training Initialized from a Pretrained LLM¶
Train a SmolVLM-style VLM where the language-model tower is initialized from a pretrained LLM checkpoint instead of random weights.
Source: examples/training/vlm_smolvlm_full_fromllm.sh
.venv/bin/torchrun --nproc_per_node=8 --nnodes=1 vla_foundry/main.py \
--model "include vla_foundry/config_presets/models/smolvlm_load_llm.yaml" \ # (1)!
--model.transformer.resume_from_checkpoint s3://your-bucket/your-path/your_pretrained_llm_run/checkpoints/checkpoint_N.pt \ # (2)!
--model.transformer.resume_weights_only True \ # (3)!
--distributed.fsdp True \
--data.type image_caption \
--data.processor HuggingFaceTB/SmolVLM2-256M-Video-Instruct \
--data.dataset_manifest ["s3://your-bucket/your-path/datasets/datacompdr_1b/manifest.jsonl"] \
--data.dataset_modality ["image_caption"] \
--data.dataset_weighting [1.0] \
--data.seq_len 2048 \
--data.img_num_tokens 64 \ # (4)!
--data.image_size 224 \
--model.vit.img_size 224 \
--model.vit.patch_size 14 \
--model.vit.projector_pixel_shuffle_factor 2 \
--total_train_samples 50_000_000 \
--num_checkpoints 10 \
--hparams.per_gpu_batch_size 8 \
--hparams.global_batch_size 512 \
--hparams.torchcompile True \
--remote_sync s3://your-bucket/your-path/vla_foundry_scratch/models/vlm_smolvlm_fromllm_samples50m
1. The `smolvlm_load_llm.yaml` preset is configured to expect an external LLM checkpoint for the transformer tower.
2. Point at a previously trained LLM checkpoint to initialize the language-tower weights.
3. `resume_weights_only=True` means the optimizer state, scheduler, and training step are not restored — only the weights. This is the transfer-learning pattern (as opposed to resuming an interrupted run).
4. SmolVLM uses fewer image tokens (64) at 224×224 with patch size 14 and pixel-shuffle factor 2 — a smaller image-token budget than PaliGemma's 256.
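The image-token counts in the two VLM recipes follow from patch-grid arithmetic. A sketch, assuming the standard interpretation of patch grids and pixel shuffle (illustrative, not the project's actual code):

```python
def num_image_tokens(img_size, patch_size, pixel_shuffle_factor=1):
    """Image tokens produced by a ViT: (grid / shuffle_factor)^2.

    Pixel shuffle merges shuffle_factor x shuffle_factor patch groups
    into a single token, shrinking the grid along each axis.
    """
    grid = img_size // patch_size          # patches per side
    grid //= pixel_shuffle_factor          # tokens per side after shuffle
    return grid * grid

print(num_image_tokens(224, 14))       # 256  (PaliGemma recipe: 16x16 grid)
print(num_image_tokens(224, 14, 2))    # 64   (SmolVLM recipe: 8x8 grid)
```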
VLA DiffusionPolicy (PaliGemma2 backbone)¶
Train a vision-language-action DiffusionPolicy with a PaliGemma2 backbone on the BimanualPutRedBellPepperInBin robotics task.
Source: examples/training/vla_diffusion_redbellpepper_paligemma2.sh
.venv/bin/torchrun --nproc_per_node=2 --nnodes=1 vla_foundry/main.py \
--hparams.torchcompile True \
--config_path vla_foundry/config_presets/training_jobs/vla_diffusion_bellpepper.yaml \ # (1)!
--remote_sync s3://your-bucket/your-path/vla_foundry/model_checkpoints/vla_diffusion \
--num_checkpoints 5 \
--total_train_samples 1000 \
"$@"
1. The `vla_diffusion_bellpepper.yaml` preset bundles the DiffusionPolicy head, the PaliGemma2 VLM backbone, the Spartan data pipeline, and the LBM hyperparameters.
VLA DiffusionPolicy (Qwen3-VL-2B-Thinking backbone)¶
Same VLA recipe, but swap in a Qwen3-VL-2B-Thinking backbone at the CLI:
Source: examples/training/vla_diffusion_redbellpepper_qwen_2b_thinking.sh
uv run torchrun --nproc_per_node=1 --nnodes=1 vla_foundry/main.py \
--hparams.torchcompile False \
--config_path vla_foundry/config_presets/training_jobs/vla_diffusion_bellpepper.yaml \
--remote_sync s3://your-bucket/your-path/model_checkpoints/vla_diffusion \
--num_checkpoints 5 \
--total_train_samples 1000 \
--model.vision_language_backbone.hf_pretrained "Qwen/Qwen3-VL-2B-Thinking" \ # (1)!
--data.processor "Qwen/Qwen3-VL-2B-Thinking" \ # (2)!
"$@"
1. Override the backbone from the YAML preset by pointing `hf_pretrained` at a different HF model.
2. The data processor must match the backbone's tokenizer/image processor.
Diffusion Policy Training with config_path¶
Train a standalone DiffusionPolicy (no VLM) for robotics using a single YAML config preset.
Source: examples/training/diffusion_policy.sh
.venv/bin/torchrun --nproc_per_node=2 --nnodes=1 vla_foundry/main.py \
--config_path vla_foundry/config_presets/training_jobs/diffusion_policy_bellpepper.yaml \ # (1)!
--remote_sync s3://your-bucket/your-path/vla_foundry/model_checkpoints/diffusion_policy \
--num_checkpoints 5 \
--total_train_samples 100000 \
"$@"
1. The `config_path` flag loads a complete training-job config — model architecture, data pipeline, hyperparameters, and all. CLI flags like `--total_train_samples` override preset values.
Note
When using `--config_path`, you typically only need to specify overrides on the command line. The YAML preset contains all default values for the model, data, and hyperparameters.
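The preset-plus-overrides precedence can be pictured as a simple merge in which CLI flags win. This is a flat-dict simplification for illustration only; the real config system resolves nested keys:

```python
def resolve_config(preset, cli_overrides):
    """Illustrative precedence: preset values first, CLI flags win on conflict."""
    resolved = dict(preset)
    resolved.update(cli_overrides)
    return resolved

# Preset supplies defaults; the CLI overrides just one of them
preset = {"total_train_samples": 100_000, "num_checkpoints": 5, "seq_len": 2048}
cli = {"total_train_samples": 1_000}
print(resolve_config(preset, cli))
# {'total_train_samples': 1000, 'num_checkpoints': 5, 'seq_len': 2048}
```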
Finetuning from a Pretrained Checkpoint¶
Resume training from an existing checkpoint, loading only the model weights (not the optimizer state or training step).
Source: examples/training/resume.sh
.venv/bin/torchrun --nproc_per_node=8 --nnodes=1 vla_foundry/main.py \
--config_path vla_foundry/config_presets/training_jobs/lbm_hparams_4cams.yaml \
--remote_sync s3://your-bucket/model_checkpoints/diffusion_policy \
--num_checkpoints 5 \
--total_train_samples 1_000_000 \
--data.dataset_manifest "['s3://...manifest.jsonl']" \
--data.dataset_statistics "['s3://...stats.json']" \
--data.dataset_modality "['robotics']" \
--data.dataset_weighting "[1.0]" \
--model.resume_from_checkpoint "s3://...checkpoint_3_converted.pt" \ # (1)!
--model.resume_weights_only True \ # (2)!
--hparams.per_gpu_batch_size 128 \
--hparams.global_batch_size 1024
1. S3 path to a previously saved checkpoint file. The model weights are loaded from this checkpoint.
2. When `True`, only model weights are loaded — the optimizer state, learning-rate schedule, and training-step counter are not restored. This is useful for finetuning on a new dataset or task with a fresh optimizer.
Full resume vs. weight-only resume
- Full resume (`resume_weights_only=False`): restores everything (weights, optimizer, scheduler, step). Use this to continue a training run that was interrupted.
- Weight-only resume (`resume_weights_only=True`): only restores model weights. Use this for finetuning or transfer learning where you want a fresh training state.
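The weight-only pattern amounts to restoring just one part of the checkpoint dict. A hypothetical sketch (the checkpoint layout with `model`/`optimizer`/`scheduler`/`step` keys is an assumption, not the project's actual format):

```python
def select_resume_state(ckpt, weights_only):
    """Illustrative: choose which parts of a checkpoint dict to restore.

    weights_only=True  -> only the model weights (fresh optimizer for finetuning)
    weights_only=False -> everything, to continue an interrupted run
    """
    if weights_only:
        return {"model": ckpt["model"]}
    return {k: ckpt[k] for k in ("model", "optimizer", "scheduler", "step")}

ckpt = {"model": {"w": 1}, "optimizer": {"m": 0}, "scheduler": {"lr": 3e-4}, "step": 1200}
print(select_resume_state(ckpt, weights_only=True))   # {'model': {'w': 1}}
```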
Running on SageMaker¶
For cluster-scale training, wrap any of the above commands with sagemaker/launch_training.py. The launcher builds and pushes a Docker image, then submits the job to SageMaker via estimator.fit().
uv run --group sagemaker sagemaker/launch_training.py \
--sagemaker.user firstname.lastname \
--sagemaker.instance_count 1 \
--sagemaker.instance_type p5en \
--config_path vla_foundry/config_presets/training_jobs/diffusion_policy_bellpepper.yaml \
--remote_sync s3://your-bucket/your-path/model_checkpoints/diffusion_policy
See sagemaker/README.md for the full list of --sagemaker.* options and prerequisites.
Secrets file required
Create a `secrets.env` file in the project root with your `WANDB_API_KEY` and `HF_TOKEN` before launching SageMaker jobs.