LBM Pipeline¶
This guide walks through the end-to-end process for LBM (Learned Behavior Models): preprocessing Spartan data, training a diffusion policy, and finetuning from pretrained weights.
Preprocessing¶
VLA Foundry uses the script vla_foundry/data/preprocessing/preprocess_robotics_to_tar.py to convert raw Spartan data into webdataset tar shards.
Ray parallelization
The preprocessing scripts use Ray for parallelization. Ray can run either locally or on EC2 nodes. See the Data Preprocessing guide for instructions on setting up and using Ray.
Dependencies
Preprocessing may require additional dependencies. Install them with:
Example Preprocessing Command¶
First, download the raw Spartan data for one or more tasks from the public registry (no credentials needed — uses https://tri-ml-public.s3.amazonaws.com/). Run --list to see all available public tasks.
python vla_foundry/data/scripts/download_dataset.py --task BimanualPutRedBellPepperInBin --local_path /tmp/raw_data --raw
python vla_foundry/data/scripts/download_dataset.py --task BimanualPlaceAppleFromBowlIntoBin --local_path /tmp/raw_data --raw
python vla_foundry/data/scripts/download_dataset.py --task BimanualStackPlatesOnTableFromDryingRack --local_path /tmp/raw_data --raw
Then preprocess. --source_episodes accepts a list of directories so you can merge multiple tasks into one output dataset (per-task language annotations are looked up from --language_annotations_path):
python vla_foundry/data/preprocessing/preprocess_robotics_to_tar.py \
--type "spartan" \
--source_episodes "[
'/tmp/raw_data/tasks/BimanualPutRedBellPepperInBin/',
'/tmp/raw_data/tasks/BimanualPlaceAppleFromBowlIntoBin/',
'/tmp/raw_data/tasks/BimanualStackPlatesOnTableFromDryingRack/',
]" \
--output_dir /tmp/preprocessed/lbm_multitask/ \
--camera_names "include vla_foundry/config_presets/data/lbm/lbm_data_camera_names_4cameras.yaml" \
--language_annotations_path vla_foundry/config_presets/data/lbm/lbm_language_annotations.yaml \
--action_fields_config_path vla_foundry/config_presets/data/lbm/lbm_action_fields.yaml \
--data_discard_keys "include vla_foundry/config_presets/data/lbm/lbm_data_discard_key.yaml" \
--samples_per_shard 100 \
--config_path "vla_foundry/config_presets/data/robotics_preprocessing_params_1past_14future.yaml"
Swap the local /tmp/... paths for s3://... URIs to read/write remotely at scale.
After preprocessing completes, check the output_dir folder. There should be a shards subfolder containing the files required for training (manifest, stats, and tar shards).
Training¶
The preprocessed dataset is referenced through the dataset_manifest and dataset_statistics fields inside the config at config_path.
.venv/bin/torchrun --nproc_per_node=2 --nnodes=1 vla_foundry/main.py \
--config_path vla_foundry/config_presets/training_jobs/diffusion_policy_bellpepper.yaml \
--remote_sync s3://your-bucket/your-path/model_checkpoints/diffusion_policy \
--num_checkpoints 5 \
--total_train_samples 100000
Resolving Configs¶
Use the --resolve_configs and --resolve_configs_path flags to inspect the fully resolved configuration before running training.
- Setting
--resolve_configs=Trueprints the resolved config to stdout. - Setting
--resolve_configs_pathsaves the resolved config to{resolve_configs_path}/resolved_config.yaml.
.venv/bin/torchrun --nproc_per_node=2 --nnodes=1 vla_foundry/main.py \
--config_path vla_foundry/config_presets/training_jobs/diffusion_policy_bellpepper.yaml \
--remote_sync s3://your-bucket/your-path/model_checkpoints/diffusion_policy \
--num_checkpoints 5 \
--total_train_samples 100000 \
--resolve_configs True \
--resolve_configs_path ./
SageMaker Training¶
For training on SageMaker, use the sagemaker/launch_training.py script. Its argument parser wraps the vla_foundry/main.py parser, so all training arguments can be reused. SageMaker-specific arguments use the sagemaker. prefix:
uv run --group=sagemaker sagemaker/launch_training.py \
--sagemaker.user firstname.lastname \
--sagemaker.instance_count 1 \
--sagemaker.instance_type p4de \
# copy-paste other training arguments here as-is, e.g., --data.something
Note
Use uv run --group=sagemaker to launch this script. There is no need for torchrun when using SageMaker. See sagemaker/README.md for the full list of options.
Finetuning from Pretrained Weights¶
To finetune a model from a pretrained checkpoint instead of training from scratch, use the --model.resume_from_checkpoint and --model.resume_weights_only flags. This loads only the model weights without resuming optimizer state or training progress.
uv run --group=sagemaker sagemaker/launch_training.py \
--sagemaker.user firstname.lastname \
--sagemaker.instance_count 1 \
--sagemaker.instance_type p4de \
--config_path vla_foundry/config_presets/training_jobs/diffusion_policy_bellpepper.yaml \
--remote_sync s3://your-bucket/your-path/model_checkpoints/finetuned \
--model.resume_from_checkpoint s3://your-bucket/your-path/vla_foundry/model_checkpoints/diffusion_policy/ablations/multitask/100m/2026_01_07-23_38_39-model_diffusion_policy-lr_5e-05-bsz_1024/checkpoints/checkpoint_3.pt \
--model.resume_weights_only True
Config-based approach
It is usually preferable to place resume_from_checkpoint and resume_weights_only directly into the config file referenced by --config_path rather than passing them on the command line.
Key details:
- The checkpoint path is validated early in training to fail fast with a clear error if the path is invalid.
- Use
--model.resume_weights_only=False(the default) to fully resume training including optimizer state.