Config Presets¶

Config presets are reusable YAML fragments stored in vla_foundry/config_presets/. They define standard configurations for models, data pipelines, hyperparameters, and complete training jobs. Presets are composed using YAML !include directives and can be overridden by command-line arguments.

Usage¶

Presets can be loaded in two ways:

CLI includeYAML !include

--model "include vla_foundry/config_presets/models/vlm_3b.yaml"

model:
  <<: !include ../models/diffusion_policy.yaml

Command-line arguments always take precedence over values from presets.

Model Presets¶

Located in vla_foundry/config_presets/models/.

Transformers¶

From-scratch causal transformers at various scales.

Preset	Type	hidden_dim	n_layers	n_heads	max_seq_len	Notes
`transformer_tiny.yaml`	`transformer`	64	2	2	128	~3M params. Testing only.
`transformer_11m.yaml`	`transformer`	96	8	4	2048	~11M params. Quick experiments.
`transformer_100m.yaml`	`transformer`	512	12	8	2048	~100M params. Also used as Diffusion Policy denoiser.
`transformer_410m.yaml`	`transformer`	1024	24	16	2048	~410M params.
`transformer_1b.yaml`	`transformer`	2048	24	16	2048	~1B params.

Hugging Face Transformers¶

Preset	Type	Model	File
`qwen_05b.yaml`	`transformer_hf`	Qwen2.5-0.5B	`config_presets/models/qwen_05b.yaml`

Vision Models¶

Preset	Type	Description	File
`vit_tiny.yaml`	`vit`	Tiny ViT for smoke testing (4 layers, 64 dim, patch 14, 224px)	`config_presets/models/vit_tiny.yaml`
`vit_paligemma.yaml`	`vit`	PaliGemma-style SigLIP ViT (27 layers, 1152 dim, patch 14)	`config_presets/models/vit_paligemma.yaml`
`vit_smolvlm2_256m.yaml`	`vit`	SmolVLM2-style ViT (12 layers, 768 dim, patch 16, 512px, pixel shuffle 2x)	`config_presets/models/vit_smolvlm2_256m.yaml`
`vit_smolvlm2_256m_224.yaml`	`vit`	SmolVLM2-style ViT at 224px (12 layers, 768 dim, patch 16)	`config_presets/models/vit_smolvlm2_256m_224.yaml`
`unet.yaml`	`unet`	UNet for Stable Diffusion (channels: 128/256/512/1024)	`config_presets/models/unet.yaml`

Vision-Language Models¶

Preset	Type	Description	File
`vlm_11m.yaml`	`vlm`	11M VLM: tiny ViT + 11M transformer. Smoke testing.	`config_presets/models/vlm_11m.yaml`
`vlm_100m.yaml`	`vlm`	100M VLM: tiny ViT + 100M transformer	`config_presets/models/vlm_100m.yaml`
`vlm_1b.yaml`	`vlm`	1B VLM: PaliGemma ViT + 1B transformer	`config_presets/models/vlm_1b.yaml`
`vlm_3b.yaml`	`vlm`	3B VLM: PaliGemma ViT + 18-layer Transformer (2048 dim)	`config_presets/models/vlm_3b.yaml`
`vlm_3b_gemma2_2b.yaml`	`vlm`	3B VLM: PaliGemma ViT + Gemma2-2B HF backbone	`config_presets/models/vlm_3b_gemma2_2b.yaml`
`smolvlm_load_llm.yaml`	`vlm`	VLM initialized from 1B transformer + SmolVLM2 224px ViT	`config_presets/models/smolvlm_load_llm.yaml`
`paligemma_load_llm.yaml`	`vlm`	VLM initialized from 1B transformer + PaliGemma ViT	`config_presets/models/paligemma_load_llm.yaml`

Policy Models¶

Preset	Type	Description	File
`diffusion_policy.yaml`	`diffusion_policy`	CLIP backbone + 100M transformer denoiser + flow matching	`config_presets/models/diffusion_policy.yaml`

VLA Diffusion Models¶

VLA Diffusion models use a VLM backbone (loaded from checkpoint) with a diffusion transformer head for action prediction.

Preset	Type	Description	File
`vla_diffusion_11m.yaml`	`diffusion_policy`	VLM backbone + tiny transformer denoiser. Smoke testing.	`config_presets/models/vla_diffusion_11m.yaml`
`vla_diffusion_100m.yaml`	`diffusion_policy`	VLM backbone + 11M transformer denoiser	`config_presets/models/vla_diffusion_100m.yaml`
`vla_diffusion_1b.yaml`	`diffusion_policy`	VLM backbone + 100M transformer denoiser	`config_presets/models/vla_diffusion_1b.yaml`
`vla_diffusion_paligemma2.yaml`	`diffusion_policy`	PaliGemma2-3B HF backbone + 100M transformer denoiser	`config_presets/models/vla_diffusion_paligemma2.yaml`

Data Presets¶

Located in vla_foundry/config_presets/data/.

Base Data Configurations¶

Preset	Type	Description	File
`diffusion_policy.yaml`	`robotics`	Base robotics data params for Diffusion Policy (CLIP processor, 1 past + 14 future timesteps)	`config_presets/data/diffusion_policy.yaml`
`vla_diffusion.yaml`	`robotics`	VLA Diffusion data params (PaliGemma2 processor, 224px, 256 img tokens, seq_len 2048)	`config_presets/data/vla_diffusion.yaml`

LBM (Large Behavior Model)¶

Stored in config_presets/data/lbm/.

Preset	Description	File
`lbm_data_params.yaml`	Base LBM robotics data (bimanual Panda, 6 cameras, proprioception + action fields)	`data/lbm/lbm_data_params.yaml`
`lbm_action_fields.yaml`	LBM action field definitions	`data/lbm/lbm_action_fields.yaml`
`lbm_data_camera_names_4cameras.yaml`	4-camera configuration	`data/lbm/lbm_data_camera_names_4cameras.yaml`
`lbm_data_camera_names_6cameras.yaml`	6-camera configuration	`data/lbm/lbm_data_camera_names_6cameras.yaml`
`lbm_language_annotations.yaml`	Language instruction type configuration	`data/lbm/lbm_language_annotations.yaml`
`lbm_image_augmentation_params.yaml`	Image augmentation settings	`data/lbm/lbm_image_augmentation_params.yaml`
`lbm_data_discard_key.yaml`	Keys to discard from the dataset	`data/lbm/lbm_data_discard_key.yaml`

Preprocessing Parameters¶

Preset	Description	File
`robotics_preprocessing_params_1past_14future.yaml`	Standard 1 past + 14 future timesteps	`data/robotics_preprocessing_params_1past_14future.yaml`
`robotics_preprocessing_params_5past_20future_lbmsize.yaml`	5 past + 20 future timesteps, 342x256 images	`data/robotics_preprocessing_params_5past_20future_lbmsize.yaml`

Hyperparameter Presets¶

Located in vla_foundry/config_presets/hparams/.

Preset	Description	Key Settings	File
`diffusion_policy.yaml`	Diffusion Policy hparams	`lr: 5e-4`, `loss: mse`, `grad_clip: 1.0`, `lr_cooldown_end: 1e-5`	`hparams/diffusion_policy.yaml`

Training Job Presets¶

Located in vla_foundry/config_presets/training_jobs/. These are complete experiment configurations that compose model, data, and hparam presets with task-specific overrides.

Preset	Model	Task	File
`diffusion_policy_bellpepper.yaml`	Diffusion Policy	LBM BellPepper bimanual manipulation	`training_jobs/diffusion_policy_bellpepper.yaml`
`diffusion_policy_lbm1.yaml`	Diffusion Policy	LBM1 bimanual manipulation (full config)	`training_jobs/diffusion_policy_lbm1.yaml`
`lbm_hparams_4cams.yaml`	LBM	4-camera hparam configuration	`training_jobs/lbm_hparams_4cams.yaml`
`lbm_hparams_6cams.yaml`	LBM	6-camera hparam configuration	`training_jobs/lbm_hparams_6cams.yaml`
`lbm_multitask_4cams.yaml`	Diffusion Policy	LBM multitask 4-camera with 410M transformer, EMA	`training_jobs/lbm_multitask_4cams.yaml`
`vla_diffusion_bellpepper.yaml`	VLA Diffusion	BellPepper task with PaliGemma2 VLM backbone	`training_jobs/vla_diffusion_bellpepper.yaml`
`vla_diffusion_tiny_test.yaml`	VLA Diffusion	Tiny VLA diffusion for smoke testing (local data)	`training_jobs/vla_diffusion_tiny_test.yaml`

Anatomy of a Training Job Preset¶

A training job preset composes presets from other categories:

# training_jobs/diffusion_policy_bellpepper.yaml
model:
  <<: !include ../models/diffusion_policy.yaml        # Model preset
  vision_language_backbone:
    type: clip_backbone
    hf_pretrained: openai/clip-vit-base-patch32
    freeze_text_encoder: True
  transformer:
    <<: !include ../models/transformer_100m.yaml       # Nested model preset
    is_causal: True

data:
  <<: !include ../data/lbm/lbm_data_params.yaml       # Robot-specific fields
  <<: !include ../data/diffusion_policy.yaml           # Base data params
  dataset_manifest:                                     # Task-specific data
    - s3://bucket/BimanualPutRedBellPepperInBin/manifest.jsonl
  dataset_statistics:
    - s3://bucket/BimanualPutRedBellPepperInBin/stats.json
  dataset_modality:
    - robotics
  dataset_weighting:
    - 1.0

distributed:
  fsdp: True

hparams:
  <<: !include ../hparams/diffusion_policy.yaml        # Hparam preset
  per_gpu_batch_size: 16
  global_batch_size: 128

Directory Structure¶

vla_foundry/config_presets/
|-- models/
|   |-- transformer_tiny.yaml
|   |-- transformer_11m.yaml
|   |-- transformer_100m.yaml
|   |-- transformer_410m.yaml
|   |-- transformer_1b.yaml
|   |-- qwen_05b.yaml
|   |-- vit_tiny.yaml
|   |-- vit_paligemma.yaml
|   |-- vit_smolvlm2_256m.yaml
|   |-- vit_smolvlm2_256m_224.yaml
|   |-- vlm_11m.yaml
|   |-- vlm_100m.yaml
|   |-- vlm_1b.yaml
|   |-- vlm_3b.yaml
|   |-- vlm_3b_gemma2_2b.yaml
|   |-- smolvlm_load_llm.yaml
|   |-- paligemma_load_llm.yaml
|   |-- unet.yaml
|   |-- diffusion_policy.yaml
|   |-- vla_diffusion_11m.yaml
|   |-- vla_diffusion_100m.yaml
|   |-- vla_diffusion_1b.yaml
|   +-- vla_diffusion_paligemma2.yaml
|-- data/
|   |-- diffusion_policy.yaml
|   |-- vla_diffusion.yaml
|   |-- robotics_preprocessing_params_1past_14future.yaml
|   |-- robotics_preprocessing_params_5past_20future_lbmsize.yaml
|   |-- lbm/
|   +-- preprocessing/
|-- hparams/
|   +-- diffusion_policy.yaml
+-- training_jobs/
    |-- diffusion_policy_bellpepper.yaml
    |-- diffusion_policy_lbm1.yaml
    |-- lbm_hparams_4cams.yaml
    |-- lbm_hparams_6cams.yaml
    |-- lbm_multitask_4cams.yaml
    |-- vla_diffusion_bellpepper.yaml
    +-- vla_diffusion_tiny_test.yaml

Creating Your Own Presets

The easiest way to start a new experiment is to copy an existing training job preset and modify the dataset paths, camera names, and field definitions. The model and hparam presets can be reused as-is in most cases.