# Diffusion-Based Models
VLA Foundry includes two diffusion-based architectures: Diffusion Policy for robotics action prediction and Stable Diffusion for image generation.
## Diffusion Policy

- **Type key:** `diffusion_policy`
- **Source:** `vla_foundry/models/diffusion_policy/diffusion_policy.py`
- **Params:** `DiffusionPolicyParams`
Diffusion Policy predicts robot actions by iteratively denoising a noise vector conditioned on visual and language observations. It is the primary architecture for robotics policy training in VLA Foundry.
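At inference time this amounts to a short denoising loop: sample a noise vector over the action horizon, then repeatedly refine it with the denoiser under fixed observation conditioning. Below is a minimal sketch of that loop; `encode_observations`, `denoise`, and the scheduler methods are hypothetical names standing in for the actual API.

```python
import torch

def sample_actions(policy, images, instruction, num_steps=10):
    # Encode camera images + instruction once; conditioning is fixed per step.
    cond = policy.encode_observations(images, instruction)   # hypothetical name
    # Start from pure noise over the action horizon.
    actions = torch.randn(1, policy.horizon, policy.action_dim)
    for t in policy.scheduler.timesteps(num_steps):          # hypothetical API
        pred = policy.denoise(actions, t, cond)              # predicted clean actions
        actions = policy.scheduler.step(pred, t, actions)    # move toward the prediction
    return actions
```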
### Architecture

```
Camera Images + Language Instruction
                 |
                 v
   CLIP Backbone (frozen or trainable)
        |                  |
        v                  v
  Visual Tokens       Text Tokens
        |                  |
        +--- Concatenate --+
                 |
                 v
       Conditioning Sequence
                 |
                 +--- Noisy Action Sequence (+ timestep embedding)
                 |
                 v
  Transformer Denoiser (bidirectional)
                 |
                 v
      Predicted Clean Actions
```
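A shape-level sketch of the data flow in the diagram, assuming hypothetical call signatures for the backbone and transformer (the real components are configured through `DiffusionPolicyParams`):

```python
import torch

def denoiser_forward(clip_backbone, transformer, images, text, noisy_actions, t_emb):
    # All signatures here are assumptions for illustration, not the library API.
    vis_tokens, txt_tokens = clip_backbone(images, text)
    cond = torch.cat([vis_tokens, txt_tokens], dim=1)   # conditioning sequence
    action_tokens = noisy_actions + t_emb               # "add"-style timestep conditioning
    seq = torch.cat([cond, action_tokens], dim=1)
    out = transformer(seq)                              # bidirectional: is_causal = false
    return out[:, cond.shape[1]:]                       # tokens at the action positions
```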
### Key Design Choices

- CLIP backbone for visual-language conditioning. The text encoder can be frozen independently of the image encoder.
- Bidirectional transformer as the denoiser (set `is_causal: false` on the transformer sub-component).
- Flow matching scheduler by default (`use_flow_matching_scheduler: true`), which provides faster convergence than DDPM; see the training sketch after this list.
- Diffusion step conditioning via concatenation (`"concat"`) or addition (`"add"`).
- Action and proprioception dimensions are automatically derived from `DataParams`.
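The flow matching objective is simple to state: interpolate linearly between clean actions and Gaussian noise, and train the denoiser to regress the velocity of that path (rather than the clean sample directly). A minimal training-step sketch; the `denoiser` signature is an assumption:

```python
import torch
import torch.nn.functional as F

def flow_matching_loss(denoiser, clean_actions, cond):
    # clean_actions: (batch, horizon, action_dim)
    noise = torch.randn_like(clean_actions)
    t = torch.rand(clean_actions.shape[0], device=clean_actions.device)  # t ~ U[0, 1]
    t_ = t.view(-1, 1, 1)
    x_t = (1.0 - t_) * clean_actions + t_ * noise  # straight path from data to noise
    target = noise - clean_actions                 # d(x_t)/dt, constant along the path
    pred = denoiser(x_t, t, cond)                  # assumed signature
    return F.mse_loss(pred, target)
```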
### Config Preset

```yaml
# config_presets/models/diffusion_policy.yaml
type: diffusion_policy
transformer:
  <<: !include transformer_100m.yaml
  is_causal: false
vision_language_backbone:
  type: clip_backbone
  hf_pretrained: openai/clip-vit-base-patch32
  disable_text: false
noise_scheduler:
  num_timesteps: 1000
  beta_start: 0.0001
  beta_end: 0.02
  clamp_range: [-3, 3]
use_flow_matching_scheduler: true
```
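For reference, `beta_start`, `beta_end`, and `num_timesteps` conventionally define a linear DDPM beta schedule (relevant when the flow matching scheduler is disabled). A sketch of the standard construction; how `clamp_range` is applied is an assumption here:

```python
import torch

num_timesteps, beta_start, beta_end = 1000, 1e-4, 0.02
betas = torch.linspace(beta_start, beta_end, num_timesteps)
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)  # remaining signal per timestep
# clamp_range: [-3, 3] plausibly bounds the normalized action values the
# scheduler operates on, e.g. x.clamp(-3, 3) between denoising steps (assumption).
```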
### Training Job Presets

| Preset | Task | File |
|---|---|---|
| `diffusion_policy_bellpepper` | LBM BellPepper bimanual task | `training_jobs/diffusion_policy_bellpepper.yaml` |
| `diffusion_policy_lbm1` | LBM1 bimanual manipulation | `training_jobs/diffusion_policy_lbm1.yaml` |
### Usage

```bash
torchrun --nproc_per_node=8 vla_foundry/main.py \
  --config_path vla_foundry/config_presets/training_jobs/diffusion_policy_bellpepper.yaml \
  --total_train_samples 30_000_000 \
  --num_checkpoints 10 \
  --remote_sync s3://my-bucket/diffusion_policy
```
## Stable Diffusion

- **Type key:** `stable_diffusion`
- **Source:** `vla_foundry/models/diffusion/stable_diffusion.py`
- **Params:** `StableDiffusionParams`
A text-conditioned latent diffusion model for image generation. Supports classifier-free guidance (CFG).
### Architecture

```
      Text Input
          |
          v
   CLIP Text Encoder
          |
          v
  Text Embeddings ---> UNet Denoiser <--- Noisy Latents + Timestep
                            |
                            v
              Predicted Noise / Clean Latents
```
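Classifier-free guidance runs the UNet twice per sampling step, with and without the text condition, and extrapolates from the unconditional prediction toward the conditional one by `guidance_scale`. A sketch of the standard formula, with assumed call signatures:

```python
import torch

def cfg_prediction(unet, latents, t, text_emb, null_emb, guidance_scale=4.0):
    # Signatures are assumptions; null_emb is e.g. the empty-prompt embedding.
    eps_cond = unet(latents, t, text_emb)     # conditioned prediction
    eps_uncond = unet(latents, t, null_emb)   # unconditioned prediction
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)
```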
### Components

| Component | Params | Description |
|---|---|---|
| UNet | `UNetParams` | The denoising backbone. Configurable channel counts per resolution level. |
| Noise Scheduler | `NoiseSchedulerParams` | DDPM or flow matching noise schedule. |
| CLIP | `CLIPHFParams` | Text encoder for conditioning. |
### Config Preset

```yaml
# Example Stable Diffusion configuration
model:
  type: stable_diffusion
  unet:
    type: unet
    in_channels: 3
    out_channels: 3
    time_emb_dim: 256
    text_emb_dim: 512
    channels: [128, 256, 512, 1024]
  noise_scheduler:
    num_timesteps: 1000
    beta_start: 0.0001
    beta_end: 0.02
  clip:
    type: clip_hf
    hf_pretrained: openai/clip-vit-base-patch32
  do_classifier_free_guidance: true
  guidance_scale: 4.0
  dropout_percent: 0.2
```
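The `dropout_percent` field supports CFG training: conventionally, a fraction of text embeddings in each batch is replaced with a null embedding so the model also learns an unconditional prediction. A sketch of that mechanism (whether `StableDiffusionParams` implements it exactly this way is an assumption):

```python
import torch

def drop_text_conditioning(text_emb, null_emb, dropout_percent=0.2):
    # text_emb: (batch, seq, dim); null_emb broadcastable to the same shape.
    drop = torch.rand(text_emb.shape[0], device=text_emb.device) < dropout_percent
    return torch.where(drop.view(-1, 1, 1), null_emb.expand_as(text_emb), text_emb)
```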
## Comparison
| Feature | Diffusion Policy | Stable Diffusion |
|---|---|---|
| Domain | Robotics actions | Image generation |
| Input | Camera images + language | Text |
| Denoiser | Transformer | UNet |
| Conditioning | CLIP visual-language | CLIP text |
| Scheduler | Flow matching (default) | DDPM or flow matching |
| Output | Action trajectory | Generated image |