ModelParams

`ModelParams` is the polymorphic base class for all model configurations. It uses draccus's `ChoiceRegistry`, so the concrete subclass is selected at runtime from the `type` field.

Source: `vla_foundry/params/model_params.py`

Base Fields

Every `ModelParams` subclass inherits these fields:

Field	Type	Default	Description
`type`	`str`	`None`	Registry key that selects the concrete subclass. Auto-populated from the registered key.
`resume_from_checkpoint`	`str`	`None`	Path to a checkpoint file to resume from.
`resume_weights_only`	`bool`	`False`	If `True`, load only model weights (skip optimizer state, step counter, etc.).
`freeze`	`bool`	`False`	If `True`, freeze all parameters in this module (no gradient updates).
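
The registry pattern described above can be sketched in plain Python. This is a minimal stand-in under stated assumptions, not draccus's `ChoiceRegistry` implementation: the `register` decorator, the `from_dict` helper, and the `*Sketch` classes are hypothetical names used only for illustration.

```python
from dataclasses import dataclass, fields
from typing import ClassVar, Dict, Optional, Type


@dataclass
class ModelParamsSketch:
    """Hypothetical stand-in for ModelParams with the four base fields."""

    type: Optional[str] = None
    resume_from_checkpoint: Optional[str] = None
    resume_weights_only: bool = False
    freeze: bool = False

    # Maps a registry key (the `type` value) to a concrete subclass.
    _registry: ClassVar[Dict[str, Type["ModelParamsSketch"]]] = {}

    @classmethod
    def register(cls, key: str):
        """Class decorator that records a subclass under `key` (hypothetical helper)."""
        def decorator(subclass):
            cls._registry[key] = subclass
            subclass._type_key = key
            return subclass
        return decorator

    def __post_init__(self):
        # Auto-populate `type` from the registered key when it is left unset.
        if self.type is None:
            self.type = getattr(self, "_type_key", None)

    @classmethod
    def from_dict(cls, cfg: dict) -> "ModelParamsSketch":
        """Pick the concrete subclass from cfg["type"], then build it."""
        subclass = cls._registry[cfg["type"]]
        known = {f.name for f in fields(subclass)}
        unknown = set(cfg) - known
        if unknown:
            raise ValueError(f"unknown fields for {cfg['type']!r}: {sorted(unknown)}")
        return subclass(**cfg)


@ModelParamsSketch.register("transformer")
@dataclass
class TransformerParamsSketch(ModelParamsSketch):
    hidden_dim: int = 96
    n_layers: int = 8


params = ModelParamsSketch.from_dict({"type": "transformer", "n_layers": 12, "freeze": True})
print(type(params).__name__, params.type, params.n_layers, params.hidden_dim)
```

The base fields (`freeze`, `resume_from_checkpoint`, and so on) are accepted by every registered subclass because they are inherited dataclass fields.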

TransformerParams

Type key: `transformer`

A from-scratch causal transformer (GPT-style) with configurable normalization, FFN type, and positional embeddings.

Field	Type	Default	Description
`norm_type`	`str`	`"default_layer_norm"`	Normalization layer type.
`ffn_type`	`str`	`"swiglu"`	Feed-forward network type.
`qk_norm`	`bool`	`False`	Apply normalization to query and key projections.
`positional_embedding_type`	`str`	`"rotary"`	Positional embedding type (e.g., `"rotary"`, `"learned"`).
`attn_name`	`str`	`"torch_attn"`	Attention implementation.
`hidden_dim`	`int`	`96`	Model hidden dimension.
`n_layers`	`int`	`8`	Number of transformer layers.
`n_heads`	`int`	`4`	Number of attention heads.
`vocab_size`	`int`	`50432`	Vocabulary size for the embedding layer.
`post_embed_norm`	`bool`	`False`	Apply layer norm after the embedding layer.
`norm_eps`	`float`	`1e-5`	Epsilon for normalization layers.
`weight_tying`	`bool`	`False`	Tie input embedding and output projection weights.
`cast_output_to_float32`	`bool`	`False`	Cast the output logits to float32 (useful for mixed-precision stability).
`max_seq_len`	`int`	`2048`	Maximum sequence length.
`is_causal`	`bool`	`True`	Use causal (autoregressive) attention masking.
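
To make the default sizes concrete, the sketch below mirrors a few of the fields above and derives two quantities from them. `TransformerShape`, `head_dim`, and `embedding_params` are hypothetical names, and the divisibility check is an assumption about how multi-head attention is usually set up, not a documented validation in `TransformerParams`.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class TransformerShape:
    """Hypothetical subset of TransformerParams, using the documented defaults."""

    hidden_dim: int = 96
    n_layers: int = 8
    n_heads: int = 4
    vocab_size: int = 50432
    max_seq_len: int = 2048

    @property
    def head_dim(self) -> int:
        # Multi-head attention splits hidden_dim evenly across heads.
        if self.hidden_dim % self.n_heads != 0:
            raise ValueError("hidden_dim must be divisible by n_heads")
        return self.hidden_dim // self.n_heads

    @property
    def embedding_params(self) -> int:
        # Token embedding table: one hidden_dim vector per vocabulary entry.
        return self.vocab_size * self.hidden_dim


default = TransformerShape()
wider = replace(default, hidden_dim=768, n_heads=12)
print(default.head_dim, wider.head_dim, default.embedding_params)
```

With `weight_tying` enabled, the output projection reuses the embedding table, so that parameter count is paid once rather than twice.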

TransformerHFParams

Type key: `transformer_hf`

Loads a transformer from a pretrained Hugging Face checkpoint. `hidden_dim` and `vocab_size` are not configured by hand; they are read from the Hugging Face config on first access.

Field	Type	Default	Description
`hf_pretrained`	`str`	`None`	Hugging Face model identifier (e.g., `"Qwen/Qwen2.5-0.5B"`).

Dynamic properties:

Property	Source	Description
`hidden_dim`	`AutoConfig.hidden_size`	Retrieved from the HF model config on first access.
`vocab_size`	`AutoConfig.vocab_size`	Retrieved from the HF model config on first access.
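
The "on first access" behavior can be sketched with a cached property. This is an illustration under assumptions, not the actual `TransformerHFParams` code: `TransformerHFSketch`, `config_loader`, and `fake_config_loader` are hypothetical, the model identifier and config values are made up, and the loader is injected so the example runs offline. With `transformers` installed, `AutoConfig.from_pretrained` could be passed as the loader instead.

```python
from dataclasses import dataclass
from functools import cached_property
from types import SimpleNamespace
from typing import Callable, List, Optional


def fake_config_loader(name: str) -> SimpleNamespace:
    """Offline stand-in for a Hugging Face config lookup; values are made up."""
    return SimpleNamespace(hidden_size=64, vocab_size=1000)


@dataclass
class TransformerHFSketch:
    hf_pretrained: Optional[str] = None
    # Hypothetical injection point so the sketch does not need network access.
    config_loader: Callable[[str], object] = fake_config_loader

    @cached_property
    def _hf_config(self):
        # Runs once; later reads reuse the cached config object.
        if self.hf_pretrained is None:
            raise ValueError("hf_pretrained must be set before reading dynamic properties")
        return self.config_loader(self.hf_pretrained)

    @property
    def hidden_dim(self) -> int:
        return self._hf_config.hidden_size

    @property
    def vocab_size(self) -> int:
        return self._hf_config.vocab_size


calls: List[str] = []


def counting_loader(name: str) -> SimpleNamespace:
    calls.append(name)  # record each config fetch
    return fake_config_loader(name)


hf_params = TransformerHFSketch(hf_pretrained="example-org/example-model",
                                config_loader=counting_loader)
calls_before = len(calls)  # nothing is fetched at construction time
print(hf_params.hidden_dim, hf_params.vocab_size, calls_before, len(calls))
```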

ViTParams

Type key: `vit`

A from-scratch Vision Transformer for image encoding.

Field	Type	Default	Description
`pretrained`	`str`	`None`	Path or identifier for pretrained ViT weights.
`interpolation_mode`	`str`	`"bicubic"`	Interpolation mode for positional embedding resizing.
`hidden_dim`	`int`	`768`	Hidden dimension.
`inter_dim`	`int`	`3072`	Intermediate (FFN) dimension.
`patch_size`	`int`	`16`	Patch size in pixels.
`img_size`	`int`	`384`	Input image size in pixels.
`n_heads`	`int`	`12`	Number of attention heads.
`dropout`	`float`	`0.0`	Dropout rate.
`n_layers`	`int`	`12`	Number of transformer layers.
`ln_eps`	`float`	`1e-6`	Layer norm epsilon.
`cls_flag`	`bool`	`False`	Include a CLS token.
`projector_pixel_shuffle_factor`	`int`	`1`	Pixel shuffle factor for the output projector. A factor of `r` merges each `r` x `r` block of patch tokens into one, so a factor of 2 reduces the token count by 4x.
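
The effect of `patch_size`, `img_size`, and `projector_pixel_shuffle_factor` on the token count can be worked out directly. The helper below is a hypothetical sketch that assumes square images and non-overlapping patches, and leaves the optional CLS token out of the count.

```python
def vit_token_count(img_size: int = 384, patch_size: int = 16,
                    pixel_shuffle_factor: int = 1) -> int:
    """Patch tokens left after pixel shuffle (CLS token not counted)."""
    if img_size % patch_size != 0:
        raise ValueError("img_size must be a multiple of patch_size")
    side = img_size // patch_size  # patches along one image side
    if side % pixel_shuffle_factor != 0:
        raise ValueError("patch grid side must be a multiple of the pixel shuffle factor")
    # Pixel shuffle folds each factor x factor block of patches into one token.
    return (side // pixel_shuffle_factor) ** 2


tokens_default = vit_token_count()                         # 24 x 24 patch grid
tokens_shuffled = vit_token_count(pixel_shuffle_factor=2)  # 12 x 12 token grid
print(tokens_default, tokens_shuffled)
```

With the defaults this gives 576 tokens per image, and 144 with a factor of 2.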

ViTHFParams

Type key: `vit_hf`

A Hugging Face-backed ViT. Inherits from `TransformerHFParams`.

Field	Type	Default	Description
`hf_pretrained`	`str`	`None`	Hugging Face model identifier (inherited from `TransformerHFParams`).
`hidden_dim`	`int`	`768`	Hidden dimension (overrides the dynamic property from `TransformerHFParams`).
`projector_pixel_shuffle_factor`	`int`	`1`	Pixel shuffle factor for the output projector.

VLMParams

Type key: `vlm`

A Vision-Language Model that composes a ViT image encoder with a transformer language model.

Field	Type	Default	Description
`vit`	`Union[ViTParams, ViTHFParams]`	`ViTParams()`	Image encoder configuration.
`transformer`	`Union[TransformerParams, TransformerHFParams]`	`TransformerParams()`	Language model configuration.
`image_token_id`	`int`	`None`	Token ID for image placeholder tokens. Auto-derived from the processor if not set.
`processor`	`str`	`None`	HF processor identifier (e.g., `"google/paligemma-3b-pt-224"`). Inherited from `DataParams.processor` if not set.

Shared Attributes

`VLMParams.init_shared_attributes` automatically resolves `image_token_id` and `vocab_size` from the data processor/tokenizer when possible, so you rarely need to set them manually.
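
The fallback order for `processor` and `image_token_id` can be sketched as follows. This is a guess at the precedence implied by the table ("if not set"), not the real `init_shared_attributes`: `VLMShared`, `StubProcessor`, and `resolve_shared` are hypothetical, and `vocab_size` is omitted because its precedence is not documented here.

```python
from dataclasses import dataclass
from typing import Optional


class StubProcessor:
    """Hypothetical processor exposing only the attribute this sketch needs."""

    image_token_id = 7  # made-up value


@dataclass
class VLMShared:
    """Hypothetical subset of VLMParams."""

    image_token_id: Optional[int] = None
    processor: Optional[str] = None


def resolve_shared(model: VLMShared, data_processor: Optional[str],
                   loaded_processor: StubProcessor) -> VLMShared:
    # An explicitly configured value wins; otherwise fall back to the data side.
    if model.processor is None:
        model.processor = data_processor
    if model.image_token_id is None:
        model.image_token_id = loaded_processor.image_token_id
    return model


auto = resolve_shared(VLMShared(), "google/paligemma-3b-pt-224", StubProcessor())
manual = resolve_shared(VLMShared(image_token_id=42), "google/paligemma-3b-pt-224", StubProcessor())
print(auto.processor, auto.image_token_id, manual.image_token_id)
```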

VLMHFParams

Type key: `vlm_hf`

Loads an entire VLM from a single Hugging Face checkpoint (e.g., PaliGemma, Qwen-VL).

Field	Type	Default	Description
`hf_pretrained`	`str`	`None`	Hugging Face model identifier.

DiffusionPolicyParams

Type key: `diffusion_policy`

A diffusion-based policy for robotics action prediction. Combines a vision-language backbone for conditioning with a transformer denoiser.

Field	Type	Default	Description
`vision_language_backbone`	`Union[VLMBackboneParams, CLIPBackboneParams, VLMFoundryBackboneParams, ViTBackboneParams]`	`CLIPBackboneParams()`	Visual and language conditioning backbone.
`transformer`	`Union[TransformerParams, TransformerHFParams]`	`ModelParams()`	Denoising transformer backbone.
`noise_scheduler`	`NoiseSchedulerParams`	`NoiseSchedulerParams()`	Noise schedule for the diffusion process.
`use_diffusers_scheduler`	`bool`	`False`	Use a Hugging Face Diffusers scheduler implementation.
`use_flow_matching_scheduler`	`bool`	`False`	Use flow matching instead of DDPM diffusion.
`input_noise_std`	`float`	`0.0`	Standard deviation of Gaussian noise added to inputs.
`diffusion_step_conditioning`	`Literal["add", "concat"]`	`"concat"`	How the diffusion timestep is injected into the transformer.
`action_dim`	`int`	`None`	Shared. Auto-set from `DataParams.action_dim`.
`proprioception_dim`	`int`	`0`	Shared. Auto-set from `DataParams.proprioception_dim`.

StableDiffusionParams

Type key: `stable_diffusion`

A Stable Diffusion model for image generation, with optional classifier-free guidance.

Field	Type	Default	Description
`unet`	`UNetParams`	`UNetParams()`	UNet architecture configuration.
`noise_scheduler`	`NoiseSchedulerParams`	`NoiseSchedulerParams()`	Noise schedule configuration.
`use_diffusers_unet`	`bool`	`False`	Use a Hugging Face Diffusers UNet.
`use_diffusers_scheduler`	`bool`	`False`	Use a Hugging Face Diffusers scheduler.
`use_flow_matching_scheduler`	`bool`	`False`	Use flow matching instead of DDPM.
`clip`	`CLIPHFParams`	`CLIPHFParams()`	CLIP text encoder configuration.
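`do_classifier_free_guidance`	`bool`	`False`	Enable classifier-free guidance (CFG).
`guidance_scale`	`float`	`4.0`	CFG guidance scale applied at sampling time.
`dropout_percent`	`float`	`0.2`	Probability of dropping the conditioning during training so the model also learns the unconditional prediction that CFG requires. Despite the name, the default suggests a fraction (0.2 = 20%).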
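
The two CFG-related fields act at different times: `dropout_percent` during training and `guidance_scale` during sampling. The sketch below shows the standard classifier-free guidance combination on plain Python lists. The function names are hypothetical, and using an all-zeros vector as the null conditioning is an assumption, not something this page specifies.

```python
import random
from typing import List, Optional


def maybe_drop_conditioning(text_emb: List[float], dropout_percent: float = 0.2,
                            rng: Optional[random.Random] = None) -> List[float]:
    """Training side: swap in a null embedding with probability dropout_percent."""
    rng = rng or random.Random()
    if rng.random() < dropout_percent:
        # Assumption: zeros stand in for the null (unconditional) embedding.
        return [0.0] * len(text_emb)
    return text_emb


def apply_cfg(eps_uncond: List[float], eps_cond: List[float],
              guidance_scale: float = 4.0) -> List[float]:
    """Sampling side: move from the unconditional toward the conditional prediction."""
    return [u + guidance_scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]


guided = apply_cfg([0.0, 1.0], [1.0, 1.5])
print(guided)
```

A `guidance_scale` of 1.0 returns the conditional prediction unchanged; larger values push further away from the unconditional one.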

Supporting Parameter Classes

UNetParams

Type key: `unet`

Field	Type	Default	Description
`in_channels`	`int`	`3`	Input channels.
`out_channels`	`int`	`3`	Output channels.
`time_emb_dim`	`int`	`256`	Timestep embedding dimension.
`text_emb_dim`	`int`	`512`	Text conditioning embedding dimension.
`channels`	`List[int]`	`[]`	Channel counts per UNet level (e.g., `[128, 256, 512, 1024]`).
`image_size`	`int`	`128`	Spatial resolution of the UNet input.
`time_mlp_float32`	`bool`	`False`	Run the time MLP in float32.

NoiseSchedulerParams

Type key: `noise_scheduler`

Field	Type	Default	Description
`num_timesteps`	`int`	`1000`	Total diffusion timesteps.
`beta_start`	`float`	`0.0001`	First value of the beta (noise variance) schedule.
`beta_end`	`float`	`0.02`	Last value of the beta (noise variance) schedule.
`clamp_range`	`Tuple[float, float]`	`(-1.5, 1.5)`	Output clamping range. Interacts with normalization settings.
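
To show how these four fields fit together, here is a minimal sketch that assumes a linear beta schedule, as in the original DDPM formulation. The page does not say which schedule shape the library uses, so treat the linear spacing and all function names as assumptions.

```python
from typing import List, Tuple


def linear_beta_schedule(num_timesteps: int = 1000, beta_start: float = 1e-4,
                         beta_end: float = 0.02) -> List[float]:
    """Evenly spaced betas from beta_start to beta_end (assumed linear schedule)."""
    if num_timesteps == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (num_timesteps - 1)
    return [beta_start + i * step for i in range(num_timesteps)]


def alphas_cumprod(betas: List[float]) -> List[float]:
    """Running product of (1 - beta_t): the share of signal variance left at step t."""
    out, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        out.append(running)
    return out


def clamp(values: List[float], clamp_range: Tuple[float, float] = (-1.5, 1.5)) -> List[float]:
    """Clip each value into [low, high]."""
    low, high = clamp_range
    return [min(max(v, low), high) for v in values]


betas = linear_beta_schedule()
abar = alphas_cumprod(betas)
clamped = clamp([-3.0, 0.2, 2.0])
print(len(betas), betas[0], abar[0] > abar[-1], clamped)
```

Under this assumed schedule the cumulative product decays toward zero by the last timestep, meaning the sample is almost pure noise there.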

CLIPHFParams

Type key: `clip_hf`

Inherits from `TransformerHFParams`.

Field	Type	Default	Description
`hf_pretrained`	`str`	`None`	HF model identifier (e.g., `"openai/clip-vit-base-patch32"`).
`freeze_text_encoder`	`bool`	`False`	Freeze the text encoder weights.
`freeze_image_encoder`	`bool`	`False`	Freeze the image encoder weights.

CLIP_OpenCLIPParams

Type key: `clip_openclip`

Field	Type	Default	Description
`architecture`	`str`	`None`	OpenCLIP architecture name.
`pretrained_weights`	`str`	`None`	Pretrained weights identifier.
`freeze_text_encoder`	`bool`	`False`	Freeze the text encoder weights.
`freeze_image_encoder`	`bool`	`False`	Freeze the image encoder weights.

CLIPBackboneParams

Type key: `clip_backbone`

Inherits from both `BackboneParams` and `CLIPHFParams`. Used as the vision-language backbone in `DiffusionPolicyParams`.

Field	Type	Default	Description
`disable_text`	`bool`	`False`	Disable the text branch (vision-only conditioning).

VLMBackboneParams

Type key: `vlm_backbone`

Inherits from both `BackboneParams` and `VLMHFParams`. Uses a pretrained VLM as a vision-language backbone for diffusion conditioning, extracting hidden states from the last N layers.

Field	Type	Default	Description
`hf_pretrained`	`str`	`None`	HF model identifier (inherited from `VLMHFParams`).
`num_vlm_layers_to_use`	`int`	`4`	Number of final VLM layers whose hidden states are extracted for diffusion conditioning.
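
Selecting the last N layers amounts to a slice over the per-layer hidden states. The helper below is a hypothetical sketch: it uses strings as placeholders for tensors and does not show how the library combines the selected states afterwards, which this page leaves unspecified.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def select_last_layers(hidden_states: Sequence[T], num_vlm_layers_to_use: int = 4) -> List[T]:
    """Return the hidden states of the final N layers, in layer order."""
    if not 1 <= num_vlm_layers_to_use <= len(hidden_states):
        raise ValueError("num_vlm_layers_to_use must be between 1 and the number of layers")
    return list(hidden_states[-num_vlm_layers_to_use:])


# Placeholder names stand in for per-layer hidden-state tensors.
layers = [f"layer_{i}" for i in range(12)]
selected = select_last_layers(layers)
print(selected)
```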

VLMFoundryBackboneParams

Type key: `vlm_foundry_backbone`

Inherits from `BackboneParams`. Uses a VLA Foundry-trained VLM as a vision-language backbone for diffusion conditioning, extracting hidden states from the last N layers.

Field	Type	Default	Description
`num_vlm_layers_to_use`	`int`	`4`	Number of final VLM layers whose hidden states are extracted for diffusion conditioning.

ViTBackboneParams

Type key: `vit_backbone`

Inherits from both `BackboneParams` and `ViTParams`. Uses a Vision Transformer as a vision-only backbone for diffusion conditioning. All fields are inherited from `ViTParams`.