HyperParams
HyperParams controls all optimization-related settings: learning rate, scheduler, optimizer, precision, batch sizing, and gradient handling.
Source: vla_foundry/params/hyper_params.py
Fields
Precision
| Field | Type | Default | Description |
|---|---|---|---|
| precision | str | "amp_bfloat16" | Training precision mode. See Precision Modes below. |
Batch Sizing
| Field | Type | Default | Description |
|---|---|---|---|
| global_batch_size | int | 512 | Total batch size across all GPUs and accumulation steps. |
| per_gpu_batch_size | int | 8 | Microbatch size per GPU per forward pass. |
Randomness
| Field | Type | Default | Description |
|---|---|---|---|
| seed | int | 42 | Random seed for reproducibility. Propagated to DataParams and other components. |
Learning Rate
| Field | Type | Default | Description |
|---|---|---|---|
| lr | float | 1e-4 | Peak learning rate. |
| lr_scheduler | str | "cosine" | Learning rate schedule. Options: "cosine", "const", "linear". |
| warmup | str | "1000" | Warmup duration. Integer values are treated as step counts; decimal values (e.g., "0.025") are treated as fractions of total training steps. |
| decay | str | "0.3" | Decay phase duration. Same format as warmup. |
| lr_cooldown_end | float | 0.0 | Learning rate at the end of the cooldown phase. Must be less than or equal to lr. |
| force_min_lr | float | 0.0 | Absolute minimum learning rate floor. |
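The string format shared by warmup and decay can be illustrated with a small parser. This is a minimal sketch, not the code in hyper_params.py: the helper name `resolve_duration`, the `total_steps` argument, and rounding to the nearest step are all assumptions made for illustration.

```python
def resolve_duration(value: str, total_steps: int) -> int:
    """Convert a warmup/decay string into a number of steps.

    Hypothetical helper: integer strings are step counts, decimal
    strings are fractions of total_steps. Rounding to the nearest
    step is an assumption, not documented behavior.
    """
    try:
        # "1000" parses as an integer, so it is a step count.
        return int(value)
    except ValueError:
        # "0.025" does not parse as an integer, so it is a fraction.
        fraction = float(value)
        return round(fraction * total_steps)


warmup_steps = resolve_duration("1000", total_steps=100_000)
decay_steps = resolve_duration("0.3", total_steps=100_000)
print(warmup_steps, decay_steps)  # 1000 30000
```

Keeping both forms in one string field lets the same config work across runs of different lengths when a fraction is used, while an integer pins the duration exactly.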
Optimizer
| Field | Type | Default | Description |
|---|---|---|---|
| optimizer | str | "adamw" | Optimizer type. |
| wd | float | 0.01 | Weight decay. |
| beta1 | float | 0.9 | Adam beta1 (first moment decay). |
| beta2 | float | 0.95 | Adam beta2 (second moment decay). |
| eps | float | 1e-8 | Adam epsilon for numerical stability. |
Loss
| Field | Type | Default | Description |
|---|---|---|---|
| loss_function | str | "cross_entropy" | Loss function. Options include "cross_entropy", "mse". |
| z_loss_coefficient | float | 0.0 | Coefficient for the auxiliary z-loss (logit regularization). Set to 0.0 to disable. |
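To make the z-loss term concrete, the sketch below uses the common formulation in which the squared log-partition value log(sum(exp(logits))) is averaged over the batch and scaled by the coefficient. The source does not state which formula the codebase uses, so this formulation and the function names `log_sum_exp` and `z_loss` are assumptions. It is written in plain Python rather than PyTorch to stay self-contained.

```python
import math


def log_sum_exp(logits: list[float]) -> float:
    """Numerically stable log(sum(exp(x))) for one row of logits."""
    peak = max(logits)
    return peak + math.log(sum(math.exp(x - peak) for x in logits))


def z_loss(batch_logits: list[list[float]], coefficient: float) -> float:
    """Mean squared log-partition term, scaled by the coefficient (assumed formula)."""
    if coefficient == 0.0:
        return 0.0  # disabled, matching the documented default
    squared = [log_sum_exp(row) ** 2 for row in batch_logits]
    return coefficient * sum(squared) / len(squared)


logits = [[2.0, 0.5, -1.0], [0.1, 0.2, 0.3]]
print(z_loss(logits, coefficient=1e-4))
```

Under this formulation the penalty grows as the logits drift toward large magnitudes, which is why it is described as logit regularization.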
Gradient Handling
| Field | Type | Default | Description |
|---|---|---|---|
| grad_clip_norm | float | None | Maximum gradient norm for clipping. None disables clipping. |
| grad_checkpointing | bool | False | Enable gradient checkpointing (activation recomputation) to reduce memory usage. |
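The effect of grad_clip_norm can be sketched as global-norm clipping: compute one L2 norm over all gradients and rescale them if it exceeds the limit. The helper below is a plain-Python illustration with a hypothetical name, `clip_by_global_norm`. Whether the trainer clips by global norm exactly this way is an assumption, although it matches the usual meaning of a maximum gradient norm.

```python
import math
from typing import Optional


def clip_by_global_norm(
    grads: list[list[float]], max_norm: Optional[float]
) -> tuple[list[list[float]], float]:
    """Rescale all gradients so their combined L2 norm is at most max_norm.

    Returns the (possibly rescaled) gradients and the norm before clipping.
    """
    total_norm = math.sqrt(sum(g * g for tensor in grads for g in tensor))
    if max_norm is None or total_norm <= max_norm:
        # None disables clipping; small gradients pass through unchanged.
        return grads, total_norm
    scale = max_norm / total_norm
    return [[g * scale for g in tensor] for tensor in grads], total_norm


clipped, norm_before = clip_by_global_norm([[3.0], [4.0]], max_norm=1.0)
print(norm_before, clipped)
```

Because every gradient is multiplied by the same factor, the update direction is preserved and only its magnitude is limited.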
Compilation
| Field | Type | Default | Description |
|---|---|---|---|
| torchcompile | bool | False | Enable torch.compile for the model. |
Internal / Shared
| Field | Type | Default | Description |
|---|---|---|---|
| world_size | int | 1 | Shared. Auto-set from DistributedParams.world_size. |
Computed Properties
accum_freq
Gradient accumulation frequency, computed from batch sizing:
@property
def accum_freq(self):
    # Number of microbatches accumulated per optimizer step.
    return self.global_batch_size // (self.world_size * self.per_gpu_batch_size)
For example, with global_batch_size=512, world_size=8, and per_gpu_batch_size=8:
accum_freq = 512 // (8 * 8) = 8
This means 8 microbatch forward passes are accumulated before each optimizer step.
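The relationship between microbatches and optimizer steps can be checked with a toy loop. This is a sketch with hypothetical standalone functions, not the training loop from the codebase; the forward, backward, and optimizer calls are left as comments.

```python
def accum_freq(global_batch_size: int, world_size: int, per_gpu_batch_size: int) -> int:
    """Standalone version of the accum_freq property."""
    return global_batch_size // (world_size * per_gpu_batch_size)


def count_optimizer_steps(num_microbatches: int, freq: int) -> int:
    """Count how many optimizer steps a run of microbatches would trigger."""
    steps = 0
    for i in range(1, num_microbatches + 1):
        # The forward and backward pass for microbatch i would run here,
        # with gradients accumulating across iterations.
        if i % freq == 0:
            steps += 1  # optimizer.step() and zero_grad() would run here
    return steps


freq = accum_freq(global_batch_size=512, world_size=8, per_gpu_batch_size=8)
print(freq, count_optimizer_steps(32, freq))  # 8 4
```

With these numbers, each GPU processes 8 microbatches of 8 samples per optimizer step, and 8 GPUs together contribute the full 512 samples.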
Precision Modes
| Value | AMP | Pure BF16 | Description |
|---|---|---|---|
| "amp_bfloat16" / "amp_bf16" / "amp" | Yes | No | Automatic mixed precision with bfloat16. Recommended. |
| "pure_bf16" | No | Yes | All operations in bfloat16. Lower memory but may reduce stability. |
| "fp32" / "float32" | No | No | Full float32 precision. Highest memory usage. |
Choosing Precision
"amp_bfloat16" is the default and recommended setting. It provides a good balance of speed, memory, and numerical stability. Use "pure_bf16" only if you are memory-constrained and have validated training stability. Use "fp32" for debugging numerical issues.
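The table of precision modes can be read as a mapping from the precision string to two flags. The sketch below encodes that mapping; the function name `precision_flags` and the choice to raise on unknown values are assumptions, and the way the codebase actually dispatches on the string may differ.

```python
AMP_MODES = {"amp_bfloat16", "amp_bf16", "amp"}
PURE_BF16_MODES = {"pure_bf16"}
FP32_MODES = {"fp32", "float32"}


def precision_flags(precision: str) -> tuple[bool, bool]:
    """Return (use_amp, use_pure_bf16) for a precision string."""
    if precision in AMP_MODES:
        return True, False
    if precision in PURE_BF16_MODES:
        return False, True
    if precision in FP32_MODES:
        return False, False
    # Rejecting unknown values is an assumption of this sketch.
    raise ValueError(f"Unknown precision mode: {precision!r}")


for mode in ("amp_bfloat16", "pure_bf16", "fp32"):
    print(mode, precision_flags(mode))
```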
Validation
The following assertions are checked during construction:
- lr >= lr_cooldown_end: the cooldown end rate cannot exceed the peak learning rate.
- global_batch_size % (world_size * per_gpu_batch_size) == 0: the global batch must divide evenly into accumulation steps.
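As an illustration of how such checks might run at construction time, the sketch below places both assertions in a dataclass `__post_init__`. The class `HyperParamsSketch` is hypothetical and carries only the fields the checks need. Whether the real HyperParams is a dataclass, and its exact error messages, are assumptions.

```python
from dataclasses import dataclass


@dataclass
class HyperParamsSketch:
    """Hypothetical stand-in holding only the fields used by the checks."""

    lr: float = 1e-4
    lr_cooldown_end: float = 0.0
    global_batch_size: int = 512
    per_gpu_batch_size: int = 8
    world_size: int = 1

    def __post_init__(self) -> None:
        assert self.lr >= self.lr_cooldown_end, (
            "lr_cooldown_end cannot exceed the peak learning rate"
        )
        samples_per_pass = self.world_size * self.per_gpu_batch_size
        assert self.global_batch_size % samples_per_pass == 0, (
            "global_batch_size must be a multiple of world_size * per_gpu_batch_size"
        )


HyperParamsSketch(world_size=8)  # valid: 512 is a multiple of 64

try:
    HyperParamsSketch(global_batch_size=100, world_size=8)  # 100 % 64 != 0
except AssertionError as err:
    print("rejected:", err)
```

Failing at construction surfaces a misconfigured batch size before any GPU time is spent.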
Example Configurations
LLM Training (Cosine Schedule)
hparams:
  precision: "amp_bfloat16"
  lr: 1e-4
  lr_scheduler: "cosine"
  warmup: "1000"
  decay: "0.3"
  global_batch_size: 512
  per_gpu_batch_size: 8
  optimizer: "adamw"
  wd: 0.01
  loss_function: "cross_entropy"
Diffusion Policy (Cosine Schedule with Cooldown)
hparams:
  precision: "amp_bf16"
  lr: 5e-4
  lr_scheduler: "cosine"
  lr_cooldown_end: 1e-5
  grad_clip_norm: 1.0
  loss_function: "mse"
  global_batch_size: 128
  per_gpu_batch_size: 16