HyperParams¶

HyperParams controls all optimization-related settings: learning rate, scheduler, optimizer, precision, batch sizing, and gradient handling.

Source: vla_foundry/params/hyper_params.py

Fields¶

Precision¶

Field	Type	Default	Description
`precision`	`str`	`"amp_bfloat16"`	Training precision mode. See Precision Modes below.

Batch Sizing¶

Field	Type	Default	Description
`global_batch_size`	`int`	`512`	Total batch size across all GPUs and accumulation steps.
`per_gpu_batch_size`	`int`	`8`	Microbatch size per GPU per forward pass.

Randomness¶

Field	Type	Default	Description
`seed`	`int`	`42`	Random seed for reproducibility. Propagated to `DataParams` and other components.

Learning Rate¶

Field	Type	Default	Description
`lr`	`float`	`1e-4`	Peak learning rate.
`lr_scheduler`	`str`	`"cosine"`	Learning rate schedule. Options: `"cosine"`, `"const"`, `"linear"`.
`warmup`	`str`	`"1000"`	Warmup duration. Integer values are treated as step counts; decimal values (e.g., `"0.025"`) are treated as fractions of total training steps.
`decay`	`str`	`"0.3"`	Decay phase duration. Same format as `warmup`.
`lr_cooldown_end`	`float`	`0.0`	Learning rate at the end of the cooldown phase. Must be less than or equal to `lr`.
`force_min_lr`	`float`	`0.0`	Absolute minimum learning rate floor.

Optimizer¶

Field	Type	Default	Description
`optimizer`	`str`	`"adamw"`	Optimizer type.
`wd`	`float`	`0.01`	Weight decay.
`beta1`	`float`	`0.9`	Adam beta1 (first moment decay).
`beta2`	`float`	`0.95`	Adam beta2 (second moment decay).
`eps`	`float`	`1e-8`	Adam epsilon for numerical stability.

Loss¶

Field	Type	Default	Description
`loss_function`	`str`	`"cross_entropy"`	Loss function. Options include `"cross_entropy"`, `"mse"`.
`z_loss_coefficient`	`float`	`0.0`	Coefficient for the auxiliary z-loss (logit regularization). Set to `0.0` to disable.

Gradient Handling¶

Field	Type	Default	Description
`grad_clip_norm`	`float`	`None`	Maximum gradient norm for clipping. `None` disables clipping.
`grad_checkpointing`	`bool`	`False`	Enable gradient checkpointing (activation recomputation) to reduce memory usage.

Compilation¶

Field	Type	Default	Description
`torchcompile`	`bool`	`False`	Enable `torch.compile` for the model.

Internal / Shared¶

Field	Type	Default	Description
`world_size`	`int`	`1`	Shared. Auto-set from `DistributedParams.world_size`.

Computed Properties¶

`accum_freq`¶

Gradient accumulation frequency, computed from batch sizing:

@property
def accum_freq(self):
    return global_batch_size // (world_size * per_gpu_batch_size)

For example, with global_batch_size=512, world_size=8, and per_gpu_batch_size=8:

accum_freq = 512 // (8 * 8) = 8

This means 8 microbatch forward passes are accumulated before each optimizer step.

Precision Modes¶

Value	AMP	Pure BF16	Description
`"amp_bfloat16"` / `"amp_bf16"` / `"amp"`	Yes	No	Automatic mixed precision with bfloat16. Recommended.
`"pure_bf16"`	No	Yes	All operations in bfloat16. Lower memory but may reduce stability.
`"fp32"` / `"float32"`	No	No	Full float32 precision. Highest memory usage.

Choosing Precision

"amp_bfloat16" is the default and recommended setting. It provides a good balance of speed, memory, and numerical stability. Use "pure_bf16" only if you are memory-constrained and have validated training stability. Use "fp32" for debugging numerical issues.

Validation¶

The following assertions are checked during construction:

lr >= lr_cooldown_end --- the cooldown end rate cannot exceed the peak learning rate.
global_batch_size % (world_size * per_gpu_batch_size) == 0 --- the global batch must divide evenly into accumulation steps.

Example Configurations¶

LLM Training (Cosine Schedule)¶

hparams:
  precision: "amp_bfloat16"
  lr: 1e-4
  lr_scheduler: "cosine"
  warmup: "1000"
  decay: "0.3"
  global_batch_size: 512
  per_gpu_batch_size: 8
  optimizer: "adamw"
  wd: 0.01
  loss_function: "cross_entropy"

Diffusion Policy (Constant LR)¶

hparams:
  precision: "amp_bf16"
  lr: 5e-4
  lr_scheduler: "cosine"
  lr_cooldown_end: 1e-5
  grad_clip_norm: 1.0
  loss_function: "mse"
  global_batch_size: 128
  per_gpu_batch_size: 16