DataParams¶

DataParams is the polymorphic base class for all dataset configurations. Like ModelParams, it uses the draccus ChoiceRegistry: the concrete subclass is selected by the type field.

Source: vla_foundry/params/base_data_params.py (base), vla_foundry/params/data_params.py (subclasses)
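
The selection-by-type mechanism can be pictured with a small standard-library sketch. It does not use draccus, and the real ChoiceRegistry API may differ; `BaseParams`, `register`, `from_dict`, and the two subclasses are hypothetical names invented for illustration.

```python
from dataclasses import dataclass, field
from typing import ClassVar, Dict, List, Type


@dataclass
class BaseParams:
    """Hypothetical stand-in for DataParams with a registry keyed by `type`."""

    # Class-level registry shared by all subclasses (ClassVar, so not a dataclass field).
    _registry: ClassVar[Dict[str, Type["BaseParams"]]] = {}

    dataset_manifest: List[str] = field(default_factory=list)
    seq_len: int = 2048

    @classmethod
    def register(cls, key: str):
        """Decorator that records a subclass under the given type key."""
        def decorator(subclass):
            cls._registry[key] = subclass
            return subclass
        return decorator

    @classmethod
    def from_dict(cls, config: dict) -> "BaseParams":
        """Pick the subclass named by config['type'] and build it from the other keys."""
        options = dict(config)  # copy so the caller's dict is not mutated
        key = options.pop("type")
        if key not in cls._registry:
            raise ValueError(f"Unknown type {key!r}; known: {sorted(cls._registry)}")
        return cls._registry[key](**options)


@BaseParams.register("text")
@dataclass
class TextParams(BaseParams):
    pass


@BaseParams.register("text_untokenized")
@dataclass
class TextUntokenizedParams(BaseParams):
    tokenizer: str = "EleutherAI/gpt-neox-20b"


params = BaseParams.from_dict({"type": "text_untokenized", "seq_len": 1024})
print(type(params).__name__, params.tokenizer, params.seq_len)
# TextUntokenizedParams EleutherAI/gpt-neox-20b 1024
```

The point of the pattern is that a config file only names a type key; the loader then decides which dataclass, and therefore which extra fields, apply.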

Base Fields¶

Every DataParams subclass inherits these fields:

Field	Type	Default	Description
`type`	`str`	`None`	Registry key that selects the concrete subclass (`"text"`, `"text_untokenized"`, `"image_caption"`, `"robotics"`).
`dataset_manifest`	`List[str]`	`[]`	List of paths to dataset manifest files (local or S3). One entry per dataset.
`dataset_weighting`	`List[float]`	`[]`	Sampling weight for each dataset. Must have the same length as `dataset_manifest`.
`dataset_modality`	`List[str]`	`[]`	Modality label for each dataset (e.g., `"text"`, `"image_caption"`, `"robotics"`).
`val_dataset_manifest`	`List[str]`	`[]`	Manifest paths for validation datasets.
`val_dataset_weighting`	`List[float]`	`[]`	Sampling weights for validation datasets.
`allow_multiple_epochs`	`bool`	`False`	Allow the dataloader to loop over the dataset more than once. Required when using `num_epochs`.
`num_workers`	`Optional[int]`	`None`	Number of dataloader workers per GPU. If `None`, auto-calculated as `cpu_count // world_size`.
`prefetch_factor`	`int`	`4`	Number of batches to prefetch per worker (PyTorch DataLoader).
`seq_len`	`int`	`2048`	Sequence length for tokenized data.
`shuffle`	`bool`	`True`	Shuffle the dataset.
`shuffle_buffer_size`	`int`	`2000`	Size of the shuffle buffer (WebDataset streaming shuffle).
`shuffle_initial`	`int`	`500`	Initial shuffle buffer fill size.
`seed`	`int`	`42`	Shared random seed, inherited from `HyperParams.seed`.
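
Two of the rules in the table lend themselves to a short sketch: the length match between `dataset_weighting` and `dataset_manifest`, and the `num_workers` fallback of `cpu_count // world_size`. The helper names below are hypothetical, and normalizing the weights into probabilities is an assumption about how the weights might be consumed, not something this page states.

```python
import os
from typing import List, Optional


def sampling_probabilities(manifests: List[str], weights: List[float]) -> List[float]:
    """Check there is one weight per manifest, then scale the weights to sum to 1.

    Normalizing is an assumption made for illustration.
    """
    if len(manifests) != len(weights):
        raise ValueError(
            f"dataset_weighting has {len(weights)} entries, "
            f"but dataset_manifest has {len(manifests)}"
        )
    total = sum(weights)
    if total <= 0 or any(w < 0 for w in weights):
        raise ValueError("weights must be non-negative and sum to a positive value")
    return [w / total for w in weights]


def resolve_num_workers(
    num_workers: Optional[int], world_size: int, cpu_count: Optional[int] = None
) -> int:
    """Return the explicit value if set, otherwise cpu_count // world_size."""
    if num_workers is not None:
        return num_workers
    if cpu_count is None:
        cpu_count = os.cpu_count() or 1  # os.cpu_count() can return None
    return cpu_count // world_size


print(sampling_probabilities(["a.jsonl", "b.jsonl"], [3.0, 1.0]))  # [0.75, 0.25]
print(resolve_num_workers(None, world_size=4, cpu_count=32))       # 8
```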

TextDataParams¶

Type key: text

For pre-tokenized text datasets. Adds no fields beyond the base class.

data:
  type: text
  dataset_manifest: ["s3://my-bucket/text-data/manifest.jsonl"]
  dataset_modality: ["text"]
  dataset_weighting: [1.0]
  seq_len: 2048

TextUntokenizedDataParams¶

Type key: text_untokenized

For raw (untokenized) text datasets. Tokenization happens on the fly using the specified tokenizer.

Field	Type	Default	Description
`tokenizer`	`str`	`"EleutherAI/gpt-neox-20b"`	HuggingFace tokenizer identifier.

Computed property:

Property	Description
`pad_token_id`	The pad token ID from the loaded tokenizer. If the tokenizer has no pad token, `[PAD]` is added automatically.

data:
  type: text_untokenized
  tokenizer: "EleutherAI/gpt-neox-20b"
  dataset_manifest: ["s3://my-bucket/raw-text/manifest.jsonl"]
  dataset_modality: ["text"]
  dataset_weighting: [1.0]
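
The `pad_token_id` behavior described above can be sketched as follows. `get_pad_token_id` is a hypothetical helper, and `StubTokenizer` is a tiny stand-in that mimics only the three Hugging Face tokenizer members the helper touches (`pad_token`, `pad_token_id`, `add_special_tokens`), so the example runs without downloading anything.

```python
class StubTokenizer:
    """Stand-in exposing the few tokenizer members used below (illustration only)."""

    def __init__(self):
        self.vocab = {"hello": 0, "world": 1}
        self.pad_token = None  # mimics a tokenizer that ships without a pad token

    @property
    def pad_token_id(self):
        return None if self.pad_token is None else self.vocab[self.pad_token]

    def add_special_tokens(self, special_tokens: dict) -> int:
        """Register the pad token, growing the vocabulary if it is new."""
        token = special_tokens["pad_token"]
        added = 0
        if token not in self.vocab:
            self.vocab[token] = len(self.vocab)
            added = 1
        self.pad_token = token
        return added


def get_pad_token_id(tokenizer) -> int:
    """Return the pad token ID, adding "[PAD]" first if the tokenizer has none."""
    if tokenizer.pad_token is None:
        tokenizer.add_special_tokens({"pad_token": "[PAD]"})
    return tokenizer.pad_token_id


tokenizer = StubTokenizer()
print(get_pad_token_id(tokenizer))  # 2
```

With a real tokenizer, adding a token grows the vocabulary, so a model paired with it may need its embedding table resized to match.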

ImageCaptionDataParams¶

Type key: image_caption

For image-caption paired datasets, typically used for VLM training.

Field	Type	Default	Description
`processor`	`str`	`"google/paligemma-3b-pt-224"`	HuggingFace processor identifier for image and text preprocessing.
`img_num_tokens`	`int`	`256`	Number of image tokens in the sequence.
`image_size`	`int`	`224`	Input image resolution in pixels.
`augmentation`	`DataAugmentationParams`	`DataAugmentationParams()`	Image augmentation configuration.

Computed properties:

Property	Description
`image_token_id`	The image token ID from the loaded processor.
`pad_token_id`	The pad token ID from the processor's tokenizer.

data:
  type: image_caption
  processor: "google/paligemma-3b-pt-224"
  img_num_tokens: 256
  image_size: 224
  dataset_manifest: ["s3://my-bucket/image-caption/manifest.jsonl"]
  dataset_modality: ["image_caption"]
  dataset_weighting: [1.0]
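
The defaults `image_size: 224` and `img_num_tokens: 256` are consistent with a vision encoder that emits one token per square patch. The sketch below shows that arithmetic; the patch size of 14 is an assumption about the default processor's vision tower and is not stated on this page, and `num_image_tokens` is a hypothetical helper.

```python
def num_image_tokens(image_size: int, patch_size: int) -> int:
    """Tokens produced by a patch-based encoder on a square image, one per patch."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be a multiple of patch_size")
    patches_per_side = image_size // patch_size
    return patches_per_side ** 2


# Assumed patch size of 14: 224 / 14 = 16 patches per side, 16 * 16 = 256 tokens.
print(num_image_tokens(224, 14))  # 256
```

If you switch to a processor with a different resolution or patch size, `img_num_tokens` and `image_size` likely need to change together.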

RoboticsDataParams¶

Type key: robotics

The most feature-rich data params subclass. Configures camera inputs, proprioception/action field mappings, normalization, augmentation, and temporal windowing for robotics policy training.

Source: vla_foundry/params/data_params.py

Dataset and Processor¶

Field	Type	Default	Description
`dataset_statistics`	`list[str]`	`[]`	Paths to dataset statistics JSON files (one per manifest). Required for normalization.
`val_dataset_statistics`	`list[str]`	`[]`	Paths to validation dataset statistics.
`processor`	`str`	`None`	HuggingFace processor identifier for image preprocessing.
`img_num_tokens`	`int`	`256`	Number of image tokens.
`image_size`	`int`	`224`	Input image resolution.

Camera Configuration¶

Field	Type	Default	Description
`camera_names`	`list[str]`	`[]`	Camera names (e.g., `["scene_right_0", "wrist_left_plus"]`). Auto-detected from preprocessing config if empty.
`image_indices`	`list[int]`	`[]`	Temporal indices for images (e.g., `[-1, 0]` for previous and current frame). Auto-detected if empty.
`image_names`	`list[str]`	`[]`	Computed from `camera_names` and `image_indices` (e.g., `"scene_right_0_t-1"`).
`pad_missing_images`	`bool`	`False`	Pad missing camera images with zeros instead of erroring.
`mask_padded_images`	`bool`	`False`	Provide a mask indicating which images were padded. Requires `pad_missing_images`.
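
As a rough illustration of how `image_names` could be derived, the sketch below combines each camera name with each temporal index. The `_t{index}` suffix matches the `"scene_right_0_t-1"` example above; the spelling for index 0 and the camera-major ordering are assumptions, and both helper names are hypothetical.

```python
from typing import List


def build_image_names(camera_names: List[str], image_indices: List[int]) -> List[str]:
    """Pair every camera with every temporal index, camera-major (assumed order)."""
    return [
        f"{camera}_t{index}"
        for camera in camera_names
        for index in image_indices
    ]


def check_padding_flags(pad_missing_images: bool, mask_padded_images: bool) -> None:
    """Enforce that a padding mask is only requested when padding is enabled."""
    if mask_padded_images and not pad_missing_images:
        raise ValueError("mask_padded_images requires pad_missing_images")


print(build_image_names(["scene_right_0", "wrist_left_plus"], [-1, 0]))
# ['scene_right_0_t-1', 'scene_right_0_t0', 'wrist_left_plus_t-1', 'wrist_left_plus_t0']
```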

Field Definitions¶

Field	Type	Default	Description
`proprioception_fields`	`list[str]`	`[]`	Names of proprioception fields from the dataset (e.g., joint positions, gripper state).
`action_fields`	`list[str]`	`[]`	Names of action fields from the dataset (e.g., relative poses, gripper commands).
`pose_groups`	`list[Dict[str, str]]`	`[]`	Groups of position/rotation fields for relative coordinate transforms.
`intrinsics_fields`	`list[str]`	`[]`	Camera intrinsics field names.
`extrinsics_fields`	`list[str]`	`[]`	Camera extrinsics field names.

Normalization and Augmentation¶

Field	Type	Default	Description
`normalization`	`NormalizationParams`	`NormalizationParams()`	Global and per-field normalization configuration.
`augmentation`	`DataAugmentationParams`	`DataAugmentationParams()`	Image augmentation settings.

Temporal Configuration¶

Field	Type	Default	Description
`lowdim_past_timesteps`	`Optional[int]`	`None`	Number of past observation timesteps. Falls back to `normalization.lowdim_past_timesteps`.
`lowdim_future_timesteps`	`Optional[int]`	`None`	Number of future action timesteps. Falls back to `normalization.lowdim_future_timesteps`.
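
The fallback order for these two fields can be sketched as below. `resolve_timesteps` is a hypothetical helper, and raising an error when neither value is set is an assumption; this page does not say what happens in that case.

```python
from typing import Optional


def resolve_timesteps(
    explicit: Optional[int], normalization_value: Optional[int], name: str
) -> int:
    """Prefer the value set on the data params, else the one under normalization."""
    # Compare against None rather than truthiness so an explicit 0 is respected.
    if explicit is not None:
        return explicit
    if normalization_value is not None:
        return normalization_value
    raise ValueError(f"{name} is not set on the data params or under normalization")


print(resolve_timesteps(None, 14, "lowdim_future_timesteps"))  # 14
print(resolve_timesteps(0, 14, "lowdim_future_timesteps"))     # 0
```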

Computed Dimensions¶

Field	Type	Default	Description
`action_dim`	`int`	`None`	Auto-computed by summing dimensions of all `action_fields` from normalization statistics.
`proprioception_dim`	`Optional[int]`	`None`	Auto-computed by summing dimensions of all `proprioception_fields` from normalization statistics.

Language Configuration¶

Field	Type	Default	Description
`language_instruction_types`	`list[str]`	`["original"]`	Which language instruction variants to use. Valid values: `"original"`, `"randomized"`, `"verbose"`, `"alternative"`.

Automatic Dimension Computation

action_dim and proprioception_dim are computed during __post_init__ by loading the dataset statistics file and summing the dimensions of each named field. You generally do not need to set these manually. If you do set them, the values are validated against the computed values.
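
The computation described here might look roughly like the following. The layout of the statistics (a mapping from field name to a dict with a `"mean"` list) is an assumption made for illustration, as are the per-field sizes and the `compute_dim` helper; the real statistics files may be structured differently. An in-memory dict stands in for the loaded JSON to keep the example self-contained.

```python
from typing import Dict, List, Optional


def compute_dim(
    fields: List[str],
    stats: Dict[str, Dict[str, List[float]]],
    declared: Optional[int] = None,
) -> int:
    """Sum the per-field dimensions and validate any manually declared value."""
    missing = [name for name in fields if name not in stats]
    if missing:
        raise KeyError(f"fields missing from statistics: {missing}")
    # Assumed layout: the length of each field's "mean" vector is its dimension.
    computed = sum(len(stats[name]["mean"]) for name in fields)
    if declared is not None and declared != computed:
        raise ValueError(f"declared dim {declared} != computed dim {computed}")
    return computed


# Illustrative statistics: 3 position values, 6 rotation values, 1 gripper value.
stats = {
    "robot__action__poses__left::panda__xyz_relative": {"mean": [0.0] * 3},
    "robot__action__poses__left::panda__rot_6d_relative": {"mean": [0.0] * 6},
    "robot__action__grippers__left::panda_hand": {"mean": [0.0]},
}
action_fields = list(stats)

print(compute_dim(action_fields, stats))  # 10
```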

Example YAML¶

data:
  type: robotics
  dataset_manifest:
    - s3://my-bucket/dataset/shards/manifest.jsonl
  dataset_statistics:
    - s3://my-bucket/dataset/shards/stats.json
  dataset_modality:
    - robotics
  dataset_weighting:
    - 1.0
  camera_names:
    - scene_right_0
    - scene_left_0
    - wrist_left_plus
    - wrist_right_minus
  image_indices:
    - -1
    - 0
  proprioception_fields:
    - robot__actual__poses__left::panda__xyz
    - robot__actual__poses__left::panda__rot_6d
    - robot__actual__grippers__left::panda_hand
  action_fields:
    - robot__action__poses__left::panda__xyz_relative
    - robot__action__poses__left::panda__rot_6d_relative
    - robot__action__grippers__left::panda_hand
  normalization:
    enabled: true
    method: percentile_1_99
    scope: global
    epsilon: 1e-2
    centered_norm: true
  augmentation:
    image:
      crop:
        enabled: true
        shape: [224, 224]
        mode: center
  image_size: 224
  processor: "openai/clip-vit-base-patch32"
  allow_multiple_epochs: true
  lowdim_past_timesteps: 1
  lowdim_future_timesteps: 14
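
To give a feel for what a setting like `method: percentile_1_99` with `epsilon: 1e-2` might do, here is one common form of percentile normalization: each dimension is rescaled so that its 1st and 99th percentiles land near -1 and 1. This is an assumed formula, not taken from vla_foundry; in particular, how the library applies `epsilon`, `scope`, and `centered_norm` is not described on this page, and `percentile_normalize` is a hypothetical helper.

```python
from typing import List


def percentile_normalize(
    values: List[float],
    p01: List[float],
    p99: List[float],
    epsilon: float = 1e-2,
) -> List[float]:
    """Map each dimension so that [p01, p99] spans roughly [-1, 1].

    epsilon is added to the half-range so constant dimensions
    (p01 == p99) do not cause a division by zero.
    """
    normalized = []
    for x, low, high in zip(values, p01, p99):
        center = (high + low) / 2.0
        half_range = (high - low) / 2.0
        normalized.append((x - center) / (half_range + epsilon))
    return normalized


# With epsilon set to 0, the percentile bounds map exactly to -1 and 1.
print(percentile_normalize([1.0, -1.0, 0.0], [-1.0] * 3, [1.0] * 3, epsilon=0.0))
# [1.0, -1.0, 0.0]
```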