DataParams¶
DataParams is the polymorphic base class for all dataset configurations. Like ModelParams, it uses the draccus ChoiceRegistry mechanism: the concrete subclass is selected by the `type` field.
Source: vla_foundry/params/base_data_params.py (base), vla_foundry/params/data_params.py (subclasses)
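The selection mechanism can be sketched as a plain type-keyed registry. This is an illustrative stand-in, not the actual draccus or vla_foundry internals; all class and method names below are hypothetical.

```python
from dataclasses import dataclass, field
from typing import ClassVar, Dict, List, Type


@dataclass
class DataParams:
    """Illustrative stand-in for the real polymorphic base class."""
    _registry: ClassVar[Dict[str, Type["DataParams"]]] = {}
    dataset_manifest: List[str] = field(default_factory=list)
    dataset_weighting: List[float] = field(default_factory=list)

    @classmethod
    def register(cls, key: str):
        """Decorator that maps a type key to a concrete subclass."""
        def wrap(subclass: Type["DataParams"]) -> Type["DataParams"]:
            DataParams._registry[key] = subclass
            return subclass
        return wrap

    @classmethod
    def from_config(cls, cfg: dict) -> "DataParams":
        # The "type" key selects the concrete subclass; the remaining
        # keys become that subclass's constructor arguments.
        params = dict(cfg)
        subclass = DataParams._registry[params.pop("type")]
        return subclass(**params)


@DataParams.register("text")
@dataclass
class TextDataParams(DataParams):
    seq_len: int = 2048


params = DataParams.from_config(
    {"type": "text", "dataset_manifest": ["s3://bucket/manifest.jsonl"], "seq_len": 1024}
)
print(type(params).__name__, params.seq_len)  # TextDataParams 1024
```

In the real code, draccus performs this dispatch automatically when parsing a YAML config, so a `type: text` entry yields a TextDataParams instance without any manual lookup.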
Base Fields¶
Every DataParams subclass inherits these fields:
| Field | Type | Default | Description |
|---|---|---|---|
| `type` | `str` | `None` | Registry key that selects the concrete subclass (`"text"`, `"text_untokenized"`, `"image_caption"`, `"robotics"`). |
| `dataset_manifest` | `List[str]` | `[]` | List of paths to dataset manifest files (local or S3). One entry per dataset. |
| `dataset_weighting` | `List[float]` | `[]` | Sampling weight for each dataset. Must have the same length as `dataset_manifest`. |
| `dataset_modality` | `List[str]` | `[]` | Modality label for each dataset (e.g., `"text"`, `"image_caption"`, `"robotics"`). |
| `val_dataset_manifest` | `List[str]` | `[]` | Manifest paths for validation datasets. |
| `val_dataset_weighting` | `List[float]` | `[]` | Sampling weights for validation datasets. |
| `allow_multiple_epochs` | `bool` | `False` | Allow the dataloader to loop over the dataset more than once. Required when using `num_epochs`. |
| `num_workers` | `Optional[int]` | `None` | Number of dataloader workers per GPU. If `None`, auto-calculated as `cpu_count // world_size`. |
| `prefetch_factor` | `int` | `4` | Number of batches to prefetch per worker (PyTorch `DataLoader`). |
| `seq_len` | `int` | `2048` | Sequence length for tokenized data. |
| `shuffle` | `bool` | `True` | Shuffle the dataset. |
| `shuffle_buffer_size` | `int` | `2000` | Size of the shuffle buffer (WebDataset streaming shuffle). |
| `shuffle_initial` | `int` | `500` | Initial shuffle buffer fill size. |
| `seed` | `int` | `42` | Shared random seed, inherited from `HyperParams.seed`. |
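Two of the rules above are easy to get wrong in hand-written configs: the per-dataset lists must stay aligned, and `num_workers: null` resolves to a machine-dependent value. A minimal sketch of both (the helper names are illustrative, not the actual validation code):

```python
import os
from typing import List, Optional


def validate_dataset_lists(manifests: List[str], weights: List[float]) -> None:
    # dataset_weighting must have exactly one entry per dataset_manifest entry.
    if len(weights) != len(manifests):
        raise ValueError(
            f"dataset_weighting has {len(weights)} entries, "
            f"expected {len(manifests)} (one per manifest)"
        )


def resolve_num_workers(num_workers: Optional[int], world_size: int) -> int:
    # If num_workers is None, fall back to cpu_count // world_size,
    # i.e. split the host's cores evenly across GPU ranks.
    if num_workers is not None:
        return num_workers
    return (os.cpu_count() or 1) // world_size


validate_dataset_lists(["s3://bucket/a.jsonl", "s3://bucket/b.jsonl"], [0.7, 0.3])
print(resolve_num_workers(None, world_size=8))  # machine-dependent
```

Note that the auto-calculated worker count can be zero on small machines with a large `world_size`, which silently disables background loading; setting `num_workers` explicitly sidesteps that.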
TextDataParams¶
Type key: text
For pre-tokenized text datasets. Adds no fields beyond the base class.
```yaml
data:
  type: text
  dataset_manifest: ["s3://my-bucket/text-data/manifest.jsonl"]
  dataset_modality: ["text"]
  dataset_weighting: [1.0]
  seq_len: 2048
```
TextUntokenizedDataParams¶
Type key: text_untokenized
For raw (untokenized) text datasets. Tokenization happens on-the-fly using the specified tokenizer.
| Field | Type | Default | Description |
|---|---|---|---|
| `tokenizer` | `str` | `"EleutherAI/gpt-neox-20b"` | HuggingFace tokenizer identifier. |
Computed property:
| Property | Description |
|---|---|
| `pad_token_id` | The pad token ID from the loaded tokenizer. If the tokenizer has no pad token, `[PAD]` is added automatically. |
```yaml
data:
  type: text_untokenized
  tokenizer: "EleutherAI/gpt-neox-20b"
  dataset_manifest: ["s3://my-bucket/raw-text/manifest.jsonl"]
  dataset_modality: ["text"]
  dataset_weighting: [1.0]
```
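The `pad_token_id` fallback described above can be sketched with a stub tokenizer; with a real HuggingFace tokenizer the same logic goes through `add_special_tokens`. Everything below is an illustrative stand-in, not the actual property implementation.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class StubTokenizer:
    """Minimal stand-in for a HuggingFace tokenizer's pad-token surface."""
    vocab: Dict[str, int] = field(default_factory=dict)
    pad_token: Optional[str] = None

    @property
    def pad_token_id(self) -> Optional[int]:
        return self.vocab.get(self.pad_token) if self.pad_token else None

    def add_special_tokens(self, mapping: Dict[str, str]) -> None:
        token = mapping["pad_token"]
        self.vocab[token] = len(self.vocab)  # append at the end of the vocab
        self.pad_token = token


def resolve_pad_token_id(tokenizer: StubTokenizer) -> int:
    # If the tokenizer has no pad token, add [PAD] automatically.
    if tokenizer.pad_token is None:
        tokenizer.add_special_tokens({"pad_token": "[PAD]"})
    return tokenizer.pad_token_id


tok = StubTokenizer(vocab={"hello": 0, "world": 1})
print(resolve_pad_token_id(tok))  # 2 -- "[PAD]" appended after the two existing tokens
```

If the model's embedding table was sized to the original vocabulary, adding `[PAD]` this way requires resizing the embeddings to match; keep that in mind when the tokenizer lacks a pad token.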
ImageCaptionDataParams¶
Type key: image_caption
For image-caption paired datasets, typically used for VLM training.
| Field | Type | Default | Description |
|---|---|---|---|
| `processor` | `str` | `"google/paligemma-3b-pt-224"` | HuggingFace processor identifier for image and text preprocessing. |
| `img_num_tokens` | `int` | `256` | Number of image tokens in the sequence. |
| `image_size` | `int` | `224` | Input image resolution in pixels. |
| `augmentation` | `DataAugmentationParams` | `DataAugmentationParams()` | Image augmentation configuration. |
Computed properties:
| Property | Description |
|---|---|
| `image_token_id` | The image token ID from the loaded processor. |
| `pad_token_id` | The pad token ID from the processor's tokenizer. |
```yaml
data:
  type: image_caption
  processor: "google/paligemma-3b-pt-224"
  img_num_tokens: 256
  image_size: 224
  dataset_manifest: ["s3://my-bucket/image-caption/manifest.jsonl"]
  dataset_modality: ["image_caption"]
  dataset_weighting: [1.0]
```
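The `img_num_tokens` default is not arbitrary: for ViT-style encoders it equals `(image_size / patch_size)²`, since each non-overlapping patch becomes one token. PaliGemma's SigLIP encoder uses 14-pixel patches, so 224-pixel inputs yield 256 tokens. A quick check (note that patch size is a property of the chosen processor's vision encoder, not a config field here):

```python
def vit_num_tokens(image_size: int, patch_size: int) -> int:
    # A ViT splits the image into a grid of non-overlapping patches;
    # each patch becomes one token in the sequence.
    assert image_size % patch_size == 0, "image must tile evenly into patches"
    return (image_size // patch_size) ** 2


print(vit_num_tokens(image_size=224, patch_size=14))  # 256
```

If you change `image_size` or swap in a processor with a different patch size, `img_num_tokens` must be updated to match, or the image placeholder count in the sequence will disagree with what the encoder emits.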
RoboticsDataParams¶
Type key: robotics
The most feature-rich data params subclass. Configures camera inputs, proprioception/action field mappings, normalization, augmentation, and temporal windowing for robotics policy training.
Source: vla_foundry/params/data_params.py
Dataset and Processor¶
| Field | Type | Default | Description |
|---|---|---|---|
| `dataset_statistics` | `list[str]` | `[]` | Paths to dataset statistics JSON files (one per manifest). Required for normalization. |
| `val_dataset_statistics` | `list[str]` | `[]` | Paths to validation dataset statistics. |
| `processor` | `str` | `None` | HuggingFace processor identifier for image preprocessing. |
| `img_num_tokens` | `int` | `256` | Number of image tokens. |
| `image_size` | `int` | `224` | Input image resolution. |
Camera Configuration¶
| Field | Type | Default | Description |
|---|---|---|---|
| `camera_names` | `list[str]` | `[]` | Camera names (e.g., `["scene_right_0", "wrist_left_plus"]`). Auto-detected from preprocessing config if empty. |
| `image_indices` | `list[int]` | `[]` | Temporal indices for images (e.g., `[-1, 0]` for previous and current frame). Auto-detected if empty. |
| `image_names` | `list[str]` | `[]` | Computed from `camera_names` and `image_indices` (e.g., `"scene_right_0_t-1"`). |
| `pad_missing_images` | `bool` | `False` | Pad missing camera images with zeros instead of erroring. |
| `mask_padded_images` | `bool` | `False` | Provide a mask indicating which images were padded. Requires `pad_missing_images`. |
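The `image_names` computation is a cross product of `camera_names` and `image_indices`. The exact suffix convention lives in `vla_foundry/params/data_params.py`; the `_t{index}` format below is inferred from the `"scene_right_0_t-1"` example above, so treat it as a sketch:

```python
from typing import List


def compute_image_names(camera_names: List[str], image_indices: List[int]) -> List[str]:
    # One entry per (camera, temporal index) pair, e.g. "scene_right_0_t-1"
    # for the previous frame of the "scene_right_0" camera.
    return [f"{cam}_t{idx}" for cam in camera_names for idx in image_indices]


print(compute_image_names(["scene_right_0", "wrist_left_plus"], [-1, 0]))
# ['scene_right_0_t-1', 'scene_right_0_t0', 'wrist_left_plus_t-1', 'wrist_left_plus_t0']
```

With four cameras and two temporal indices, as in the example YAML below, this yields eight image slots per sample.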
Field Definitions¶
| Field | Type | Default | Description |
|---|---|---|---|
| `proprioception_fields` | `list[str]` | `[]` | Names of proprioception fields from the dataset (e.g., joint positions, gripper state). |
| `action_fields` | `list[str]` | `[]` | Names of action fields from the dataset (e.g., relative poses, gripper commands). |
| `pose_groups` | `list[Dict[str, str]]` | `[]` | Groups of position/rotation fields for relative coordinate transforms. |
| `intrinsics_fields` | `list[str]` | `[]` | Camera intrinsics field names. |
| `extrinsics_fields` | `list[str]` | `[]` | Camera extrinsics field names. |
Normalization and Augmentation¶
| Field | Type | Default | Description |
|---|---|---|---|
| `normalization` | `NormalizationParams` | `NormalizationParams()` | Global and per-field normalization configuration. |
| `augmentation` | `DataAugmentationParams` | `DataAugmentationParams()` | Image augmentation settings. |
Temporal Configuration¶
| Field | Type | Default | Description |
|---|---|---|---|
| `lowdim_past_timesteps` | `Optional[int]` | `None` | Number of past observation timesteps. Falls back to `normalization.lowdim_past_timesteps`. |
| `lowdim_future_timesteps` | `Optional[int]` | `None` | Number of future action timesteps. Falls back to `normalization.lowdim_future_timesteps`. |
Computed Dimensions¶
| Field | Type | Default | Description |
|---|---|---|---|
| `action_dim` | `int` | `None` | Auto-computed by summing dimensions of all `action_fields` from normalization statistics. |
| `proprioception_dim` | `Optional[int]` | `None` | Auto-computed by summing dimensions of all `proprioception_fields` from normalization statistics. |
Language Configuration¶
| Field | Type | Default | Description |
|---|---|---|---|
| `language_instruction_types` | `list[str]` | `["original"]` | Which language instruction variants to use. Valid values: `"original"`, `"randomized"`, `"verbose"`, `"alternative"`. |
Automatic Dimension Computation
action_dim and proprioception_dim are computed during __post_init__ by loading the dataset statistics file and summing the dimensions of each named field. You generally do not need to set these manually. If you do set them, the values are validated against the computed values.
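A hedged sketch of that computation. The real statistics-file schema may differ; here each field is assumed to carry a per-dimension `"mean"` vector whose length gives the field's dimensionality:

```python
from typing import Dict, List


def compute_total_dim(stats: Dict[str, dict], fields: List[str]) -> int:
    # Sum the dimensionality of each named field; the length of a field's
    # per-dimension "mean" vector is taken as its dimensionality.
    return sum(len(stats[name]["mean"]) for name in fields)


# Toy statistics in the assumed schema (a real stats.json is loaded from disk or S3).
stats = {
    "robot__actual__poses__left::panda__xyz": {"mean": [0.0, 0.0, 0.0]},
    "robot__actual__poses__left::panda__rot_6d": {"mean": [0.0] * 6},
    "robot__actual__grippers__left::panda_hand": {"mean": [0.0]},
}
proprio_fields = list(stats)
print(compute_total_dim(stats, proprio_fields))  # 10  (3 + 6 + 1)
```

For the example field lists below, this gives a 10-dimensional proprioception vector (xyz + 6D rotation + gripper) and, analogously, a 10-dimensional action vector.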
Example YAML¶
```yaml
data:
  type: robotics
  dataset_manifest:
    - s3://my-bucket/dataset/shards/manifest.jsonl
  dataset_statistics:
    - s3://my-bucket/dataset/shards/stats.json
  dataset_modality:
    - robotics
  dataset_weighting:
    - 1.0
  camera_names:
    - scene_right_0
    - scene_left_0
    - wrist_left_plus
    - wrist_right_minus
  image_indices:
    - -1
    - 0
  proprioception_fields:
    - robot__actual__poses__left::panda__xyz
    - robot__actual__poses__left::panda__rot_6d
    - robot__actual__grippers__left::panda_hand
  action_fields:
    - robot__action__poses__left::panda__xyz_relative
    - robot__action__poses__left::panda__rot_6d_relative
    - robot__action__grippers__left::panda_hand
  normalization:
    enabled: true
    method: percentile_1_99
    scope: global
    epsilon: 1e-2
    centered_norm: true
  augmentation:
    image:
      crop:
        enabled: true
        shape: [224, 224]
        mode: center
  image_size: 224
  processor: "openai/clip-vit-base-patch32"
  allow_multiple_epochs: true
  lowdim_past_timesteps: 1
  lowdim_future_timesteps: 14
```
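The `method: percentile_1_99` / `centered_norm: true` combination above suggests mapping each value's 1st-to-99th percentile range onto `[-1, 1]`. The sketch below is a guess at that mapping, including a plausible use of `epsilon` to guard degenerate ranges; the authoritative implementation is in the normalization code:

```python
def centered_percentile_norm(x: float, p1: float, p99: float, epsilon: float = 1e-2) -> float:
    # Map [p1, p99] linearly onto [-1, 1]; epsilon guards against a
    # degenerate (near-constant) percentile range.
    span = max(p99 - p1, epsilon)
    return 2.0 * (x - p1) / span - 1.0


print(centered_percentile_norm(0.5, p1=0.0, p99=1.0))  # 0.0
print(centered_percentile_norm(0.0, p1=0.0, p99=1.0))  # -1.0
print(centered_percentile_norm(1.0, p1=0.0, p99=1.0))  # 1.0
```

Percentile-based bounds make the normalization robust to outlier actions in the statistics, at the cost of letting extreme values fall slightly outside `[-1, 1]`.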