Data Format

VLA Foundry stores training data as WebDataset tar shards indexed by a manifest.jsonl file. This format enables high-throughput streaming from local disk or S3, efficient shuffling at the shard level, and straightforward dataset mixing.

Directory Layout

A prepared dataset follows this structure:

my_dataset/
  shards/
    shard-000000.tar
    shard-000001.tar
    shard-000002.tar
    ...
    shard-000099.tar
  manifest.jsonl
  • shards/ -- A directory of tar files. Each tar file contains one or more training samples.
  • manifest.jsonl -- A line-delimited JSON file that indexes the shards.

S3 hosting (recommended)

Hosting datasets on S3 is the recommended approach for multi-node training. The WebDataset library streams shards directly from S3 via pipe, so no local copy is needed. The manifest itself is fetched once at the start of each checkpoint window.
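As a rough illustration of streaming "via pipe": WebDataset accepts shard URLs that begin with `pipe:`, runs the command that follows, and reads the tar stream from its standard output. The sketch below builds such a URL for an S3 object. It assumes the AWS CLI is installed on the training nodes, and the helper name `s3_to_pipe_url` is hypothetical; VLA Foundry may construct its URLs differently.

```python
import shlex


def s3_to_pipe_url(s3_uri: str) -> str:
    """Wrap an S3 URI in a WebDataset `pipe:` URL (hypothetical helper)."""
    if not s3_uri.startswith("s3://"):
        raise ValueError(f"expected an s3:// URI, got {s3_uri!r}")
    # `aws s3 cp <uri> -` writes the object to stdout instead of to a file.
    # shlex.quote protects URIs that contain shell-special characters.
    return f"pipe:aws s3 cp {shlex.quote(s3_uri)} -"


url = s3_to_pipe_url("s3://my-bucket/dataset_a/shards/shard-000000.tar")
print(url)
```

A URL of this form can be passed wherever WebDataset expects a shard location, so no shard is ever written to local disk.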

Manifest Format

manifest.jsonl contains one JSON object per line. Each object describes a single shard.

{"shard": "shards/shard-000000.tar", "num_sequences": 512}
{"shard": "shards/shard-000001.tar", "num_sequences": 512}
{"shard": "shards/shard-000002.tar", "num_sequences": 497}

| Field | Type | Description |
|---|---|---|
| shard | string | Relative path from the manifest to the tar file |
| num_sequences | int | Number of training samples (sequences) in this shard |

The num_sequences counts are used by the dataloader to calculate how many shards to select for a given sample budget, and to set the correct epoch length for the WebLoader.
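To make the role of num_sequences concrete, here is a minimal sketch of reading a manifest and picking enough shards to cover a sample budget. The function names `parse_manifest` and `select_shards` are illustrative assumptions, not the actual VLA Foundry API, and the real dataloader may select shards in a different order.

```python
import json

MANIFEST_TEXT = """\
{"shard": "shards/shard-000000.tar", "num_sequences": 512}
{"shard": "shards/shard-000001.tar", "num_sequences": 512}
{"shard": "shards/shard-000002.tar", "num_sequences": 497}
"""


def parse_manifest(text):
    """Parse line-delimited JSON, skipping blank lines."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def select_shards(entries, sample_budget):
    """Take shards in order until their combined num_sequences covers the budget."""
    selected, total = [], 0
    for entry in entries:
        if total >= sample_budget:
            break
        selected.append(entry["shard"])
        total += entry["num_sequences"]
    return selected, total


entries = parse_manifest(MANIFEST_TEXT)
shards, total = select_shards(entries, sample_budget=1000)
print(shards, total)
```

Because shards are selected whole, the number of samples actually delivered can overshoot the budget by up to one shard's worth.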

Note

The shard path in the manifest is relative to the directory containing manifest.jsonl. When the dataset is hosted on S3, the dataloader resolves the full URI automatically by joining the manifest's base path with the relative shard path.
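The path resolution described in this note can be sketched as a join between the manifest's parent directory and the relative shard path. The helper name `resolve_shard_path` is hypothetical; it only demonstrates the joining rule and works the same way for local paths and S3 URIs.

```python
import posixpath


def resolve_shard_path(manifest_path: str, shard: str) -> str:
    """Join the manifest's directory with a relative shard path (hypothetical helper)."""
    # posixpath keeps forward slashes on every platform, which S3 URIs require.
    base = posixpath.dirname(manifest_path)
    return posixpath.join(base, shard)


s3_shard = resolve_shard_path(
    "s3://my-bucket/dataset_a/manifest.jsonl", "shards/shard-000000.tar"
)
local_shard = resolve_shard_path("my_dataset/manifest.jsonl", "shards/shard-000000.tar")
print(s3_shard)
print(local_shard)
```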

Sample Structure Within Tar Files

Each tar file is a standard POSIX tar archive. Samples within a tar are distinguished by a shared filename prefix (the sample key). Different extensions map to different fields of the sample.

shard-000000.tar
  sample_000000.input_ids.pth
  sample_000000.labels.pth
  sample_000000.attention_mask.pth
  sample_000001.input_ids.pth
  sample_000001.labels.pth
  sample_000001.attention_mask.pth
  ...

WebDataset groups files by their shared prefix and delivers each group as a Python dictionary:

{
    "__key__": "sample_000000",
    "input_ids.pth": <tensor>,
    "labels.pth": <tensor>,
    "attention_mask.pth": <tensor>,
}
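The grouping rule can be reproduced with the standard library alone. The sketch below is a simplified stand-in for what WebDataset does, not its actual implementation: it builds a tiny tar in memory, splits each file name at the first dot, and collects consecutive files that share a key. It assumes a sample's files are stored next to each other in the archive and leaves the payloads as raw bytes rather than decoding them into tensors.

```python
import io
import tarfile


def build_demo_tar() -> bytes:
    """Create an in-memory tar with two samples, each with two fields."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for key in ("sample_000000", "sample_000001"):
            for ext in ("input_ids.pth", "labels.pth"):
                payload = f"{key}:{ext}".encode()  # placeholder bytes, not a real tensor
                info = tarfile.TarInfo(name=f"{key}.{ext}")
                info.size = len(payload)
                tar.addfile(info, io.BytesIO(payload))
    return buf.getvalue()


def group_samples(tar_bytes: bytes):
    """Yield one dict per sample key, assuming a sample's files are adjacent."""
    current = None
    with tarfile.open(fileobj=io.BytesIO(tar_bytes), mode="r") as tar:
        for member in tar:
            if not member.isfile():
                continue
            # Everything before the first dot is the key; the rest names the field.
            key, _, ext = member.name.partition(".")
            if current is None or current["__key__"] != key:
                if current is not None:
                    yield current
                current = {"__key__": key}
            current[ext] = tar.extractfile(member).read()
    if current is not None:
        yield current


samples = list(group_samples(build_demo_tar()))
for sample in samples:
    print(sample["__key__"], sorted(k for k in sample if k != "__key__"))
```

The adjacency assumption is why every file belonging to one sample should be written to the tar consecutively when preparing shards.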

The exact set of extensions depends on the data modality:

| Modality | Typical Extensions |
|---|---|
| Text (tokenized) | .input_ids.pth, .labels.pth, .attention_mask.pth |
| Image-Caption | .input_ids.pth, .labels.pth, .pixel_values.pth, .attention_mask.pth |
| Robotics | .input_ids.pth, .labels.pth, .pixel_values.pth, .actions.pth, .state.pth |

Each data modality has a corresponding pipeline class in vla_foundry/data/pipelines/ that knows how to decode and collate these fields.

Multiple Datasets and Weighting

You can train on multiple datasets simultaneously by providing a list of manifest paths together with matching lists of weights and modalities.

In YAML

data:
  type: image_caption
  dataset_manifest:
    - s3://my-bucket/dataset_a/manifest.jsonl
    - s3://my-bucket/dataset_b/manifest.jsonl
  dataset_weighting:
    - 0.7
    - 0.3
  dataset_modality:
    - image_caption
    - image_caption

From the CLI

python vla_foundry/main.py \
    --data.dataset_manifest '["s3://bucket/a/manifest.jsonl","s3://bucket/b/manifest.jsonl"]' \
    --data.dataset_weighting '[0.7, 0.3]' \
    --data.dataset_modality '["image_caption","image_caption"]'

List lengths must match

dataset_manifest, dataset_weighting, and dataset_modality must all have the same length. The training loop asserts this at startup:

assert len(cfg.data.dataset_manifest) == len(cfg.data.dataset_modality)
assert len(cfg.data.dataset_manifest) == len(cfg.data.dataset_weighting)

How weighting works

At the start of each checkpoint window, the dataloader divides the window's sample budget across datasets proportionally to their weights. For example, with a budget of 1,000,000 samples and weights [0.7, 0.3], it selects shards providing roughly 700,000 samples from dataset A and 300,000 from dataset B. The selected shards are then interleaved using wds.RandomMix.

graph LR
    BUDGET["1,000,000 samples"] --> A["Dataset A: 700k"]
    BUDGET --> B["Dataset B: 300k"]
    A --> MIX["RandomMix"]
    B --> MIX
    MIX --> DL["DataLoader"]

Shard Selection and Shuffling

Shards are shuffled at the start of each checkpoint window using a deterministic seed. This means:

  • Different checkpoint windows see shards in different orders.
  • Resuming from a checkpoint replays the exact same shard order (the seed is saved alongside the checkpoint).
  • When allow_multiple_epochs is True, shards are reshuffled and reused once all shards have been consumed.

The shard-level shuffle, combined with WebDataset's within-shard sample-level shuffle buffer (shuffle_buffer_size), provides two layers of randomization without requiring the entire dataset to fit in memory.
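The two layers of randomization can be sketched as follows. This is an illustration rather than the project's code: deriving the per-window seed as `base_seed + window_index` is an assumption, both function names are hypothetical, and the buffer is a simplified version of a streaming shuffle that does not model shuffle_initial.

```python
import random


def shuffled_shards(shards, base_seed, window_index):
    """Shard-level shuffle: the same seed and window always give the same order."""
    rng = random.Random(base_seed + window_index)  # assumed seed derivation
    order = list(shards)
    rng.shuffle(order)
    return order


def buffered_shuffle(samples, buffer_size, rng):
    """Sample-level shuffle that holds at most buffer_size items in memory."""
    buf = []
    for sample in samples:
        buf.append(sample)
        if len(buf) >= buffer_size:
            # Move a random element to the end and emit it.
            i = rng.randrange(len(buf))
            buf[i], buf[-1] = buf[-1], buf[i]
            yield buf.pop()
    # The input is exhausted: emit whatever is left in random order.
    rng.shuffle(buf)
    yield from buf


shards = [f"shards/shard-{i:06d}.tar" for i in range(100)]
window_0 = shuffled_shards(shards, base_seed=42, window_index=0)
window_0_resumed = shuffled_shards(shards, base_seed=42, window_index=0)
window_1 = shuffled_shards(shards, base_seed=42, window_index=1)
mixed = list(buffered_shuffle(range(100), buffer_size=10, rng=random.Random(0)))
```

Memory use is bounded by the buffer size, not the dataset size, which is the property the paragraph above relies on.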

DataLoader Configuration

Key DataParams fields that control data loading behavior:

| Field | Default | Description |
|---|---|---|
| num_workers | auto | DataLoader worker processes per GPU. Defaults to cpu_count / world_size. |
| prefetch_factor | 4 | Batches prefetched per worker |
| shuffle | True | Enable shard and sample shuffling |
| shuffle_buffer_size | 2000 | Sample-level shuffle buffer size |
| shuffle_initial | 500 | Initial fill of the shuffle buffer before yielding |
| seq_len | 2048 | Sequence length for text-based modalities |
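As an example of how the automatic num_workers default might be computed, the sketch below divides the CPU count by the world size. The function name `default_num_workers`, the use of integer division, and the floor of one worker are all assumptions; the source only states the cpu_count / world_size rule.

```python
import os


def default_num_workers(world_size, cpu_count=None):
    """Approximate the 'auto' default: CPUs divided by world size (assumed rounding)."""
    cpus = cpu_count if cpu_count is not None else (os.cpu_count() or 1)
    # Integer division with a floor of 1 so every process gets at least one worker.
    return max(1, cpus // world_size)


print(default_num_workers(world_size=8, cpu_count=64))
```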

Key Source Files

| File | Purpose |
|---|---|
| vla_foundry/data/dataloader.py | get_wds_dataloader() and get_datastring_input() |
| vla_foundry/data/pipelines/ | Per-modality WebDataset pipeline classes |
| vla_foundry/data/utils.py | Manifest loading, epoch-to-sample conversion |
| vla_foundry/file_utils.py | load_dataset_manifest() with S3 and distributed support |
| vla_foundry/params/base_data_params.py | DataParams base class |
| vla_foundry/params/data_params.py | Concrete data-type subclasses |