Skip to content

Data Preprocessing

This guide covers the data preprocessing pipeline in VLA Foundry, including dependency setup, Ray parallelization, and conversion scripts for various dataset formats.

Dependencies

The preprocessing scripts use dependencies that are isolated from the training code. Install them with:

uv sync --group=preprocessing

Using Ray

Many preprocessing scripts use Ray for parallelization. The general workflow is:

  1. Create your Ray-compatible script.
  2. Start a Ray instance -- either locally or on AWS EC2 clusters.
  3. Run your script from within the Ray environment.

Local Ray

For local usage, start Ray on your machine:

ray start --head

Ray Dashboard

Add --include-dashboard=True before --head to enable the Ray dashboard for diagnostics.

AWS Ray Cluster

Template — adapt to your AWS account

ray_cluster_configs.yaml is a starting template, not a turn-key config. It will not ray up cleanly in a fresh AWS account. Before first use you must fill in the PLACEHOLDER fields to match your infrastructure:

  • SubnetIds: [subnet-PLACEHOLDER] — a subnet in your VPC with outbound internet access
  • SecurityGroupIds: [sg-PLACEHOLDER] — a security group that permits intra-cluster traffic (Ray ports) and outbound HTTPS
  • ImageId: ami-PLACEHOLDER — an Ubuntu AMI compatible with the listed InstanceTypes (m5.xlarge, i4i.4xlarge)
  • IamInstanceProfile.Arn: arn:aws:iam::ACCOUNT_ID:instance-profile/ray-autoscaler-v1 — an instance profile with EC2 autoscaling + the S3 read/write permissions your workload needs
  • Tags (owner.email, project) — adjust to your organization's tagging convention

The template has been validated on our internal setup but has not been tested against a clean AWS account. If ray up or ray attach fails, expect to iterate on the AMI / subnet / security group settings. Prefer the Local Ray path above if your preprocessing volume fits on a single machine.

1. Create the cluster

ray up vla_foundry/config_presets/data/preprocessing/ray_cluster_configs.yaml

Before running

You may also want to edit the following in ray_cluster_configs.yaml:

  • min_workers / max_workers — scale to your preprocessing volume
  • file_mounts: By default it copies your HF token from ~/.cache/huggingface/token. Change this if your token is stored elsewhere.
  • rsync_exclude: Currently excludes .venv and wandb. Add additional paths you want to exclude (e.g., large checkpoints).

2. Attach to the cluster

ray attach vla_foundry/config_presets/data/preprocessing/ray_cluster_configs.yaml

3. Run your script inside the cluster

# Optionally start a persistent terminal like tmux
cd vla_foundry
python some_preprocessing_script.py

Note

Ray scripts currently do not work well with uv run. Use uv sync --group=preprocessing (automatically done in ray_cluster_configs.yaml) and source .venv/bin/activate instead.

4. Shut down the cluster

When finished, exit the cluster, then from your local machine:

ray down vla_foundry/config_presets/data/preprocessing/ray_cluster_configs.yaml

Conversion Scripts

Downloading a Hugging Face Dataset to S3

python vla_foundry/data/preprocessing/hf_utils/hf_dataset_downloader.py \
  --dataset IPEC-COMMUNITY/droid_lerobot \
  --mode s3 \
  --s3-output-path s3://your-bucket/your-path/hf_datasets/droid_lerobot \
  --local-output-dir /datasets/hf_datasets/droid_lerobot \
  --preserve-structure

Converting VLM Hugging Face Captions to Tar Shards

This uses img2dataset for image downloading and webdataset shard creation. The HF dataset must already be downloaded to S3 (see section above).

python vla_foundry/data/preprocessing/preprocess_captionshf_to_tar.py \
  --cluster ray \
  --input_path s3://your-bucket/your-path/downloads/ \
  --output_path s3://your-bucket/your-path/downloads2/ \
  --url_col images \
  --caption_col texts \
  --save_additional_columns metadata

Converting Text Hugging Face Datasets to Tar Shards

python vla_foundry/data/preprocessing/preprocess_untokenized_to_tar.py \
  --s3_input_path s3://your-bucket/your-path/hf_datasets/fineweb-edu-350BT \
  --s3_output_path s3://your-bucket/your-path/datasets/text/fineweb-edu-350BT \
  --tmp_dir /tmp/finewebshards

Converting LeRobot to Tar Shards

The HF dataset must already be staged (local dir or S3 bucket). The example below uses Physical Intelligence's publicly-available pi_libero dataset.

Stage it first:

hf download physical-intelligence/pi_libero --repo-type dataset \
    --local-dir /tmp/pi_libero
aws s3 sync /tmp/pi_libero s3://your-bucket/your-path/hf_datasets/pi_libero/

Then run the preprocessor (swap s3://... for a local path like /tmp/pi_libero/ to skip the S3 staging):

source .venv/bin/activate && python vla_foundry/data/preprocessing/preprocess_robotics_to_tar.py \
  --type "lerobot" \
  --source_episodes "['s3://your-bucket/your-path/hf_datasets/pi_libero/']" \
  --output_dir s3://your-bucket/your-path/lerobotdata/pi_libero/ \
  --camera_names "['image', 'wrist_image']" \
  --samples_per_shard 100 \
  --config_path "vla_foundry/config_presets/data/robotics_preprocessing_params_1past_14future.yaml" \
  --observation_keys "['state']" \
  --action_keys "['actions']"

The --camera_names, --observation_keys, and --action_keys shown above match pi_libero's schema; adjust them if you swap in a different LeRobot dataset (e.g., lerobot/pusht uses single-camera ['observation.image'], state ['observation.state'], and action ['action']).

Converting LBM Spartan Data to Tar Shards

This is a two-step process: first download the raw Spartan data from the public registry, then convert it to tar shards.

Available tasks and raw data sizes
Task Size
BimanualHangMugsOnMugHolderFromDryingRack 92.4 GB
BimanualHangMugsOnMugHolderFromTable 93.3 GB
BimanualLayCerealBoxOnCuttingBoardFromTopShelf 53.2 GB
BimanualLayCerealBoxOnCuttingBoardFromUnderShelf 80.6 GB
BimanualPlaceAppleFromBowlIntoBin 76.1 GB
BimanualPlaceAppleFromBowlOnCuttingBoard 73.0 GB
BimanualPlaceAvocadoFromBowlOnCuttingBoard 113.1 GB
BimanualPlaceFruitFromBowlIntoBin 164.6 GB
BimanualPlaceFruitFromBowlOnCuttingBoard 140.3 GB
BimanualPlacePearFromBowlIntoBin 75.6 GB
BimanualPlacePearFromBowlOnCuttingBoard 113.4 GB
BimanualPutMugsOnPlatesFromDryingRack 135.3 GB
BimanualPutMugsOnPlatesFromTable 69.6 GB
BimanualPutRedBellPepperInBin 70.5 GB
BimanualPutSpatulaOnPlateFromDryingRack 50.6 GB
BimanualPutSpatulaOnPlateFromTable 38.5 GB
BimanualPutSpatulaOnTableFromDryingRack 61.9 GB
BimanualPutSpatulaOnTableFromUtensilCrock 88.5 GB
BimanualStackPlatesOnTableFromDryingRack 149.9 GB
BimanualStackPlatesOnTableFromTable 215.8 GB
BimanualStoreCerealBoxUnderShelf 74.9 GB
PickAndPlaceBox 15.6 GB
PlaceCupByCoaster 121.7 GB
PlaceCupOnCoaster 199.3 GB
PushCoasterToCenterOfTable 164.4 GB
PushCoasterToMug 182.8 GB
PutBananaInCenterOfTable 20.2 GB
PutBananaOnSaucer 23.3 GB
PutCupInCenterOfTable 95.8 GB
PutCupOnSaucer 201.2 GB
PutGreenAppleInCenterOfTable 55.9 GB
PutGreenAppleOnSaucer 19.9 GB
PutKiwiInCenterOfTable 19.6 GB
PutKiwiOnSaucer 43.0 GB
PutMugOnSaucer 115.6 GB
PutOrangeInCenterOfTable 44.8 GB
PutOrangeOnSaucer 20.9 GB
PutSpatulaInUtensilCrock 54.2 GB
PutSpatulaInUtensilCrockFromDryingRack 46.2 GB
TurnCupUpsideDown 394.2 GB
TurnMugRightsideUp 220.3 GB

Total: 40 tasks, ~3.8 TB. Raw data is not available for PushBox — use the pre-processed download instead.

Step 1: Download the raw data

python vla_foundry/data/scripts/download_dataset.py \
  --task PutGreenAppleOnSaucer \
  --local_path /data/raw \
  --raw

python vla_foundry/data/scripts/download_dataset.py \
  --task PickAndPlaceBox \
  --local_path /data/raw \
  --raw

This downloads and extracts the raw Spartan episodes to /data/raw/tasks/<TaskName>/.

Tip

Use --dry_run to preview what will be downloaded. Use --all --raw to download raw data for every task.

Step 2: Preprocess into tar shards

The --source_episodes argument expects paths to diffusion_spartan/ directories. After downloading, locate them with:

find /data/raw/tasks -name diffusion_spartan -type d
# Example output:
# /data/raw/tasks/PutGreenAppleOnSaucer/cabot/sim/bc/teleop/2024-11-14T15-59-59-08-00/diffusion_spartan
# /data/raw/tasks/PutGreenAppleOnSaucer/cabot/sim/bc/teleop/2024-11-14T16-33-05-08-00/diffusion_spartan
# /data/raw/tasks/PickAndPlaceBox/cabot/sim/bc/teleop/2025-09-08T15-18-57-04-00/diffusion_spartan

Then pass them to the preprocessor:

python vla_foundry/data/preprocessing/preprocess_robotics_to_tar.py \
  --type "spartan" \
  --source_episodes "[
      '/data/raw/tasks/PutGreenAppleOnSaucer/cabot/sim/bc/teleop/2024-11-14T15-59-59-08-00/diffusion_spartan/',
      '/data/raw/tasks/PutGreenAppleOnSaucer/cabot/sim/bc/teleop/2024-11-14T16-33-05-08-00/diffusion_spartan/',
      '/data/raw/tasks/PickAndPlaceBox/cabot/sim/bc/teleop/2025-09-08T15-18-57-04-00/diffusion_spartan/']" \
  --output_dir /data/preprocessed/mixed_tasks/ \
  --camera_names "include vla_foundry/config_presets/data/lbm/lbm_data_camera_names_4cameras.yaml" \
  --language_annotations_path vla_foundry/config_presets/data/lbm/lbm_language_annotations.yaml \
  --action_fields_config_path vla_foundry/config_presets/data/lbm/lbm_action_fields.yaml \
  --data_discard_keys "include vla_foundry/config_presets/data/lbm/lbm_data_discard_key.yaml" \
  --samples_per_shard 100 \
  --config_path "vla_foundry/config_presets/data/robotics_preprocessing_params_1past_14future.yaml"