Data Preprocessing¶
This guide covers the data preprocessing pipeline in VLA Foundry, including dependency setup, Ray parallelization, and conversion scripts for various dataset formats.
Dependencies¶
The preprocessing scripts use dependencies that are isolated from the training code. Install them with:
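Per the note in the Ray section below, the preprocessing dependencies live in a `uv` dependency group named `preprocessing`:

```shell
# Install the isolated preprocessing dependency group
uv sync --group=preprocessing

# Activate the environment (Ray scripts do not play well with `uv run`)
source .venv/bin/activate
```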
Using Ray¶
Many preprocessing scripts use Ray for parallelization. The general workflow is:
- Create your Ray-compatible script.
- Start a Ray instance -- either locally or on an AWS EC2 cluster.
- Run your script from within the Ray environment.
Local Ray¶
For local usage, start Ray on your machine:
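A typical invocation uses Ray's standard CLI (any flags beyond `--head` are up to you):

```shell
# Start a single-node Ray "cluster" on this machine
ray start --head

# ... run your preprocessing scripts ...

# Stop the local instance when finished
ray stop
```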
Ray Dashboard
Add `--include-dashboard=True` before `--head` to enable the Ray dashboard for diagnostics.
AWS Ray Cluster¶
Template — adapt to your AWS account
`ray_cluster_configs.yaml` is a starting template, not a turnkey config. It will not `ray up` cleanly in a fresh AWS account. Before first use you must fill in the `PLACEHOLDER` fields to match your infrastructure:
- `SubnetIds: [subnet-PLACEHOLDER]` -- a subnet in your VPC with outbound internet access
- `SecurityGroupIds: [sg-PLACEHOLDER]` -- a security group that permits intra-cluster traffic (Ray ports) and outbound HTTPS
- `ImageId: ami-PLACEHOLDER` -- an Ubuntu AMI compatible with the listed `InstanceTypes` (`m5.xlarge`, `i4i.4xlarge`)
- `IamInstanceProfile.Arn: arn:aws:iam::ACCOUNT_ID:instance-profile/ray-autoscaler-v1` -- an instance profile with EC2 autoscaling permissions plus the S3 read/write permissions your workload needs
- Tags (`owner.email`, `project`) -- adjust to your organization's tagging convention
The template has been validated on our internal setup but has not been tested against a clean AWS account. If ray up or ray attach fails, expect to iterate on the AMI / subnet / security group settings. Prefer the Local Ray path above if your preprocessing volume fits on a single machine.
1. Create the cluster¶
Before running
You may also want to edit the following in `ray_cluster_configs.yaml`:

- `min_workers` / `max_workers` -- scale to your preprocessing volume
- `file_mounts` -- by default this copies your HF token from `~/.cache/huggingface/token`; change it if your token is stored elsewhere
- `rsync_exclude` -- currently excludes `.venv` and `wandb`; add any other paths you want to exclude (e.g., large checkpoints)
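With the config filled in, the cluster is created with Ray's cluster launcher (`-y` skips the confirmation prompt):

```shell
ray up ray_cluster_configs.yaml -y
```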
2. Attach to the cluster¶
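`ray attach` takes the same cluster config as `ray up` and opens a shell on the head node:

```shell
ray attach ray_cluster_configs.yaml
```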
3. Run your script inside the cluster¶
```bash
# Optionally start a persistent terminal like tmux
cd vla_foundry
python some_preprocessing_script.py
```
Note
Ray scripts currently do not work well with `uv run`. Use `uv sync --group=preprocessing` (done automatically in `ray_cluster_configs.yaml`) and `source .venv/bin/activate` instead.
4. Shut down the cluster¶
When finished, exit the cluster, then from your local machine:
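The teardown command mirrors `ray up`, again taking the cluster config (`-y` skips the confirmation prompt):

```shell
ray down ray_cluster_configs.yaml -y
```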
Conversion Scripts¶
Downloading a Hugging Face Dataset to S3¶
```bash
python vla_foundry/data/preprocessing/hf_utils/hf_dataset_downloader.py \
    --dataset IPEC-COMMUNITY/droid_lerobot \
    --mode s3 \
    --s3-output-path s3://your-bucket/your-path/hf_datasets/droid_lerobot \
    --local-output-dir /datasets/hf_datasets/droid_lerobot \
    --preserve-structure
```
Converting VLM Hugging Face Captions to Tar Shards¶
This script uses `img2dataset` for image downloading and `webdataset` for shard creation. The HF dataset must already be downloaded to S3 (see the section above).
```bash
python vla_foundry/data/preprocessing/preprocess_captionshf_to_tar.py \
    --cluster ray \
    --input_path s3://your-bucket/your-path/downloads/ \
    --output_path s3://your-bucket/your-path/downloads2/ \
    --url_col images \
    --caption_col texts \
    --save_additional_columns metadata
```
Converting Text Hugging Face Datasets to Tar Shards¶
```bash
python vla_foundry/data/preprocessing/preprocess_untokenized_to_tar.py \
    --s3_input_path s3://your-bucket/your-path/hf_datasets/fineweb-edu-350BT \
    --s3_output_path s3://your-bucket/your-path/datasets/text/fineweb-edu-350BT \
    --tmp_dir /tmp/finewebshards
```
Converting LeRobot to Tar Shards¶
The HF dataset must already be staged (in a local directory or an S3 bucket). The example below uses Physical Intelligence's publicly available `pi_libero` dataset.
Stage it first:
```bash
hf download physical-intelligence/pi_libero --repo-type dataset \
    --local-dir /tmp/pi_libero
aws s3 sync /tmp/pi_libero s3://your-bucket/your-path/hf_datasets/pi_libero/
```
Then run the preprocessor (swap the `s3://...` argument for a local path like `/tmp/pi_libero/` to skip the S3 staging):
```bash
source .venv/bin/activate && python vla_foundry/data/preprocessing/preprocess_robotics_to_tar.py \
    --type "lerobot" \
    --source_episodes "['s3://your-bucket/your-path/hf_datasets/pi_libero/']" \
    --output_dir s3://your-bucket/your-path/lerobotdata/pi_libero/ \
    --camera_names "['image', 'wrist_image']" \
    --samples_per_shard 100 \
    --config_path "vla_foundry/config_presets/data/robotics_preprocessing_params_1past_14future.yaml" \
    --observation_keys "['state']" \
    --action_keys "['actions']"
```
The `--camera_names`, `--observation_keys`, and `--action_keys` shown above match `pi_libero`'s schema; adjust them if you swap in a different LeRobot dataset (e.g., `lerobot/pusht` uses single-camera `['observation.image']`, state `['observation.state']`, and action `['action']`).
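The quoting in these examples suggests the list-valued arguments are passed as Python list literals in a single string. A quick way to sanity-check your quoting before launching a long job is to run it through `ast.literal_eval` yourself; `parse_list_arg` below is a hypothetical helper (not part of VLA Foundry) that mirrors what a literal-eval-based CLI parser would accept:

```python
import ast

def parse_list_arg(raw: str) -> list[str]:
    """Parse a Python-list-literal CLI argument such as "['image', 'wrist_image']"."""
    value = ast.literal_eval(raw)
    if not isinstance(value, list) or not all(isinstance(v, str) for v in value):
        raise ValueError(f"expected a list of strings, got: {raw!r}")
    return value

# Values from the pi_libero example above:
cameras = parse_list_arg("['image', 'wrist_image']")  # ['image', 'wrist_image']
actions = parse_list_arg("['actions']")               # ['actions']
```

If `parse_list_arg` raises, your shell quoting likely mangled the argument.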
Converting LBM Spartan Data to Tar Shards¶
This is a two-step process: first download the raw Spartan data from the public registry, then convert it to tar shards.
Available tasks and raw data sizes
| Task | Size |
|---|---|
| BimanualHangMugsOnMugHolderFromDryingRack | 92.4 GB |
| BimanualHangMugsOnMugHolderFromTable | 93.3 GB |
| BimanualLayCerealBoxOnCuttingBoardFromTopShelf | 53.2 GB |
| BimanualLayCerealBoxOnCuttingBoardFromUnderShelf | 80.6 GB |
| BimanualPlaceAppleFromBowlIntoBin | 76.1 GB |
| BimanualPlaceAppleFromBowlOnCuttingBoard | 73.0 GB |
| BimanualPlaceAvocadoFromBowlOnCuttingBoard | 113.1 GB |
| BimanualPlaceFruitFromBowlIntoBin | 164.6 GB |
| BimanualPlaceFruitFromBowlOnCuttingBoard | 140.3 GB |
| BimanualPlacePearFromBowlIntoBin | 75.6 GB |
| BimanualPlacePearFromBowlOnCuttingBoard | 113.4 GB |
| BimanualPutMugsOnPlatesFromDryingRack | 135.3 GB |
| BimanualPutMugsOnPlatesFromTable | 69.6 GB |
| BimanualPutRedBellPepperInBin | 70.5 GB |
| BimanualPutSpatulaOnPlateFromDryingRack | 50.6 GB |
| BimanualPutSpatulaOnPlateFromTable | 38.5 GB |
| BimanualPutSpatulaOnTableFromDryingRack | 61.9 GB |
| BimanualPutSpatulaOnTableFromUtensilCrock | 88.5 GB |
| BimanualStackPlatesOnTableFromDryingRack | 149.9 GB |
| BimanualStackPlatesOnTableFromTable | 215.8 GB |
| BimanualStoreCerealBoxUnderShelf | 74.9 GB |
| PickAndPlaceBox | 15.6 GB |
| PlaceCupByCoaster | 121.7 GB |
| PlaceCupOnCoaster | 199.3 GB |
| PushCoasterToCenterOfTable | 164.4 GB |
| PushCoasterToMug | 182.8 GB |
| PutBananaInCenterOfTable | 20.2 GB |
| PutBananaOnSaucer | 23.3 GB |
| PutCupInCenterOfTable | 95.8 GB |
| PutCupOnSaucer | 201.2 GB |
| PutGreenAppleInCenterOfTable | 55.9 GB |
| PutGreenAppleOnSaucer | 19.9 GB |
| PutKiwiInCenterOfTable | 19.6 GB |
| PutKiwiOnSaucer | 43.0 GB |
| PutMugOnSaucer | 115.6 GB |
| PutOrangeInCenterOfTable | 44.8 GB |
| PutOrangeOnSaucer | 20.9 GB |
| PutSpatulaInUtensilCrock | 54.2 GB |
| PutSpatulaInUtensilCrockFromDryingRack | 46.2 GB |
| TurnCupUpsideDown | 394.2 GB |
| TurnMugRightsideUp | 220.3 GB |
Total: 41 tasks, ~4.1 TB. Raw data is not available for PushBox -- use the pre-processed download instead.
Step 1: Download the raw data¶
```bash
python vla_foundry/data/scripts/download_dataset.py \
    --task PutGreenAppleOnSaucer \
    --local_path /data/raw \
    --raw

python vla_foundry/data/scripts/download_dataset.py \
    --task PickAndPlaceBox \
    --local_path /data/raw \
    --raw
```
This downloads and extracts the raw Spartan episodes to `/data/raw/tasks/<TaskName>/`.
Tip
Use `--dry_run` to preview what will be downloaded. Use `--all --raw` to download raw data for every task.
Step 2: Preprocess into tar shards¶
The `--source_episodes` argument expects paths to `diffusion_spartan/` directories. After downloading, locate them with:
```bash
find /data/raw/tasks -name diffusion_spartan -type d
# Example output:
# /data/raw/tasks/PutGreenAppleOnSaucer/cabot/sim/bc/teleop/2024-11-14T15-59-59-08-00/diffusion_spartan
# /data/raw/tasks/PutGreenAppleOnSaucer/cabot/sim/bc/teleop/2024-11-14T16-33-05-08-00/diffusion_spartan
# /data/raw/tasks/PickAndPlaceBox/cabot/sim/bc/teleop/2025-09-08T15-18-57-04-00/diffusion_spartan
```
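If you prefer to script the discovery step, the same search can be done with `pathlib`. `collect_source_episodes` below is a hypothetical helper (not part of VLA Foundry) that formats the matches as the list-literal string the `--source_episodes` examples in this guide use:

```python
from pathlib import Path

def collect_source_episodes(root: str) -> str:
    # Find every diffusion_spartan/ directory under root,
    # like `find <root> -name diffusion_spartan -type d`
    dirs = sorted(
        str(p) + "/" for p in Path(root).rglob("diffusion_spartan") if p.is_dir()
    )
    # Format as the Python-list-literal string used in the examples in this guide
    return "[" + ", ".join(f"'{d}'" for d in dirs) + "]"
```

You can then paste the returned string directly into the `--source_episodes` argument (wrapped in double quotes).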
Then pass them to the preprocessor:
```bash
python vla_foundry/data/preprocessing/preprocess_robotics_to_tar.py \
    --type "spartan" \
    --source_episodes "[
        '/data/raw/tasks/PutGreenAppleOnSaucer/cabot/sim/bc/teleop/2024-11-14T15-59-59-08-00/diffusion_spartan/',
        '/data/raw/tasks/PutGreenAppleOnSaucer/cabot/sim/bc/teleop/2024-11-14T16-33-05-08-00/diffusion_spartan/',
        '/data/raw/tasks/PickAndPlaceBox/cabot/sim/bc/teleop/2025-09-08T15-18-57-04-00/diffusion_spartan/']" \
    --output_dir /data/preprocessed/mixed_tasks/ \
    --camera_names "include vla_foundry/config_presets/data/lbm/lbm_data_camera_names_4cameras.yaml" \
    --language_annotations_path vla_foundry/config_presets/data/lbm/lbm_language_annotations.yaml \
    --action_fields_config_path vla_foundry/config_presets/data/lbm/lbm_action_fields.yaml \
    --data_discard_keys "include vla_foundry/config_presets/data/lbm/lbm_data_discard_key.yaml" \
    --samples_per_shard 100 \
    --config_path "vla_foundry/config_presets/data/robotics_preprocessing_params_1past_14future.yaml"
```