Skip to content

Converting recordings

The rd convert command turns a raw recording directory (containing .svo2 / .bag camera files and robot_data.npz) into a a structured dataset directly consumable by policy-training frameworks.

Usage

rd convert

Running the command opens an interactive fzf selector to pick one or more tasks to convert (Tab to multi-select).

By default rd convert reads from ./data/raw/ and writes to ./data/processed/. Pass --data-dir to use a different root:

rd convert --data-dir /mnt/storage/robot_data

Episode filtering

rd convert only processes successful demonstrations — recordings marked as failure or pending in the metadata database are always skipped.

Additionally, recordings already marked as converted in the database are skipped by default, so re-running rd convert after adding new demonstrations only processes the new ones. To re-convert everything regardless:

rd convert --reconvert

If a recording has no database entry (e.g. recorded before the DB was set up) its status is treated as unknown and it is included.

Depth backends for ZED cameras

ZED stereo cameras support three depth estimation backends, selected with --stereo-method:

Backend Flag Description
ZED SDK (default) --stereo-method zed On-device NEURAL_LIGHT depth from the ZED SDK. Fast, no extra GPU setup required.
Fast Foundation Stereo --stereo-method ffs Fast Foundation Stereo — a foundation model for stereo depth. Requires a CUDA GPU; see Installation for setup.
TRI Stereo --stereo-method tri_stereo TRI's learned stereo depth model, tailored for robot manipulation scenes. Available in c32 (faster) and c64 (higher quality) variants. Requires a CUDA GPU; see TRI Stereo Depth for setup.
# Use Fast Foundation Stereo
rd convert --stereo-method ffs

# Reduce FFS input resolution for speed (default scale 1.0)
rd convert --stereo-method ffs --ffs-scale 0.5

# Increase FFS update iterations for quality (default 8)
rd convert --stereo-method ffs --ffs-iters 16

# Use TRI Stereo (c64 variant, default)
rd convert --stereo-method tri_stereo

# Use the lighter c32 variant
rd convert --stereo-method tri_stereo --tri-stereo-variant c32

When --stereo-method tri_stereo is used, Raiden automatically selects the fastest available backend for the chosen variant:

  1. TensorRT — if stereo_c64.engine (or stereo_c32.engine) exists in ~/.config/raiden/weights/tri_stereo/
  2. ONNX Runtime — if stereo_c64.onnx (or stereo_c32.onnx) exists
  3. PyTorch — falls back to the .pth checkpoint

Learned stereo backends are GPU-heavy

Both Fast Foundation Stereo and TRI Stereo depth are GPU-intensive models and may be slow depending on your hardware. For real-time throughput, compile TensorRT engines.

Multi-camera synchronization

All ZED SVO2 cameras are extracted simultaneously in a single pass. On each output frame slot the converter advances any camera that lags the most recent camera by more than half a frame period (~16 ms at 30 fps) before saving. This guarantees that for any output index N the images from all cameras are within 16 ms of each other.

Cameras are also truncated to the minimum frame count across all cameras, so every camera directory contains exactly the same number of frames.

Output layout

<recording_dir>/
    split_all.json
    metadata_shared.json
    calibration_results.json        # copied from the first recording directory
    0000/
        metadata.json
        lowdim/
            0000000000.pkl          # per-frame lowdim (see below)
            0000000001.pkl
            ...
        rgb/
            scene_camera/
                0000000000.jpg      # JPEG quality ≥ 90
                0000000001.jpg
                ...
                timestamps.npy      # int64[N], nanoseconds (wall-clock)
            left_wrist_camera/
                ...
            right_wrist_camera/
                ...
        depth/
            scene_camera/
                0000000000.npz      # array key "depth", uint16, millimetres
                ...
            left_wrist_camera/
                ...

File formats

rgb/<camera>/<frame:010d>.jpg : BGR JPEG at quality ≥ 90. For cameras physically mounted upside-down (right_wrist_camera) the image is rotated 180° so the stored image is always right-side-up.

depth/<camera>/<frame:010d>.npz : Compressed NumPy archive with a single key "depth" holding a uint16 array (height × width) in millimetres. Zero means no-data.

rgb/<camera>/timestamps.npy : int64 NumPy array of length N. Each value is the wall-clock capture timestamp of the corresponding frame in nanoseconds (Unix epoch), on the same clock as robot_data.npz timestamps. For ZED cameras this comes directly from sl.TIME_REFERENCE.IMAGE; for RealSense cameras it is derived from the hardware timestamp corrected to wall-clock via the clock offset measured at recording start.

lowdim/<frame:010d>.pkl : Per-frame lowdim file (one per frame, shared across all cameras). Keys:

Key Shape Description
intrinsics dict[str → (3, 3) float32] {camera_name: K} — pinhole camera matrix [[fx, 0, cx], [0, fy, cy], [0, 0, 1]]. Principal point adjusted for upside-down cameras.
extrinsics dict[str → (4, 4) float32] {camera_name: T_cam2world} in the left_arm_base frame for this frame. Scene cameras: static calibrated extrinsics. Wrist cameras: computed from forward kinematics + hand-eye calibration.
joints (14,) float32 Follower joint positions at this frame: [left_arm(6), left_gripper(1), right_arm(6), right_gripper(1)].
action (26,) float32 FK EE poses computed from commanded joint positions (follower_*_joint_cmd): [l_pos(3), l_rot9(9), l_gripper(1), r_pos(3), r_rot9(9), r_gripper(1)]. Left arm pose is in the left_arm_base frame; right arm pose is in the right_arm_base frame. Rotation is the 3×3 matrix flattened row-major.
action_joints (14,) float32 Commanded joint positions: [left_arm(6), left_gripper(1), right_arm(6), right_gripper(1)]. Same source as action but in joint space.
language_task str Task name.
language_prompt str Task instruction.

Coordinate system

The left-arm base frame is the global coordinate origin for all poses and extrinsics. This convention is fixed regardless of whether you are running a bimanual or single-arm setup - in single-arm mode the sole arm is always treated as the left arm.

Wrist camera extrinsics

Extrinsics for left_wrist_camera and right_wrist_camera are computed per frame using forward kinematics (FK):

T_left_base→cam[i] = [T_left_base←right_base @] FK(q[i]) @ T_cam→ee
  • FK(q[i]) - MuJoCo forward kinematics for the YAM arm evaluated at the follower joint positions interpolated to frame i's timestamp.
  • T_cam→ee - hand-eye calibration result (camera-to-end-effector). The calibration was performed with the raw (upside-down) camera image, so for right_wrist_camera a 180° Z-axis rotation correction is folded in.
  • T_left_base←right_base - applied only for right_wrist_camera to bring the result from right-arm base into the common left-arm base frame.

Timestamp-based interpolation

joints and action are interpolated from the ~100 Hz robot_data.npz onto the camera frame timestamps using numpy.interp. Both sources are in the same unit (int64 wall-clock nanoseconds), so no clock-domain conversion is needed. A single reference camera timestamp grid is used for all cameras in the episode (preferring wall-clock timestamps from ZED cameras).

Loading in Python

import numpy as np
import cv2
from pathlib import Path

ep = Path("data/recordings/pick_cube_20260218_143022/0000")

# RGB frames
rgb_dir = ep / "rgb" / "scene_camera"
frames   = sorted(rgb_dir.glob("*.jpg"))
ts_ns    = np.load(rgb_dir / "timestamps.npy")   # int64, nanoseconds

color = cv2.imread(str(frames[0]))                # uint8 BGR

# Depth
depth_npz = np.load(ep / "depth" / "scene_camera" / "0000000000.npz")
depth_mm  = depth_npz["depth"]                   # uint16, millimetres
depth_m   = depth_mm.astype(np.float32) / 1000.0

# Lowdim — one file per frame
import pickle
with open(ep / "lowdim" / "0000000000.pkl", "rb") as f:
    ld = pickle.load(f)

# Intrinsics and extrinsics are dicts keyed by camera name
intrinsics = ld["intrinsics"]   # dict[str → (3, 3)]
extrinsics = ld["extrinsics"]   # dict[str → (4, 4)]
# e.g. intrinsics["scene_camera"]        → (3, 3)  camera matrix K
#      extrinsics["left_wrist_camera"]   → (4, 4)  cam2world at this frame

# Joints and action for this frame
joints = ld["joints"]   # (14,): [l_arm(6), l_grip(1), r_arm(6), r_grip(1)]
action = ld["action"]   # (26,): [l_pos(3), l_rot9(9), l_grip(1), r_pos(3), r_rot9(9), r_grip(1)]

# Language
task   = ld["language_task"]
prompt = ld["language_prompt"]