Converting recordings¶

The rd convert command turns a raw recording directory (containing .svo2 / .bag camera files and robot_data.npz) into a a structured dataset directly consumable by policy-training frameworks.

Usage¶

rd convert

Running the command opens an interactive fzf selector to pick one or more tasks to convert (Tab to multi-select).

By default rd convert reads from ./data/raw/ and writes to ./data/processed/. Pass --data-dir to use a different root:

rd convert --data-dir /mnt/storage/robot_data

Episode filtering¶

rd convert only processes successful demonstrations — recordings marked as failure or pending in the metadata database are always skipped.

Additionally, recordings already marked as converted in the database are skipped by default, so re-running rd convert after adding new demonstrations only processes the new ones. To re-convert everything regardless:

rd convert --reconvert

If a recording has no database entry (e.g. recorded before the DB was set up) its status is treated as unknown and it is included.

Depth backends for ZED cameras¶

ZED stereo cameras support three depth estimation backends, selected with --stereo-method:

Backend	Flag	Description
ZED SDK (default)	`--stereo-method zed`	On-device NEURAL_LIGHT depth from the ZED SDK. Fast, no extra GPU setup required.
Fast Foundation Stereo	`--stereo-method ffs`	Fast Foundation Stereo — a foundation model for stereo depth. Requires a CUDA GPU; see Installation for setup.
TRI Stereo	`--stereo-method tri_stereo`	TRI's learned stereo depth model, tailored for robot manipulation scenes. Available in `c32` (faster) and `c64` (higher quality) variants. Requires a CUDA GPU; see TRI Stereo Depth for setup.

# Use Fast Foundation Stereo
rd convert --stereo-method ffs

# Reduce FFS input resolution for speed (default scale 1.0)
rd convert --stereo-method ffs --ffs-scale 0.5

# Increase FFS update iterations for quality (default 8)
rd convert --stereo-method ffs --ffs-iters 16

# Use TRI Stereo (c64 variant, default)
rd convert --stereo-method tri_stereo

# Use the lighter c32 variant
rd convert --stereo-method tri_stereo --tri-stereo-variant c32

When --stereo-method tri_stereo is used, Raiden automatically selects the fastest available backend for the chosen variant:

TensorRT — if stereo_c64.engine (or stereo_c32.engine) exists in ~/.config/raiden/weights/tri_stereo/
ONNX Runtime — if stereo_c64.onnx (or stereo_c32.onnx) exists
PyTorch — falls back to the .pth checkpoint

Learned stereo backends are GPU-heavy

Both Fast Foundation Stereo and TRI Stereo depth are GPU-intensive models and may be slow depending on your hardware. For real-time throughput, compile TensorRT engines.

Multi-camera synchronization¶

All ZED SVO2 cameras are extracted simultaneously in a single pass. On each output frame slot the converter advances any camera that lags the most recent camera by more than half a frame period (~16 ms at 30 fps) before saving. This guarantees that for any output index N the images from all cameras are within 16 ms of each other.

Cameras are also truncated to the minimum frame count across all cameras, so every camera directory contains exactly the same number of frames.

Output layout¶

<recording_dir>/
    split_all.json
    metadata_shared.json
    calibration_results.json        # copied from the first recording directory
    0000/
        metadata.json
        lowdim/
            0000000000.pkl          # per-frame lowdim (see below)
            0000000001.pkl
            ...
        rgb/
            scene_camera/
                0000000000.jpg      # JPEG quality ≥ 90
                0000000001.jpg
                ...
                timestamps.npy      # int64[N], nanoseconds (wall-clock)
            left_wrist_camera/
                ...
            right_wrist_camera/
                ...
        depth/
            scene_camera/
                0000000000.npz      # array key "depth", uint16, millimetres
                ...
            left_wrist_camera/
                ...

File formats¶

rgb/<camera>/<frame:010d>.jpg : BGR JPEG at quality ≥ 90. For cameras physically mounted upside-down (right_wrist_camera) the image is rotated 180° so the stored image is always right-side-up.

depth/<camera>/<frame:010d>.npz : Compressed NumPy archive with a single key "depth" holding a uint16 array (height × width) in millimetres. Zero means no-data.

rgb/<camera>/timestamps.npy : int64 NumPy array of length N. Each value is the wall-clock capture timestamp of the corresponding frame in nanoseconds (Unix epoch), on the same clock as robot_data.npz timestamps. For ZED cameras this comes directly from sl.TIME_REFERENCE.IMAGE; for RealSense cameras it is derived from the hardware timestamp corrected to wall-clock via the clock offset measured at recording start.

lowdim/<frame:010d>.pkl : Per-frame lowdim file (one per frame, shared across all cameras). Keys:

Key	Shape	Description
`intrinsics`	`dict[str → (3, 3) float32]`	`{camera_name: K}` — pinhole camera matrix `[[fx, 0, cx], [0, fy, cy], [0, 0, 1]]`. Principal point adjusted for upside-down cameras.
`extrinsics`	`dict[str → (4, 4) float32]`	`{camera_name: T_cam2world}` in the left_arm_base frame for this frame. Scene cameras: static calibrated extrinsics. Wrist cameras: computed from forward kinematics + hand-eye calibration.
`joints`	`(14,)` float32	Follower joint positions at this frame: `[left_arm(6), left_gripper(1), right_arm(6), right_gripper(1)]`.
`action`	`(26,)` float32	FK EE poses computed from commanded joint positions (`follower__joint_cmd`): `[l_pos(3), l_rot9(9), l_gripper(1), r_pos(3), r_rot9(9), r_gripper(1)]`. Left arm pose is in the left_arm_base* frame; right arm pose is in the right_arm_base frame. Rotation is the 3×3 matrix flattened row-major.
`action_joints`	`(14,)` float32	Commanded joint positions: `[left_arm(6), left_gripper(1), right_arm(6), right_gripper(1)]`. Same source as `action` but in joint space.
`language_task`	str	Task name.
`language_prompt`	str	Task instruction.

Coordinate system¶

The left-arm base frame is the global coordinate origin for all poses and extrinsics. This convention is fixed regardless of whether you are running a bimanual or single-arm setup - in single-arm mode the sole arm is always treated as the left arm.

Wrist camera extrinsics¶

Extrinsics for left_wrist_camera and right_wrist_camera are computed per frame using forward kinematics (FK):

T_left_base→cam[i] = [T_left_base←right_base @] FK(q[i]) @ T_cam→ee

FK(q[i]) - MuJoCo forward kinematics for the YAM arm evaluated at the follower joint positions interpolated to frame i's timestamp.
T_cam→ee - hand-eye calibration result (camera-to-end-effector). The calibration was performed with the raw (upside-down) camera image, so for right_wrist_camera a 180° Z-axis rotation correction is folded in.
T_left_base←right_base - applied only for right_wrist_camera to bring the result from right-arm base into the common left-arm base frame.

Timestamp-based interpolation¶

joints and action are interpolated from the ~100 Hz robot_data.npz onto the camera frame timestamps using numpy.interp. Both sources are in the same unit (int64 wall-clock nanoseconds), so no clock-domain conversion is needed. A single reference camera timestamp grid is used for all cameras in the episode (preferring wall-clock timestamps from ZED cameras).

Loading in Python¶

import numpy as np
import cv2
from pathlib import Path

ep = Path("data/recordings/pick_cube_20260218_143022/0000")

# RGB frames
rgb_dir = ep / "rgb" / "scene_camera"
frames   = sorted(rgb_dir.glob("*.jpg"))
ts_ns    = np.load(rgb_dir / "timestamps.npy")   # int64, nanoseconds

color = cv2.imread(str(frames[0]))                # uint8 BGR

# Depth
depth_npz = np.load(ep / "depth" / "scene_camera" / "0000000000.npz")
depth_mm  = depth_npz["depth"]                   # uint16, millimetres
depth_m   = depth_mm.astype(np.float32) / 1000.0

# Lowdim — one file per frame
import pickle
with open(ep / "lowdim" / "0000000000.pkl", "rb") as f:
    ld = pickle.load(f)

# Intrinsics and extrinsics are dicts keyed by camera name
intrinsics = ld["intrinsics"]   # dict[str → (3, 3)]
extrinsics = ld["extrinsics"]   # dict[str → (4, 4)]
# e.g. intrinsics["scene_camera"]        → (3, 3)  camera matrix K
#      extrinsics["left_wrist_camera"]   → (4, 4)  cam2world at this frame

# Joints and action for this frame
joints = ld["joints"]   # (14,): [l_arm(6), l_grip(1), r_arm(6), r_grip(1)]
action = ld["action"]   # (26,): [l_pos(3), l_rot9(9), l_grip(1), r_pos(3), r_rot9(9), r_grip(1)]

# Language
task   = ld["language_task"]
prompt = ld["language_prompt"]