Converting recordings¶
The rd convert command turns a raw recording directory (containing
.svo2 / .bag camera files and robot_data.npz) into a
structured dataset directly consumable by policy-training frameworks.
Usage¶
Running the command opens an interactive fzf selector to pick one or more tasks to convert (Tab to multi-select).
By default rd convert reads from ./data/raw/ and writes to ./data/processed/.
Pass --data-dir to use a different root:
Episode filtering¶
rd convert only processes successful demonstrations — recordings marked as
failure or pending in the metadata database are always skipped.
Additionally, recordings already marked as converted in the database are
skipped by default, so re-running rd convert after adding new demonstrations
only processes the new ones. To re-convert everything regardless:
If a recording has no database entry (e.g. recorded before the DB was set up) its status is treated as unknown and it is included.
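The skip rules above can be summarised as a small predicate. This is only an illustrative sketch: the function name `should_convert`, the status strings, and the `force` parameter are assumptions made for the example, not Raiden's actual API or database schema.

```python
from typing import Optional


def should_convert(status: Optional[str], converted: bool, force: bool = False) -> bool:
    """Return True if a recording would be processed under the rules above.

    status:    hypothetical DB status string, or None when there is no DB entry.
    converted: whether the DB marks the recording as already converted.
    force:     hypothetical stand-in for "re-convert everything regardless".
    """
    if status in ("failure", "pending"):
        return False  # always skipped, even when re-converting
    if converted and not force:
        return False  # already converted: skipped by default
    return True  # successful, or unknown status (no DB entry)
```

Under these assumptions, a recording with no database entry (`status=None`) is included, while a failed recording is skipped even when `force=True`.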
Depth backends for ZED cameras¶
ZED stereo cameras support three depth estimation backends, selected with
--stereo-method:
| Backend | Flag | Description |
|---|---|---|
| ZED SDK (default) | --stereo-method zed | On-device NEURAL_LIGHT depth from the ZED SDK. Fast, no extra GPU setup required. |
| Fast Foundation Stereo | --stereo-method ffs | A foundation model for stereo depth. Requires a CUDA GPU; see Installation for setup. |
| TRI Stereo | --stereo-method tri_stereo | TRI's learned stereo depth model, tailored for robot manipulation scenes. Available in c32 (faster) and c64 (higher quality) variants. Requires a CUDA GPU; see TRI Stereo Depth for setup. |
# Use Fast Foundation Stereo
rd convert --stereo-method ffs
# Reduce FFS input resolution for speed (default scale 1.0)
rd convert --stereo-method ffs --ffs-scale 0.5
# Increase FFS update iterations for quality (default 8)
rd convert --stereo-method ffs --ffs-iters 16
# Use TRI Stereo (c64 variant, default)
rd convert --stereo-method tri_stereo
# Use the lighter c32 variant
rd convert --stereo-method tri_stereo --tri-stereo-variant c32
When --stereo-method tri_stereo is used, Raiden automatically selects the fastest
available backend for the chosen variant:
- TensorRT: if stereo_c64.engine (or stereo_c32.engine) exists in ~/.config/raiden/weights/tri_stereo/
- ONNX Runtime: if stereo_c64.onnx (or stereo_c32.onnx) exists
- PyTorch: falls back to the .pth checkpoint
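The priority order above amounts to a file-existence check. The sketch below is hypothetical: `pick_tri_stereo_backend` is not a Raiden function, and it assumes the ONNX file lives in the same weights directory as the TensorRT engine.

```python
from pathlib import Path

# Weights directory named in the text above.
WEIGHTS_DIR = Path.home() / ".config" / "raiden" / "weights" / "tri_stereo"


def pick_tri_stereo_backend(variant: str = "c64", weights_dir: Path = WEIGHTS_DIR) -> str:
    """Return "tensorrt", "onnx", or "pytorch" following the priority order above."""
    if (weights_dir / f"stereo_{variant}.engine").exists():
        return "tensorrt"
    # Assumption: the ONNX export sits next to the TensorRT engine.
    if (weights_dir / f"stereo_{variant}.onnx").exists():
        return "onnx"
    return "pytorch"  # fall back to the .pth checkpoint
```

Calling `pick_tri_stereo_backend("c32")` on a machine with no compiled engine and no ONNX export would return `"pytorch"`.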
Learned stereo backends are GPU-heavy
Both Fast Foundation Stereo and TRI Stereo depth are GPU-intensive models and may be slow depending on your hardware. For real-time throughput, compile TensorRT engines.
Multi-camera synchronization¶
All ZED SVO2 cameras are extracted simultaneously in a single pass. At each output frame slot, the converter advances any camera that lags the most recent camera by more than half a frame period (about 16.7 ms at 30 fps) before saving. This guarantees that for any output index N the images from all cameras are within half a frame period of each other.
Cameras are also truncated to the minimum frame count across all cameras, so every camera directory contains exactly the same number of frames.
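The alignment and truncation described above can be sketched on timestamp arrays alone. This is a simplified illustration, not the converter's code: `align_frames` is a hypothetical name, and the sketch repeats the "advance the lagging camera" step until every camera is within tolerance.

```python
import numpy as np


def align_frames(timestamps_ns, fps=30.0):
    """Pick one source frame index per camera for each output slot.

    timestamps_ns: list of sorted 1-D int64 arrays, one per camera.
    Returns an (N, num_cameras) int64 array of source frame indices.
    """
    half_period_ns = int(0.5e9 / fps)
    n_cams = len(timestamps_ns)
    cursors = [0] * n_cams
    rows = []

    def exhausted():
        return any(c >= len(ts) for c, ts in zip(cursors, timestamps_ns))

    while not exhausted():
        # Advance cameras that lag the most recent one by more than half a period.
        while not exhausted():
            current = [int(ts[c]) for c, ts in zip(cursors, timestamps_ns)]
            latest = max(current)
            lagging = [k for k, t in enumerate(current) if latest - t > half_period_ns]
            if not lagging:
                break
            for k in lagging:
                cursors[k] += 1
        if exhausted():
            break  # a camera ran out of frames: truncate all cameras here
        rows.append(list(cursors))
        cursors = [c + 1 for c in cursors]

    return np.array(rows, dtype=np.int64).reshape(-1, n_cams)
```

Each returned row names the frames that would be written under the same output index, so every camera ends up with the same number of frames.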
Output layout¶
<recording_dir>/
  split_all.json
  metadata_shared.json
  calibration_results.json      # copied from the first recording directory
  0000/
    metadata.json
    lowdim/
      0000000000.pkl            # per-frame lowdim (see below)
      0000000001.pkl
      ...
    rgb/
      scene_camera/
        0000000000.jpg          # JPEG quality ≥ 90
        0000000001.jpg
        ...
        timestamps.npy          # int64[N], nanoseconds (wall-clock)
      left_wrist_camera/
        ...
      right_wrist_camera/
        ...
    depth/
      scene_camera/
        0000000000.npz          # array key "depth", uint16, millimetres
        ...
      left_wrist_camera/
        ...
File formats¶
rgb/<camera>/<frame:010d>.jpg
: BGR JPEG at quality ≥ 90. For cameras physically mounted upside-down
(right_wrist_camera) the image is rotated 180° so the stored image is
always right-side-up.
depth/<camera>/<frame:010d>.npz
: Compressed NumPy archive with a single key "depth" holding a uint16
array (height × width) in millimetres. Zero means no-data.
rgb/<camera>/timestamps.npy
: int64 NumPy array of length N. Each value is the wall-clock capture
timestamp of the corresponding frame in nanoseconds (Unix epoch), on the
same clock as robot_data.npz timestamps. For ZED cameras this comes
directly from sl.TIME_REFERENCE.IMAGE; for RealSense cameras it is
derived from the hardware timestamp corrected to wall-clock via the
clock offset measured at recording start.
lowdim/<frame:010d>.pkl
: Per-frame lowdim file (one per frame, shared across all cameras). Keys:
| Key | Shape | Description |
|---|---|---|
| intrinsics | dict[str → (3, 3) float32] | {camera_name: K}, where K is the pinhole camera matrix [[fx, 0, cx], [0, fy, cy], [0, 0, 1]]. Principal point adjusted for upside-down cameras. |
| extrinsics | dict[str → (4, 4) float32] | {camera_name: T_cam2world} in the left_arm_base frame for this frame. Scene cameras: static calibrated extrinsics. Wrist cameras: computed from forward kinematics + hand-eye calibration. |
| joints | (14,) float32 | Follower joint positions at this frame: [left_arm(6), left_gripper(1), right_arm(6), right_gripper(1)]. |
| action | (26,) float32 | FK EE poses computed from commanded joint positions (follower_*_joint_cmd): [l_pos(3), l_rot9(9), l_gripper(1), r_pos(3), r_rot9(9), r_gripper(1)]. Left arm pose is in the left_arm_base frame; right arm pose is in the right_arm_base frame. Rotation is the 3×3 matrix flattened row-major. |
| action_joints | (14,) float32 | Commanded joint positions: [left_arm(6), left_gripper(1), right_arm(6), right_gripper(1)]. Same source as action but in joint space. |
| language_task | str | Task name. |
| language_prompt | str | Task instruction. |
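To make the 26-element action layout concrete, the following sketch splits it into per-arm position, rotation matrix, and gripper value. The helper name `split_action` is hypothetical; the slicing follows the layout and row-major convention documented for the action key.

```python
import numpy as np


def split_action(action):
    """Split the (26,) action vector into per-arm pos / rot / gripper parts."""
    action = np.asarray(action, dtype=np.float32)
    if action.shape != (26,):
        raise ValueError(f"expected shape (26,), got {action.shape}")
    arms = {}
    # Each arm occupies 13 values: pos(3) + rot9(9) + gripper(1).
    for name, chunk in (("left", action[:13]), ("right", action[13:])):
        arms[name] = {
            "pos": chunk[0:3],
            "rot": chunk[3:12].reshape(3, 3),  # row-major 3x3 rotation matrix
            "gripper": float(chunk[12]),
        }
    return arms
```

Note that the left pose is expressed in the left_arm_base frame and the right pose in the right_arm_base frame, so the two position vectors are not directly comparable without the base-to-base transform.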
Coordinate system¶
The left-arm base frame is the global coordinate origin for all poses and extrinsics. This convention is fixed regardless of whether you are running a bimanual or single-arm setup; in single-arm mode the sole arm is always treated as the left arm.
Wrist camera extrinsics¶
Extrinsics for left_wrist_camera and right_wrist_camera are computed
per frame using forward kinematics (FK):
- FK(q[i]): MuJoCo forward kinematics for the YAM arm, evaluated at the follower joint positions interpolated to frame i's timestamp.
- T_cam→ee: hand-eye calibration result (camera-to-end-effector). The calibration was performed with the raw (upside-down) camera image, so for right_wrist_camera a 180° Z-axis rotation correction is folded in.
- T_left_base←right_base: applied only for right_wrist_camera to bring the result from the right-arm base into the common left-arm base frame.
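A minimal sketch of how these three transforms might be chained with 4×4 homogeneous matrices follows. The multiplication order is an assumption inferred from the camera-to-world direction of the extrinsics, and `wrist_cam_to_world` and its argument names are hypothetical. The FK result and the 180° correction are taken as given inputs.

```python
import numpy as np


def wrist_cam_to_world(T_base_ee, T_cam_ee, T_left_right=None):
    """Compose camera -> end-effector -> arm base (-> left-arm base).

    T_base_ee:    (4, 4) FK result: end-effector pose in its own arm's base frame.
    T_cam_ee:     (4, 4) hand-eye result mapping camera coordinates into the
                  end-effector frame (any 180 degree correction already applied).
    T_left_right: (4, 4) right-base -> left-base transform; None for the left arm.
    """
    T = np.asarray(T_base_ee) @ np.asarray(T_cam_ee)
    if T_left_right is not None:
        T = np.asarray(T_left_right) @ T  # only needed for right_wrist_camera
    return T
```

For the left wrist camera the third argument is omitted, since its arm base already coincides with the world frame.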
Timestamp-based interpolation¶
joints and action are interpolated from the ~100 Hz robot_data.npz
onto the camera frame timestamps using numpy.interp. Both sources are in
the same unit (int64 wall-clock nanoseconds), so no clock-domain conversion
is needed. A single reference camera timestamp grid is used for all cameras
in the episode (preferring wall-clock timestamps from ZED cameras).
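As an illustration of this resampling step, the sketch below interpolates joint-space samples onto camera timestamps with numpy.interp, one column at a time. The helper name `interp_to_frames` is hypothetical, and subtracting a common time origin before the float conversion is a precaution added here, not something the converter is known to do.

```python
import numpy as np


def interp_to_frames(frame_ts_ns, robot_ts_ns, values):
    """Linearly interpolate (M, D) robot samples onto (N,) camera timestamps."""
    frame_ts_ns = np.asarray(frame_ts_ns, dtype=np.int64)
    robot_ts_ns = np.asarray(robot_ts_ns, dtype=np.int64)
    values = np.asarray(values, dtype=np.float64)
    # Subtract a common origin while still in int64, so that Unix-epoch
    # nanosecond values do not lose precision when cast to float64.
    t0 = robot_ts_ns[0]
    x = (frame_ts_ns - t0).astype(np.float64)
    xp = (robot_ts_ns - t0).astype(np.float64)
    # numpy.interp is 1-D, so interpolate each dimension separately.
    cols = [np.interp(x, xp, values[:, d]) for d in range(values.shape[1])]
    return np.stack(cols, axis=1).astype(np.float32)
```

numpy.interp clamps to the first or last sample for frame timestamps outside the robot log's time range, so frames captured before the first or after the last robot sample simply repeat the nearest value.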
Loading in Python¶
import numpy as np
import cv2
from pathlib import Path
ep = Path("data/recordings/pick_cube_20260218_143022/0000")
# RGB frames
rgb_dir = ep / "rgb" / "scene_camera"
frames = sorted(rgb_dir.glob("*.jpg"))
ts_ns = np.load(rgb_dir / "timestamps.npy") # int64, nanoseconds
color = cv2.imread(str(frames[0])) # uint8 BGR
# Depth
depth_npz = np.load(ep / "depth" / "scene_camera" / "0000000000.npz")
depth_mm = depth_npz["depth"] # uint16, millimetres
depth_m = depth_mm.astype(np.float32) / 1000.0
# Lowdim — one file per frame
import pickle
with open(ep / "lowdim" / "0000000000.pkl", "rb") as f:
    ld = pickle.load(f)
# Intrinsics and extrinsics are dicts keyed by camera name
intrinsics = ld["intrinsics"] # dict[str → (3, 3)]
extrinsics = ld["extrinsics"] # dict[str → (4, 4)]
# e.g. intrinsics["scene_camera"] → (3, 3) camera matrix K
# extrinsics["left_wrist_camera"] → (4, 4) cam2world at this frame
# Joints and action for this frame
joints = ld["joints"] # (14,): [l_arm(6), l_grip(1), r_arm(6), r_grip(1)]
action = ld["action"] # (26,): [l_pos(3), l_rot9(9), l_grip(1), r_pos(3), r_rot9(9), r_grip(1)]
# Language
task = ld["language_task"]
prompt = ld["language_prompt"]
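Combining the depth, intrinsics, and extrinsics loaded above, a depth image can be back-projected into the left-arm base frame. This is a sketch under stated assumptions: the depth image is pixel-aligned with the RGB image that K describes, depth values are distances along the optical axis, and `depth_to_world_points` is a hypothetical helper, not part of Raiden.

```python
import numpy as np


def depth_to_world_points(depth_mm, K, T_cam2world):
    """Back-project a uint16 millimetre depth image to (P, 3) world points."""
    depth_mm = np.asarray(depth_mm)
    v, u = np.nonzero(depth_mm)  # skip zero (no-data) pixels
    z = depth_mm[v, u].astype(np.float64) / 1000.0  # metres
    # Pinhole model: x = (u - cx) * z / fx, y = (v - cy) * z / fy
    x = (u - K[0, 2]) * z / K[0, 0]
    y = (v - K[1, 2]) * z / K[1, 1]
    pts_cam = np.stack([x, y, z, np.ones_like(z)], axis=0)  # (4, P) homogeneous
    return (np.asarray(T_cam2world) @ pts_cam)[:3].T  # (P, 3)
```

With the variables from the loading example, this could be called as `depth_to_world_points(depth_mm, intrinsics["scene_camera"], extrinsics["scene_camera"])`.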