This specification is under active development. We welcome feedback from researchers and engineers at ortf@gerra.com.
ORTF v0.2
The Open Robot Training Format (ORTF) is a specification for storing and exchanging robot teleoperation data. ORTF defines a directory structure, file formats, and metadata schemas that enable interoperability across research labs, robotics companies, and machine learning training pipelines.
An ORTF dataset contains synchronized multi-modal observations (camera streams, depth, proprioceptive state, audio), recorded actions, task annotations, operator notes, and comprehensive metadata describing the robot, sensors, and coordinate frames. The format is designed for efficient streaming, compatibility with existing ML frameworks, and lossless conversion to and from other common formats like LeRobot and RLDS.
Robot learning research lacks a standardized format for teleoperation data. The Open X-Embodiment project documented over 60 datasets from 34 research labs, each using incompatible formats with heterogeneous observation and action spaces. This fragmentation creates significant barriers to sharing, combining, and reusing teleoperation data.
ORTF addresses these issues by providing a self-describing format with explicit semantics, efficient storage, and defined conversion paths to existing formats.
ORTF uses a hybrid file format optimized for both storage efficiency and streaming access:
| Data Type | Format | Rationale |
|---|---|---|
| Tabular data | Apache Parquet | Columnar, compressed, Arrow-compatible, streamable |
| Video streams | MP4 (H.264/H.265) | Standard codec, hardware decode, 10-50x compression |
| Depth maps | MP4 (16-bit) or NPZ | Lossless 16-bit encoding or compressed NumPy |
| Audio | MP4 (AAC) or WAV | Compressed or lossless audio, widely supported |
| Metadata | JSON | Human-readable, versionable, schema-validatable |
| Episode index | Parquet | Fast episode lookup without scanning data files |
| Annotations | JSON + JPEG | Bounding boxes, annotated frames, operator notes |
This combination follows the same pattern as LeRobot v3.0, enabling near-trivial conversion between formats. Unlike monolithic formats (HDF5, TFRecord), this structure supports partial downloads and streaming.
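For example, a consumer can read just the episode index without touching step data or video; a minimal sketch with pyarrow (the meta/episodes.parquet path is defined in the layout below, and the dataset path is illustrative):

import pyarrow.parquet as pq
import pyarrow.compute as pc

# Only the small episode index is read; step Parquet files and videos stay untouched.
episodes = pq.read_table("my_dataset/meta/episodes.parquet")
hours = pc.sum(episodes.column("duration_seconds")).as_py() / 3600
print(f"{episodes.num_rows} episodes, {hours:.1f} hours")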
An ORTF dataset is a directory with the following structure. Files are organized into chunks to support datasets of arbitrary size.
my_dataset/
├── meta/
│   ├── manifest.json           # Dataset manifest (required)
│   ├── episodes.parquet        # Episode index (required)
│   ├── stats.json              # Normalization statistics
│   └── tasks.jsonl             # Task definitions
├── data/
│   ├── chunk-000/
│   │   └── steps.parquet       # Step data: observations, actions, flags
│   ├── chunk-001/
│   │   └── steps.parquet
│   └── ...
├── videos/
│   ├── cam_wrist/
│   │   ├── chunk-000/
│   │   │   └── episode_000000.mp4
│   │   └── ...
│   ├── cam_overhead/
│   │   └── ...
│   └── depth/
│       └── ...
├── annotations/                # Episode-level annotations (optional)
│   ├── episode_000000/
│   │   ├── scene_objects.json  # Bounding boxes and object labels
│   │   ├── first_frame.jpg     # Annotated first frame
│   │   └── depth_colorized.jpg # Colorized depth visualization
│   └── ...
└── robot/
    ├── robot.urdf              # Robot description (optional)
    └── meshes/                 # URDF mesh files

Data files are split into chunks to enable parallel processing and partial downloads. Each chunk contains a configurable number of episodes (default: 1000). Chunk directories are zero-padded to 3 digits (000-999). For datasets exceeding 1M episodes, extend to 6 digits.
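For illustration, the mapping from episode index to chunk directory can be computed as follows (the helper name is ours, not part of the spec):

def chunk_dir(episode_index: int, episodes_per_chunk: int = 1000, digits: int = 3) -> str:
    # Episodes are grouped into fixed-size chunks; directory names are zero-padded.
    chunk_id = episode_index // episodes_per_chunk
    return f"chunk-{chunk_id:0{digits}d}"

chunk_dir(1500)                  # "chunk-001"
chunk_dir(2_500_000, digits=6)   # "chunk-002500" for very large datasets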
The manifest is the root metadata file describing the entire dataset. It MUST be present at meta/manifest.json. The manifest describes the shared configuration across all episodes—individual episode metadata is stored in episodes.parquet.
{
"ortf_version": "0.2",
"dataset_id": "550e8400-e29b-41d4-a716-446655440000",
"name": "Kitchen Manipulation Dataset",
"description": "Teleoperated demonstrations of kitchen tasks",
"created_at": "2025-12-21T10:30:00Z",
"license": "CC-BY-4.0",
"robot": { ... }, // See Section 8
"action_space": { ... }, // See Section 9
"observation_space": { ... }, // See Section 10
"sensors": [ ... ], // See Section 11
"frames": { ... }, // See Section 12
"collection": {
"operator": "human_teleop",
"interface": "spacemouse",
"location": "Stanford AI Lab"
},
"statistics": {
"total_episodes": 1000,
"total_steps": 250000,
"total_duration_hours": 12.5
},
"timestamp_reference": "episode_start",
"sync_tolerance_ms": 33.3
}

| Field | Type | Description |
|---|---|---|
| ortf_version | string | Specification version (e.g., "0.2") |
| dataset_id | string | Unique identifier (UUID recommended) |
| robot | object | Robot description (see Section 8) |
| action_space | object | Action space definition (see Section 9) |
| observation_space | object | Observation space definition (see Section 10) |
| sensors | object[] | Sensor specifications (see Section 11) |
| frames | object | Coordinate frame definitions (see Section 12) |
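A minimal sketch of loading the manifest and checking that the fields from the table above are present (the dataset path is illustrative):

import json

EXPECTED_FIELDS = ["ortf_version", "dataset_id", "robot", "action_space",
                   "observation_space", "sensors", "frames"]

with open("my_dataset/meta/manifest.json") as f:
    manifest = json.load(f)

missing = [k for k in EXPECTED_FIELDS if k not in manifest]
if missing:
    raise ValueError(f"manifest.json is missing expected fields: {missing}")
print("ORTF version:", manifest["ortf_version"])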
The robot description provides information needed to interpret proprioceptive data and actions. This section supports both simple descriptions and full URDF references.
{
"robot": {
"id": "panda-001",
"name": "Franka Panda",
"manufacturer": "Franka Emika",
"type": "manipulator",
"class": "arm",
"dof": 7,
"urdf_path": "robot/panda.urdf",
"joints": [
{"name": "panda_joint1", "type": "revolute", "index": 0, "limits": [-2.8973, 2.8973]},
{"name": "panda_joint2", "type": "revolute", "index": 1, "limits": [-1.7628, 1.7628]},
{"name": "panda_joint3", "type": "revolute", "index": 2, "limits": [-2.8973, 2.8973]},
{"name": "panda_joint4", "type": "revolute", "index": 3, "limits": [-3.0718, -0.0698]},
{"name": "panda_joint5", "type": "revolute", "index": 4, "limits": [-2.8973, 2.8973]},
{"name": "panda_joint6", "type": "revolute", "index": 5, "limits": [-0.0175, 3.7525]},
{"name": "panda_joint7", "type": "revolute", "index": 6, "limits": [-2.8973, 2.8973]}
],
"gripper": {
"type": "parallel_jaw",
"max_width": 0.08,
"fingers": 2
},
"end_effector_link": "panda_hand"
}
}

Each joint in the joints array describes one degree of freedom:
| Field | Required | Description |
|---|---|---|
| name | Yes | Joint name (matches URDF if provided) |
| type | Yes | revolute \| prismatic \| continuous |
| limits | No | [min, max] in radians or meters |
| index | Yes | Position in joint state vector |
If urdf_path is provided, it MUST point to a valid URDF file in the robot/ directory. Joint names in the manifest MUST match joint names in the URDF. Associated mesh files should be included in robot/meshes/.
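The URDF consistency requirement can be checked with a short script. A minimal sketch using the standard-library XML parser (paths follow the examples in this document):

import json
import xml.etree.ElementTree as ET

with open("my_dataset/meta/manifest.json") as f:
    robot = json.load(f)["robot"]

# Joint names declared in the manifest.
manifest_joints = {j["name"] for j in robot["joints"]}

# Joint names defined in the referenced URDF under robot/.
urdf = ET.parse("my_dataset/" + robot["urdf_path"])
urdf_joints = {j.get("name") for j in urdf.getroot().iter("joint")}

unmatched = manifest_joints - urdf_joints
if unmatched:
    raise ValueError(f"Manifest joints not found in URDF: {sorted(unmatched)}")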
The action space defines the semantics of the action field in step data. This is critical for training—without explicit action space definitions, models cannot interpret recorded commands correctly.
{
"action_space": {
"type": "end_effector_delta",
"frame": "end_effector",
"control_frequency_hz": 10,
"dimensions": [
{"name": "dx", "index": 0, "type": "position", "units": "meters", "range": [-0.05, 0.05]},
{"name": "dy", "index": 1, "type": "position", "units": "meters", "range": [-0.05, 0.05]},
{"name": "dz", "index": 2, "type": "position", "units": "meters", "range": [-0.05, 0.05]},
{"name": "droll", "index": 3, "type": "rotation", "units": "radians", "range": [-0.25, 0.25]},
{"name": "dpitch", "index": 4, "type": "rotation", "units": "radians", "range": [-0.25, 0.25]},
{"name": "dyaw", "index": 5, "type": "rotation", "units": "radians", "range": [-0.25, 0.25]},
{"name": "gripper", "index": 6, "type": "binary", "units": "discrete", "values": [0, 1]}
],
"normalization": {
"method": "min_max",
"range": [-1, 1]
}
}
}

The type field selects one of the supported action space types:

| Type | Description |
|---|---|
| Joint position | Target joint positions in radians. Dimension equals robot DOF. |
| Joint velocity | Target joint velocities in rad/s. Dimension equals robot DOF. |
| End-effector pose | Target end-effector pose: position (3D) + orientation (quaternion or Euler). |
| End-effector delta (end_effector_delta, shown above) | Delta end-effector command. Commonly 6D (dx, dy, dz, droll, dpitch, dyaw) + gripper. |
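To make the normalization block above concrete, a minimal sketch of min_max normalization driven by the per-dimension range (or values) fields; the helper name is illustrative, not part of the spec:

import numpy as np

def normalize_action(action, dimensions, out_range=(-1.0, 1.0)):
    # Use each dimension's declared range; binary dimensions declare values instead.
    bounds = [d.get("range") or [min(d["values"]), max(d["values"])] for d in dimensions]
    lo = np.array([b[0] for b in bounds], dtype=np.float32)
    hi = np.array([b[1] for b in bounds], dtype=np.float32)
    unit = (np.asarray(action, dtype=np.float32) - lo) / (hi - lo)    # -> [0, 1]
    return out_range[0] + unit * (out_range[1] - out_range[0])        # -> out_range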
Each dimension in the dimensions array describes one element of the action vector:
{
"name": "dx", // Human-readable name
"index": 0, // Position in action vector
"type": "position", // position | rotation | velocity | binary | continuous
"units": "meters", // meters | radians | rad_per_s | normalized | discrete
"range": [-0.05, 0.05], // Valid value range
"description": "End-effector delta X in base frame" // Optional
}Observations are the robot's sensory inputs at each step. The observation space defines what data is available and how to interpret it.
{
"observation_space": {
"state": {
"joint_positions": {"dim": 7, "units": "radians"},
"joint_velocities": {"dim": 7, "units": "rad_per_s"},
"ee_position": {"dim": 3, "units": "meters", "frame": "robot_base"},
"ee_orientation": {"dim": 4, "units": "quaternion", "order": "wxyz"},
"gripper_position": {"dim": 1, "units": "normalized", "range": [0, 1]}
},
"images": {
"cam_wrist": "wrist_rgb", // Maps to sensor name
"cam_overhead": "overhead_rgb"
}
}
}

The state field contains the robot's proprioceptive state. Each component is stored in the Parquet file under observation.state.*:
| Component | Dimension | Units |
|---|---|---|
| joint_positions | [N_joints] | radians |
| joint_velocities | [N_joints] | rad/s |
| joint_torques | [N_joints] | Nm (optional) |
| ee_position | [3] | meters, in base frame |
| ee_orientation | [4] | quaternion (w, x, y, z) |
| gripper_position | [1] or [2] | normalized [0, 1] or meters |
Camera observations are stored in MP4 files under videos/. The images field maps observation keys to sensor names defined in Section 11. Frame indices are stored in the Parquet data.
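A minimal sketch of fetching the wrist-camera frame for one step via its frame_index, using OpenCV for decoding (the decoder choice and paths are illustrative; any MP4 decoder works):

import cv2
import pyarrow.parquet as pq

steps = pq.read_table("my_dataset/data/chunk-000/steps.parquet").to_pylist()
step = steps[0]

# Seek to the video frame recorded for this step and decode it.
cap = cv2.VideoCapture("my_dataset/videos/cam_wrist/chunk-000/episode_000000.mp4")
cap.set(cv2.CAP_PROP_POS_FRAMES, step["observation.images.cam_wrist.frame_index"])
ok, frame_bgr = cap.read()
cap.release()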
Each sensor in the dataset must be fully specified in the manifest. This enables proper interpretation of data and supports multi-sensor setups.
{
"sensors": [
{
"name": "wrist_rgb",
"type": "camera",
"mount": "egocentric",
"mount_link": "panda_hand",
"resolution": {"width": 640, "height": 480},
"fps": 30,
"encoding": "h264",
"intrinsics": {
"fx": 612.0, "fy": 612.0,
"cx": 320.0, "cy": 240.0
},
"extrinsics": {
"parent_frame": "panda_hand",
"translation": [0.05, 0.0, 0.02],
"rotation": [0.707, 0.0, 0.707, 0.0]
}
},
{
"name": "overhead_rgb",
"type": "camera",
"mount": "exocentric",
"resolution": {"width": 1280, "height": 720},
"fps": 30,
"encoding": "h264",
"intrinsics": {
"fx": 900.0, "fy": 900.0,
"cx": 640.0, "cy": 360.0
},
"extrinsics": {
"parent_frame": "world",
"translation": [0.5, 0.0, 1.2],
"rotation": [0.5, 0.5, -0.5, -0.5]
}
},
{
"name": "depth",
"type": "depth_camera",
"mount": "exocentric",
"resolution": {"width": 640, "height": 480},
"fps": 30,
"bit_depth": 16,
"depth_units": "millimeters",
"depth_range": [0.1, 10.0],
"extrinsics": {
"parent_frame": "world",
"translation": [0.5, 0.0, 1.2],
"rotation": [0.5, 0.5, -0.5, -0.5]
}
},
{
"name": "audio",
"type": "audio",
"sample_rate": 44100,
"channels": 1,
"codec": "aac"
}
]
}

Each sensor type has its own required and optional fields:

| Sensor Type | Description | Required Fields | Optional Fields |
|---|---|---|---|
| camera | RGB or depth camera | resolution, fps, intrinsics | distortion, depth_range, encoding |
| depth_camera | Depth sensor (structured light, ToF, stereo) | resolution, fps, depth_units, depth_range | intrinsics, baseline, bit_depth |
| force/torque | 6-axis force/torque sensor | rate_hz, range_force, range_torque | noise_model |
| lidar | 2D or 3D LiDAR scanner | scan_rate, range, channels | fov_horizontal, fov_vertical |
| audio | Microphone | sample_rate, channels | codec, bit_depth |

All spatial data in ORTF is expressed relative to explicitly defined coordinate frames. This eliminates ambiguity when combining data from multiple sources.
{
"frames": {
"world": {
"type": "fixed",
"convention": "z_up",
"description": "Global reference frame, gravity-aligned"
},
"robot_base": {
"type": "fixed",
"parent": "world",
"transform": {
"translation": [0.0, 0.0, 0.75],
"rotation": [1.0, 0.0, 0.0, 0.0]
},
"description": "Robot base link origin"
},
"end_effector": {
"type": "dynamic",
"source": "forward_kinematics",
"description": "Tool center point, computed from joint state"
}
}
}

| Frame | Description | Convention |
|---|---|---|
| robot_base | Robot base link origin | Z-up, X-forward |
| world | Global reference frame | Z-up, gravity-aligned |
| end_effector | Tool center point | Z along tool axis |
Transforms are specified as {translation: [x,y,z], rotation: [qw,qx,qy,qz]}. Rotation is a unit quaternion in (w, x, y, z) order. The transform takes points from the child frame to the parent frame: p_parent = T * p_child.
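A minimal sketch of applying such a transform with SciPy (note the reorder from the spec's (w, x, y, z) to SciPy's scalar-last (x, y, z, w) order):

import numpy as np
from scipy.spatial.transform import Rotation

def to_parent(p_child, transform):
    # transform = {"translation": [x, y, z], "rotation": [qw, qx, qy, qz]}
    qw, qx, qy, qz = transform["rotation"]
    rot = Rotation.from_quat([qx, qy, qz, qw])   # SciPy expects (x, y, z, w)
    return rot.apply(np.asarray(p_child)) + np.asarray(transform["translation"])

# Point at the robot_base origin, expressed in the world frame (example from above).
to_parent([0.0, 0.0, 0.0], {"translation": [0.0, 0.0, 0.75], "rotation": [1.0, 0.0, 0.0, 0.0]})
# -> array([0.  , 0.  , 0.75])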
Episodes are indexed in meta/episodes.parquet. Each row contains episode-level metadata including success labels, operator notes, and file references:
# meta/episodes.parquet columns:
episode_id: string # Unique episode identifier (UUID or sequential)
task_id: int64 # Reference to tasks.jsonl
start_step: int64 # First step index in data files
end_step: int64 # Last step index (exclusive)
length: int64 # Number of steps
duration_seconds: float64 # Episode duration
success: bool # Task success label (null if not evaluated)
failure_reason: string # Reason for failure (null if success or not evaluated)
operator_notes: string # Free-form notes from teleoperator
chunk_id: int64 # Which chunk contains this episode
recorded_at: string # ISO 8601 timestamp of recording
video_files: struct{ # Video file references per camera
cam_wrist: string
cam_overhead: string
depth: string
}
audio_file: string          # Path to audio file (optional)

Step-level data is stored in data/chunk-*/steps.parquet:
# data/chunk-*/steps.parquet columns:
episode_id: string
step_index: int64
timestamp: float64
is_first: bool
is_last: bool
is_terminal: bool
action: float32[7]
observation.state.joint_positions: float32[7]
observation.state.joint_velocities: float32[7]
observation.state.ee_position: float32[3]
observation.state.ee_orientation: float32[4]
observation.state.gripper_position: float32[1]
observation.images.cam_wrist.frame_index: int64
observation.images.cam_overhead.frame_index: int64
observation.images.depth.frame_index: int64

Each step includes boundary flags following the RLDS convention:
| Flag | Type | Description |
|---|---|---|
| is_first | bool | True only for first step of episode |
| is_last | bool | True only for last step of episode |
| is_terminal | bool | True if episode ended due to task completion/failure (not truncation) |
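Putting the two schemas together, a minimal sketch that loads the metadata row for one episode and then filters that episode's steps from its chunk (paths and the choice of episode are illustrative):

import pyarrow.parquet as pq
import pyarrow.compute as pc

root = "my_dataset"
episodes = pq.read_table(f"{root}/meta/episodes.parquet")
episode = episodes.slice(0, 1).to_pylist()[0]        # metadata row for the first episode

chunk = f"chunk-{episode['chunk_id']:03d}"
steps = pq.read_table(f"{root}/data/{chunk}/steps.parquet")
mask = pc.equal(steps.column("episode_id"), episode["episode_id"])
episode_steps = steps.filter(mask)

print(episode["success"], episode_steps.num_rows, "steps")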
Episodes reference tasks by ID. Task definitions are stored in meta/tasks.jsonl:
// meta/tasks.jsonl - one JSON object per line
{
"task_id": 0,
"type": "box_pick_v0",
"instruction": "Pick up the red box (Box A) and place it in the bowl (Bowl A)",
"description": "The robot picks up a colored box and places it in a container"
}
{
"task_id": 1,
"type": "drawer_open_v0",
"instruction": "Open the drawer and retrieve the spoon",
"description": "The robot opens a drawer and retrieves an object from inside"
}ORTF supports rich annotations at both the episode and frame level. Annotations are stored in the annotations/ directory, organized by episode.
Scene object annotations provide bounding boxes and object labels for the first frame of each episode. Annotated objects are typically those referenced in the task instruction (e.g., "pick up Box A"):
// annotations/episode_000000/scene_objects.json
{
"annotated_at": "2025-12-21T10:30:00Z",
"annotated_frame": 0,
"annotated_frame_timestamp": 0.0,
"objects": [
{
"label": "Box A",
"class": "box",
"bounding_box": {
"x": 120,
"y": 80,
"width": 64,
"height": 64
},
"color": "#ef4444"
},
{
"label": "Bowl A",
"class": "container",
"bounding_box": {
"x": 340,
"y": 200,
"width": 96,
"height": 72
},
"color": "#3b82f6"
},
{
"label": "Table B",
"class": "surface",
"bounding_box": {
"x": 0,
"y": 280,
"width": 640,
"height": 200
},
"color": "#22c55e"
}
]
}

Operator notes are free-form text from the human teleoperator, stored in the operator_notes column of episodes.parquet. They capture qualitative observations about the recording:
// Example operator notes stored in episodes.parquet
"Robot arm was slightly unstable during reach. Gripper slipped on first attempt."
"Perfect execution. No issues."
"Had to restart recording - first attempt had calibration drift."
"Object was heavier than expected, caused slight overshoot on placement."All temporal data uses float64 timestamps in seconds. The reference point depends on the timestamp_reference field in the manifest.
Different sensors may operate at different frequencies. ORTF preserves native measurement rates rather than resampling. The sync_tolerance_ms field in the manifest specifies the maximum timestamp difference at which measurements from different modalities are still considered synchronized.
For video data, frame timestamps are stored in the Parquet file. The frame_index column maps each step to its corresponding video frame.
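As an illustration of how a consumer might use sync_tolerance_ms, a minimal sketch that flags steps whose camera frame timestamp drifts too far from the step timestamp; the per-camera timestamp column name here is hypothetical, since the required schema above only mandates frame_index:

import numpy as np
import pyarrow.parquet as pq

tolerance_s = 33.3 / 1000.0   # sync_tolerance_ms from the manifest

steps = pq.read_table("my_dataset/data/chunk-000/steps.parquet")
step_ts = steps.column("timestamp").to_numpy()
# Hypothetical per-camera frame timestamp column; not part of the required schema.
frame_ts = steps.column("observation.images.cam_wrist.timestamp").to_numpy()

drift = np.abs(frame_ts - step_ts)
print("steps outside tolerance:", int((drift > tolerance_s).sum()))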
ORTF is designed for lossless conversion to and from existing formats. The following conversions are defined:
ORTF uses the same underlying file formats as LeRobot (Parquet + MP4), making conversion straightforward:
from ortf import load_dataset
from ortf.convert import to_lerobot
ds = load_dataset("path/to/ortf")
to_lerobot(ds, "output/lerobot")

Conversion to RLDS TFRecord format:

from ortf import load_dataset
from ortf.convert import to_rlds
ds = load_dataset("path/to/ortf")
to_rlds(ds, "output/rlds")

ORTF datasets can be validated using the reference validator, which checks the dataset against the requirements of this specification:
# Install validator
pip install ortf-validator
# Validate dataset
ortf validate path/to/dataset
# Validate with strict mode (all optional checks)
ortf validate path/to/dataset --strict
# Check specific episode
ortf validate path/to/dataset --episode 000042

We thank the following researchers and engineers for their review and feedback on this specification:
Reviewers will be listed here upon completion of the review process.
Interested in reviewing? Contact ortf@gerra.com
gerra. (2025). Open Robot Training Format (ORTF) Specification v0.2. https://gerra.com/research/ortf