Draft for Review | Version 0.2 · 2025-12-21

This specification is under active development. We welcome feedback from researchers and engineers at ortf@gerra.com.

SPECIFICATION

Open Robot Training Format

ORTF v0.2

Status: Draft
Last Updated: 2025-12-21
Authors: gerra
License: Apache 2.0

1 Abstract

The Open Robot Training Format (ORTF) is a specification for storing and exchanging robot teleoperation data. ORTF defines a directory structure, file formats, and metadata schemas that enable interoperability across research labs, robotics companies, and machine learning training pipelines.

An ORTF dataset contains synchronized multi-modal observations (camera streams, depth, proprioceptive state, audio), recorded actions, task annotations, operator notes, and comprehensive metadata describing the robot, sensors, and coordinate frames. The format is designed for efficient streaming, compatibility with existing ML frameworks, and lossless conversion to and from other common formats like LeRobot and RLDS.

2 Motivation

Robot learning research lacks a standardized format for teleoperation data. The Open X-Embodiment project documented over 60 datasets from 34 research labs, each using incompatible formats with heterogeneous observation and action spaces. This fragmentation creates significant barriers:

  • Conversion overhead: Labs spend significant engineering effort building format converters instead of doing research
  • Lost metadata: Critical information (coordinate frames, action space semantics, calibration) is lost during conversion
  • Reproducibility gaps: Without standardized metadata, reproducing results across labs is difficult
  • Storage inefficiency: Some formats (raw TFRecord, uncompressed HDF5) are 18-73x larger than necessary

ORTF addresses these issues by providing a self-describing format with explicit semantics, efficient storage, and defined conversion paths to existing formats.

3 Scope

3.1 In Scope

  • Teleoperation demonstrations (human-operated robots)
  • Multi-modal observations: RGB, depth, proprioception, audio, force/torque
  • Action recording with explicit space definitions
  • Multi-camera setups with calibration
  • Task and episode annotations
  • Operator notes and quality flags
  • Scene object annotations (bounding boxes)
  • Robot and sensor metadata

3.2 Out of Scope

  • Simulation state snapshots (use ManiSkill HDF5 format)
  • Reinforcement learning trajectories with rewards (use RLDS)
  • Synthetic video data without robot actions
  • Real-time streaming protocols

3.3 Design Principles

  1. Self-describing: All semantics encoded in metadata, no implicit conventions
  2. Efficient: Compressed video, columnar data, streamable
  3. Interoperable: Lossless conversion to LeRobot and RLDS formats
  4. Validatable: Machine-checkable schema for all metadata
  5. Extensible: Optional fields for domain-specific data

4 Terminology

Dataset
A collection of episodes recorded with the same robot, sensor configuration, and task family. A dataset has one manifest describing the shared configuration.
Episode
A single, contiguous recording of robot operation. Episodes are the fundamental unit of training data. An episode begins when recording starts and ends when the task is completed, failed, or manually terminated.
Step
A single timestep within an episode. Each step contains an observation, an action, and metadata flags. Steps are indexed from 0.
Observation
The robot's sensory input at a given step: camera images, proprioceptive state, and any additional sensor readings.
Action
The command sent to the robot at a given step. Actions may be in joint space, end-effector space, or a hybrid representation.
Proprioception
The robot's internal state sensing: joint positions, velocities, torques, end-effector pose, and gripper state.
Egocentric Sensor
A sensor mounted on the robot (e.g., wrist camera, joint encoders). Moves with the robot during operation.
Exocentric Sensor
A sensor fixed in the environment (e.g., overhead camera, external motion capture). Observes the robot from outside.

5 Format Overview

ORTF uses a hybrid file format optimized for both storage efficiency and streaming access:

Data Type       Format                 Rationale
Tabular data    Apache Parquet         Columnar, compressed, Arrow-compatible, streamable
Video streams   MP4 (H.264/H.265)      Standard codec, hardware decode, 10-50x compression
Depth maps      MP4 (16-bit) or NPZ    Lossless 16-bit encoding or compressed NumPy
Audio           MP4 (AAC) or WAV       Compressed or lossless audio, widely supported
Metadata        JSON                   Human-readable, versionable, schema-validatable
Episode index   Parquet                Fast episode lookup without scanning data files
Annotations     JSON + JPEG            Bounding boxes, annotated frames, operator notes

This combination follows the same pattern as LeRobot v3.0, enabling near-trivial conversion between formats. Unlike monolithic formats (HDF5, TFRecord), this structure supports partial downloads and streaming.

6 Directory Structure

An ORTF dataset is a directory with the following structure. Files are organized into chunks to support datasets of arbitrary size.

ORTF Directory Structure
my_dataset/
├── meta/
│   ├── manifest.json              # Dataset manifest (required)
│   ├── episodes.parquet           # Episode index (required)
│   ├── stats.json                 # Normalization statistics
│   └── tasks.jsonl                # Task definitions
├── data/
│   ├── chunk-000/
│   │   └── steps.parquet          # Step data: observations, actions, flags
│   ├── chunk-001/
│   │   └── steps.parquet
│   └── ...
├── videos/
│   ├── cam_wrist/
│   │   ├── chunk-000/
│   │   │   └── episode_000000.mp4
│   │   └── ...
│   ├── cam_overhead/
│   │   └── ...
│   └── depth/
│       └── ...
├── annotations/                   # Episode-level annotations (optional)
│   ├── episode_000000/
│   │   ├── scene_objects.json     # Bounding boxes and object labels
│   │   ├── first_frame.jpg        # Annotated first frame
│   │   └── depth_colorized.jpg    # Colorized depth visualization
│   └── ...
└── robot/
    ├── robot.urdf                 # Robot description (optional)
    └── meshes/                    # URDF mesh files

6.1 Chunking

Data files are split into chunks to enable parallel processing and partial downloads. Each chunk contains a configurable number of episodes (default: 1000). Chunk directories are zero-padded to 3 digits (000-999). For datasets exceeding 1M episodes, extend to 6 digits.
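
As a non-normative illustration, the following sketch maps an episode index to its chunk directory and per-camera video path under the default settings; the helper names are ours, not part of the format.

Chunk Path Mapping (illustrative)
def chunk_dir(episode_index, episodes_per_chunk=1000):
    # Default chunking: 1000 episodes per chunk, 3-digit zero-padded directory
    return f"chunk-{episode_index // episodes_per_chunk:03d}"

def video_path(camera, episode_index):
    return f"videos/{camera}/{chunk_dir(episode_index)}/episode_{episode_index:06d}.mp4"

# video_path("cam_wrist", 1234) -> "videos/cam_wrist/chunk-001/episode_001234.mp4"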

6.2 Required Files

  • meta/manifest.json - Dataset manifest (see Section 7)
  • meta/episodes.parquet - Episode index with metadata
  • data/chunk-*/steps.parquet - Step-level data

6.3 Optional Files

  • meta/stats.json - Precomputed normalization statistics
  • meta/tasks.jsonl - Task descriptions and IDs
  • videos/ - Camera streams (if using external video)
  • annotations/ - Episode-level annotations (see Section 14)
  • robot/ - URDF, meshes, configuration files

7 Dataset Manifest

The manifest is the root metadata file describing the entire dataset. It MUST be present at meta/manifest.json. The manifest describes the shared configuration across all episodes—individual episode metadata is stored in episodes.parquet.

meta/manifest.json
{
  "ortf_version": "0.2",
  "dataset_id": "550e8400-e29b-41d4-a716-446655440000",
  "name": "Kitchen Manipulation Dataset",
  "description": "Teleoperated demonstrations of kitchen tasks",
  "created_at": "2025-12-21T10:30:00Z",
  "license": "CC-BY-4.0",

  "robot": { ... },              // See Section 8
  "action_space": { ... },       // See Section 9
  "observation_space": { ... },  // See Section 10
  "sensors": [ ... ],            // See Section 11
  "frames": { ... },             // See Section 12

  "collection": {
    "operator": "human_teleop",
    "interface": "spacemouse",
    "location": "Stanford AI Lab"
  },

  "statistics": {
    "total_episodes": 1000,
    "total_steps": 250000,
    "total_duration_hours": 12.5
  },

  "timestamp_reference": "episode_start",
  "sync_tolerance_ms": 33.3
}

7.1 Required Fields

Field               Type       Description
ortf_version        string     Specification version (e.g., "0.2")
dataset_id          string     Unique identifier (UUID recommended)
robot               object     Robot description (see Section 8)
action_space        object     Action space definition (see Section 9)
observation_space   object     Observation space definition (see Section 10)
sensors             object[]   Sensor specifications (see Section 11)
frames              object     Coordinate frame definitions (see Section 12)
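
A minimal sketch of how a consumer might check these required fields, assuming a local dataset path; this is illustrative, not the reference validator.

Required Field Check (illustrative)
import json

REQUIRED = ["ortf_version", "dataset_id", "robot", "action_space",
            "observation_space", "sensors", "frames"]

with open("my_dataset/meta/manifest.json") as f:
    manifest = json.load(f)

# Report any required top-level fields that are absent from the manifest
missing = [key for key in REQUIRED if key not in manifest]
if missing:
    raise ValueError(f"manifest.json is missing required fields: {missing}")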

8 Robot Description

The robot description provides information needed to interpret proprioceptive data and actions. This section supports both simple descriptions and full URDF references.

Robot Description Schema
{
  "robot": {
    "id": "panda-001",
    "name": "Franka Panda",
    "manufacturer": "Franka Emika",
    "type": "manipulator",
    "class": "arm",
    "dof": 7,
    "urdf_path": "robot/panda.urdf",

    "joints": [
      {"name": "panda_joint1", "type": "revolute", "index": 0, "limits": [-2.8973, 2.8973]},
      {"name": "panda_joint2", "type": "revolute", "index": 1, "limits": [-1.7628, 1.7628]},
      {"name": "panda_joint3", "type": "revolute", "index": 2, "limits": [-2.8973, 2.8973]},
      {"name": "panda_joint4", "type": "revolute", "index": 3, "limits": [-3.0718, -0.0698]},
      {"name": "panda_joint5", "type": "revolute", "index": 4, "limits": [-2.8973, 2.8973]},
      {"name": "panda_joint6", "type": "revolute", "index": 5, "limits": [-0.0175, 3.7525]},
      {"name": "panda_joint7", "type": "revolute", "index": 6, "limits": [-2.8973, 2.8973]}
    ],

    "gripper": {
      "type": "parallel_jaw",
      "max_width": 0.08,
      "fingers": 2
    },

    "end_effector_link": "panda_hand"
  }
}

8.1 Joint Specification

Each joint in the joints array describes one degree of freedom:

Field    Required   Description
name     Yes        Joint name (matches URDF if provided)
type     Yes        revolute | prismatic | continuous
limits   No         [min, max] in radians or meters
index    Yes        Position in joint state vector

8.2 URDF Reference

If urdf_path is provided, it MUST point to a valid URDF file in the robot/ directory. Joint names in the manifest MUST match joint names in the URDF. Associated mesh files should be included in robot/meshes/.
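
The following sketch shows one way to check this consistency requirement with the standard library; the paths are examples and the snippet is not normative.

URDF Consistency Check (illustrative)
import json
import xml.etree.ElementTree as ET

with open("my_dataset/meta/manifest.json") as f:
    robot = json.load(f)["robot"]

# Collect joint names declared in the referenced URDF
urdf_root = ET.parse("my_dataset/" + robot["urdf_path"]).getroot()
urdf_joints = {joint.get("name") for joint in urdf_root.iter("joint")}

# Every joint named in the manifest must also exist in the URDF
unknown = {joint["name"] for joint in robot["joints"]} - urdf_joints
if unknown:
    raise ValueError(f"Joints missing from URDF: {sorted(unknown)}")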

9 Action Space

The action space defines the semantics of the action field in step data. This is critical for training—without explicit action space definitions, models cannot interpret recorded commands correctly.

Action Space Schema
{
  "action_space": {
    "type": "end_effector_delta",
    "frame": "end_effector",
    "control_frequency_hz": 10,

    "dimensions": [
      {"name": "dx",      "index": 0, "type": "position", "units": "meters",  "range": [-0.05, 0.05]},
      {"name": "dy",      "index": 1, "type": "position", "units": "meters",  "range": [-0.05, 0.05]},
      {"name": "dz",      "index": 2, "type": "position", "units": "meters",  "range": [-0.05, 0.05]},
      {"name": "droll",   "index": 3, "type": "rotation", "units": "radians", "range": [-0.25, 0.25]},
      {"name": "dpitch",  "index": 4, "type": "rotation", "units": "radians", "range": [-0.25, 0.25]},
      {"name": "dyaw",    "index": 5, "type": "rotation", "units": "radians", "range": [-0.25, 0.25]},
      {"name": "gripper", "index": 6, "type": "binary",   "units": "discrete", "values": [0, 1]}
    ],

    "normalization": {
      "method": "min_max",
      "range": [-1, 1]
    }
  }
}

9.1 Action Types

joint_position
Target joint positions in radians. Dimension equals robot DOF.

joint_velocity
Target joint velocities in rad/s. Dimension equals robot DOF.

end_effector_pose
Target end-effector pose. Position (3D) + orientation (quaternion or Euler angles).

end_effector_delta
Delta end-effector command. Commonly 6D (dx, dy, dz, droll, dpitch, dyaw) + gripper.

9.2 Dimension Specification

Each dimension in the dimensions array describes one element of the action vector:

Dimension Schema
{
  "name": "dx",           // Human-readable name
  "index": 0,             // Position in action vector
  "type": "position",     // position | rotation | velocity | binary | continuous
  "units": "meters",      // meters | radians | rad_per_s | normalized | discrete
  "range": [-0.05, 0.05], // Valid value range
  "description": "End-effector delta X in base frame"  // Optional
}
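
As a non-normative illustration of the min_max scheme above, the sketch below maps raw actions into [-1, 1] and back using the per-dimension range values; whether the binary gripper dimension is rescaled is left to the training pipeline.

Action Normalization (illustrative)
import numpy as np

# Per-dimension [min, max] values copied from the example action space
lo = np.array([-0.05, -0.05, -0.05, -0.25, -0.25, -0.25, 0.0])
hi = np.array([ 0.05,  0.05,  0.05,  0.25,  0.25,  0.25, 1.0])

def normalize(action):
    # Map each dimension from [lo, hi] to the target range [-1, 1]
    return 2.0 * (np.asarray(action) - lo) / (hi - lo) - 1.0

def denormalize(normalized):
    # Inverse mapping back to physical units
    return (np.asarray(normalized) + 1.0) / 2.0 * (hi - lo) + lo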

10 Observations

Observations are the robot's sensory inputs at each step. The observation space defines what data is available and how to interpret it.

Observation Space Schema
{
  "observation_space": {
    "state": {
      "joint_positions":   {"dim": 7,  "units": "radians"},
      "joint_velocities":  {"dim": 7,  "units": "rad_per_s"},
      "ee_position":       {"dim": 3,  "units": "meters", "frame": "robot_base"},
      "ee_orientation":    {"dim": 4,  "units": "quaternion", "order": "wxyz"},
      "gripper_position":  {"dim": 1,  "units": "normalized", "range": [0, 1]}
    },

    "images": {
      "cam_wrist": "wrist_rgb",       // Maps to sensor name
      "cam_overhead": "overhead_rgb"
    }
  }
}

10.1 State Vector

The state field contains the robot's proprioceptive state. Each component is stored in the Parquet file under observation.state.*:

Component          Dimension     Units
joint_positions    [N_joints]    radians
joint_velocities   [N_joints]    rad/s
joint_torques      [N_joints]    Nm (optional)
ee_position        [3]           meters, in base frame
ee_orientation     [4]           quaternion (w, x, y, z)
gripper_position   [1] or [2]    normalized [0, 1] or meters
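
For illustration, a training pipeline might flatten these components into a single vector as sketched below; the ordering is a pipeline choice, not mandated by ORTF.

State Vector Assembly (illustrative)
import numpy as np

def flat_state(row):
    # row: a single step record from steps.parquet (e.g., a pandas row or dict)
    return np.concatenate([
        row["observation.state.joint_positions"],   # 7
        row["observation.state.joint_velocities"],  # 7
        row["observation.state.ee_position"],       # 3
        row["observation.state.ee_orientation"],    # 4, stored as (w, x, y, z)
        row["observation.state.gripper_position"],  # 1
    ]).astype(np.float32)                           # 22-D vector for this robot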

10.2 Image Observations

Camera observations are stored in MP4 files under videos/. The images field maps observation keys to sensor names defined in Section 11. Frame indices are stored in the Parquet data.
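
A minimal sketch of the lookup using OpenCV (path and index are examples); frame-accurate seeking in compressed video can be slow, so training pipelines often decode episodes sequentially instead.

Frame Lookup (illustrative)
import cv2

cap = cv2.VideoCapture("my_dataset/videos/cam_wrist/chunk-000/episode_000000.mp4")
cap.set(cv2.CAP_PROP_POS_FRAMES, 42)  # frame_index taken from steps.parquet
ok, frame_bgr = cap.read()            # OpenCV returns frames in BGR channel order
cap.release()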

11 Sensor Specification

Each sensor in the dataset MUST be fully specified in the manifest. This enables correct interpretation of the data and supports multi-sensor setups.

Sensor Schema
{
  "sensors": [
    {
      "name": "wrist_rgb",
      "type": "camera",
      "mount": "egocentric",
      "mount_link": "panda_hand",

      "resolution": {"width": 640, "height": 480},
      "fps": 30,
      "encoding": "h264",

      "intrinsics": {
        "fx": 612.0, "fy": 612.0,
        "cx": 320.0, "cy": 240.0
      },

      "extrinsics": {
        "parent_frame": "panda_hand",
        "translation": [0.05, 0.0, 0.02],
        "rotation": [0.707, 0.0, 0.707, 0.0]
      }
    },
    {
      "name": "overhead_rgb",
      "type": "camera",
      "mount": "exocentric",

      "resolution": {"width": 1280, "height": 720},
      "fps": 30,
      "encoding": "h264",

      "intrinsics": {
        "fx": 900.0, "fy": 900.0,
        "cx": 640.0, "cy": 360.0
      },

      "extrinsics": {
        "parent_frame": "world",
        "translation": [0.5, 0.0, 1.2],
        "rotation": [0.5, 0.5, -0.5, -0.5]
      }
    },
    {
      "name": "depth",
      "type": "depth_camera",
      "mount": "exocentric",

      "resolution": {"width": 640, "height": 480},
      "fps": 30,
      "bit_depth": 16,
      "depth_units": "millimeters",
      "depth_range": [0.1, 10.0],

      "extrinsics": {
        "parent_frame": "world",
        "translation": [0.5, 0.0, 1.2],
        "rotation": [0.5, 0.5, -0.5, -0.5]
      }
    },
    {
      "name": "audio",
      "type": "audio",

      "sample_rate": 44100,
      "channels": 1,
      "codec": "aac"
    }
  ]
}
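
For illustration, projecting a 3D point expressed in a camera frame onto the image plane uses the pinhole intrinsics directly (no distortion model); the values below are copied from the wrist_rgb example.

Pinhole Projection (illustrative)
import numpy as np

fx, fy, cx, cy = 612.0, 612.0, 320.0, 240.0   # wrist_rgb intrinsics

def project(p_cam):
    # p_cam: (x, y, z) in meters, expressed in the camera frame, z > 0
    x, y, z = p_cam
    return fx * x / z + cx, fy * y / z + cy

u, v = project(np.array([0.10, 0.05, 0.50]))  # -> pixel coordinates (u, v)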

11.1 Sensor Types

camera
RGB or depth camera.
Required: resolution, fps, intrinsics
Optional: distortion, depth_range, encoding

depth_camera
Depth sensor (structured light, ToF, stereo).
Required: resolution, fps, depth_units, depth_range
Optional: intrinsics, baseline, bit_depth

force_torque
6-axis force/torque sensor.
Required: rate_hz, range_force, range_torque
Optional: noise_model

lidar
2D or 3D LiDAR scanner.
Required: scan_rate, range, channels
Optional: fov_horizontal, fov_vertical

audio
Microphone.
Required: sample_rate, channels
Optional: codec, bit_depth

11.2 Mount Types

  • egocentric: Mounted on robot, moves with it. Requires mount_link specifying the attachment point.
  • exocentric: Fixed in environment, observes robot from outside. Requires extrinsics relative to world or base frame.

12 Coordinate Frames

All spatial data in ORTF is expressed relative to explicitly defined coordinate frames. This eliminates ambiguity when combining data from multiple sources.

Frame Definitions
{
  "frames": {
    "world": {
      "type": "fixed",
      "convention": "z_up",
      "description": "Global reference frame, gravity-aligned"
    },
    "robot_base": {
      "type": "fixed",
      "parent": "world",
      "transform": {
        "translation": [0.0, 0.0, 0.75],
        "rotation": [1.0, 0.0, 0.0, 0.0]
      },
      "description": "Robot base link origin"
    },
    "end_effector": {
      "type": "dynamic",
      "source": "forward_kinematics",
      "description": "Tool center point, computed from joint state"
    }
  }
}

12.1 Standard Frames

Frame          Description              Convention
robot_base     Robot base link origin   Z-up, X-forward
world          Global reference frame   Z-up, gravity-aligned
end_effector   Tool center point        Z along tool axis

12.2 Transform Convention

Transforms are specified as {translation: [x,y,z], rotation: [qw,qx,qy,qz]}. Rotation is a unit quaternion in (w, x, y, z) order. The transform takes points from the child frame to the parent frame: p_parent = T * p_child.
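
A minimal sketch of applying such a transform with SciPy, using the robot_base example above; note the reorder from the stored (w, x, y, z) to SciPy's (x, y, z, w) convention.

Applying a Transform (illustrative)
import numpy as np
from scipy.spatial.transform import Rotation

translation = np.array([0.0, 0.0, 0.75])      # robot_base in world (example)
q_wxyz = np.array([1.0, 0.0, 0.0, 0.0])       # identity rotation, (w, x, y, z)

R = Rotation.from_quat(np.roll(q_wxyz, -1))   # reorder to (x, y, z, w)
p_child = np.array([0.3, 0.0, 0.1])           # point in the robot_base frame
p_parent = R.apply(p_child) + translation     # the same point in the world frame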

13 Episode Structure

Episodes are indexed in meta/episodes.parquet. Each row contains episode-level metadata including success labels, operator notes, and file references:

Episode Index Schema
# meta/episodes.parquet columns:
episode_id:       string    # Unique episode identifier (UUID or sequential)
task_id:          int64     # Reference to tasks.jsonl
start_step:       int64     # First step index in data files
end_step:         int64     # End step index (exclusive)
length:           int64     # Number of steps
duration_seconds: float64   # Episode duration
success:          bool      # Task success label (null if not evaluated)
failure_reason:   string    # Reason for failure (null if success or not evaluated)
operator_notes:   string    # Free-form notes from teleoperator
chunk_id:         int64     # Which chunk contains this episode
recorded_at:      string    # ISO 8601 timestamp of recording
video_files: struct{        # Video file references per camera
  cam_wrist:      string
  cam_overhead:   string
  depth:          string
}
audio_file:       string    # Path to audio file (optional)
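
As an illustration, the index alone is enough to select episodes for training without touching step or video data; the sketch below assumes pandas and an example dataset path.

Episode Index Query (illustrative)
import pandas as pd

episodes = pd.read_parquet("my_dataset/meta/episodes.parquet")

# Successful episodes of task 0; fillna(False) treats unevaluated episodes as failures
succeeded = episodes["success"].fillna(False).astype(bool)
picked = episodes[(episodes["task_id"] == 0) & succeeded]
print(len(picked), "episodes,", picked["duration_seconds"].sum() / 3600, "hours")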

13.1 Step Data

Step-level data is stored in data/chunk-*/steps.parquet:

Steps Schema
# data/chunk-*/steps.parquet columns:
episode_id:       string
step_index:       int64
timestamp:        float64
is_first:         bool
is_last:          bool
is_terminal:      bool
action:           float32[7]
observation.state.joint_positions:    float32[7]
observation.state.joint_velocities:   float32[7]
observation.state.ee_position:        float32[3]
observation.state.ee_orientation:     float32[4]
observation.state.gripper_position:   float32[1]
observation.images.cam_wrist.frame_index:     int64
observation.images.cam_overhead.frame_index:  int64
observation.images.depth.frame_index:         int64
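
A minimal sketch of loading the step rows for one episode; the chunk and episode_id would normally come from the episode index, and the values below are placeholders.

Loading Step Data (illustrative)
import pandas as pd

steps = pd.read_parquet("my_dataset/data/chunk-000/steps.parquet")
ep = steps[steps["episode_id"] == "ep-000042"].sort_values("step_index")

actions = ep["action"].to_list()   # one 7-D action vector per step
wrist_idx = ep["observation.images.cam_wrist.frame_index"].to_numpy()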

13.2 Step Flags

Each step includes boundary flags following the RLDS convention:

Flag          Type   Description
is_first      bool   True only for first step of episode
is_last       bool   True only for last step of episode
is_terminal   bool   True if episode ended due to task completion/failure (not truncation)

13.3 Task Definitions

Episodes reference tasks by ID. Task definitions are stored in meta/tasks.jsonl:

Task Definitions
// meta/tasks.jsonl - one JSON object per line
{
  "task_id": 0,
  "type": "box_pick_v0",
  "instruction": "Pick up the red box (Box A) and place it in the bowl (Bowl A)",
  "description": "The robot picks up a colored box and places it in a container"
}
{
  "task_id": 1,
  "type": "drawer_open_v0",
  "instruction": "Open the drawer and retrieve the spoon",
  "description": "The robot opens a drawer and retrieves an object from inside"
}
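
For illustration, a consumer might load the task table as follows (path is an example).

Loading Task Definitions (illustrative)
import json

tasks = {}
with open("my_dataset/meta/tasks.jsonl") as f:
    for line in f:
        if line.strip():
            task = json.loads(line)
            tasks[task["task_id"]] = task

instruction = tasks[0]["instruction"]   # e.g. "Pick up the red box (Box A) ..."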

14 Annotations

ORTF supports rich annotations at both the episode and frame level. Annotations are stored in the annotations/ directory, organized by episode.

14.1 Scene Object Annotations

Bounding boxes and object labels for the first frame of each episode. Objects are typically referenced in the task instruction (e.g., "pick up Box A"):

Scene Objects
// annotations/episode_000000/scene_objects.json
{
  "annotated_at": "2025-12-21T10:30:00Z",
  "annotated_frame": 0,
  "annotated_frame_timestamp": 0.0,
  "objects": [
    {
      "label": "Box A",
      "class": "box",
      "bounding_box": {
        "x": 120,
        "y": 80,
        "width": 64,
        "height": 64
      },
      "color": "#ef4444"
    },
    {
      "label": "Bowl A",
      "class": "container",
      "bounding_box": {
        "x": 340,
        "y": 200,
        "width": 96,
        "height": 72
      },
      "color": "#3b82f6"
    },
    {
      "label": "Table B",
      "class": "surface",
      "bounding_box": {
        "x": 0,
        "y": 280,
        "width": 640,
        "height": 200
      },
      "color": "#22c55e"
    }
  ]
}
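
The sketch below reproduces an annotated first frame by drawing these boxes onto frame 0 with OpenCV; the paths and the choice of camera are illustrative.

Rendering Scene Annotations (illustrative)
import json
import cv2

with open("my_dataset/annotations/episode_000000/scene_objects.json") as f:
    annotation = json.load(f)

# Grab the annotated frame (annotated_frame = 0) from one camera stream
cap = cv2.VideoCapture("my_dataset/videos/cam_wrist/chunk-000/episode_000000.mp4")
ok, frame = cap.read()
cap.release()

for obj in annotation["objects"]:
    box = obj["bounding_box"]
    top_left = (box["x"], box["y"])
    bottom_right = (box["x"] + box["width"], box["y"] + box["height"])
    cv2.rectangle(frame, top_left, bottom_right, (0, 0, 255), 2)
    cv2.putText(frame, obj["label"], (box["x"], box["y"] - 5),
                cv2.FONT_HERSHEY_SIMPLEX, 0.5, (0, 0, 255), 1)

cv2.imwrite("first_frame_check.jpg", frame)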

14.2 Operator Notes

Free-form notes from the human teleoperator, stored in the operator_notes column of episodes.parquet. These capture qualitative observations about the recording:

Operator Notes Examples
// Example operator notes stored in episodes.parquet
"Robot arm was slightly unstable during reach. Gripper slipped on first attempt."
"Perfect execution. No issues."
"Had to restart recording - first attempt had calibration drift."
"Object was heavier than expected, caused slight overshoot on placement."

14.3 Episode-Level Files

  • scene_objects.json - Bounding boxes and object labels
  • first_frame.jpg - Annotated first frame with bounding boxes drawn
  • depth_colorized.jpg - Colorized depth visualization for verification

15 Timestamps and Synchronization

All temporal data uses float64 timestamps in seconds. The reference point depends on the timestamp_reference field in the manifest.

15.1 Timestamp Reference

  • unix: Seconds since Unix epoch (1970-01-01T00:00:00Z)
  • episode_start: Seconds since episode recording began
  • robot_boot: Seconds since robot controller boot

15.2 Multi-Modal Synchronization

Different sensors may operate at different frequencies. ORTF preserves native measurement rates rather than resampling. The sync_tolerance_ms field in the manifest specifies the maximum timestamp difference at which measurements from different modalities are still considered synchronized.

For video data, frame timestamps are stored in the Parquet file. The frame_index column maps each step to its corresponding video frame.
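
A minimal sketch of tolerance-based matching between two modalities; the timestamp arrays are toy examples and the tolerance value is taken from the example manifest.

Timestamp Matching (illustrative)
import numpy as np

step_t = np.array([0.0, 0.1, 0.2, 0.3])       # step timestamps, episode_start reference
frame_t = np.arange(0.0, 0.4, 1.0 / 30.0)     # a 30 fps camera stream

# Index of the nearest camera frame for every step, plus the residual error
nearest = np.abs(frame_t[None, :] - step_t[:, None]).argmin(axis=1)
error_ms = np.abs(frame_t[nearest] - step_t) * 1000.0
in_sync = error_ms <= 33.3                    # sync_tolerance_ms from the manifest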

16 Interoperability

ORTF is designed for lossless conversion to and from existing formats. The following conversions are defined:

16.1 LeRobot v3.0

ORTF uses the same underlying file formats as LeRobot (Parquet + MP4), making conversion straightforward:

  • meta/manifest.jsonmeta/info.json
  • meta/episodes.parquetmeta/episodes/
  • data/ structure maps directly
  • ORTF-specific metadata (frames, sensors, annotations) stored in meta/ortf_extended.json

16.2 RLDS / Open X-Embodiment

Conversion to RLDS TFRecord format:

  • Each episode → one RLDS episode
  • Step flags map directly (is_first, is_last, is_terminal)
  • Video frames decoded and stored as image tensors
  • Action space normalized to 7D EE format if needed (with dimension mapping in metadata)

16.3 Reference Implementations

ORTF → LeRobot
from ortf import load_dataset
from ortf.convert import to_lerobot

ds = load_dataset("path/to/ortf")
to_lerobot(ds, "output/lerobot")

ORTF → RLDS
from ortf import load_dataset
from ortf.convert import to_rlds

ds = load_dataset("path/to/ortf")
to_rlds(ds, "output/rlds")

17 Validation

ORTF datasets can be validated programmatically using the reference validator. Validation checks:

  • Manifest schema compliance
  • Required files present
  • Parquet schema matches manifest
  • Video file integrity and frame count
  • Episode boundaries consistent
  • Timestamp monotonicity
  • Action/observation dimensions match specification
  • Annotation file structure (if present)

Validation
# Install validator
pip install ortf-validator

# Validate dataset
ortf validate path/to/dataset

# Validate with strict mode (all optional checks)
ortf validate path/to/dataset --strict

# Check specific episode
ortf validate path/to/dataset --episode 000042
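
For readers without the validator installed, the sketch below implements one of the checks above (timestamp monotonicity) against plain pandas; it is illustrative, not the reference implementation.

Timestamp Monotonicity Check (illustrative)
import pandas as pd

steps = pd.read_parquet("my_dataset/data/chunk-000/steps.parquet")

for episode_id, ep in steps.groupby("episode_id"):
    ts = ep.sort_values("step_index")["timestamp"].to_numpy()
    if not (ts[1:] >= ts[:-1]).all():
        print(f"Non-monotonic timestamps in episode {episode_id}")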

18 Acknowledgments

We thank the following researchers and engineers for their review and feedback on this specification:

Reviewers will be listed here upon completion of the review process.

Interested in reviewing? Contact ortf@gerra.com

How to Cite

gerra. (2025). Open Robot Training Format (ORTF) Specification v0.2. https://gerra.com/research/ortf

Related Work