Irobot --> LeRobot Dataset conversion script

嘿——快到碗里来队员 2026-01-22 18:08:06


Tech share: using the convert.py script to automatically convert raw robot demonstration data into LeRobot datasets

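Background

In embodied-AI research and development, data quality and a standardized format are critical. The data we received from the Qianxun (千寻) robot does not match the dataset format required by training frameworks such as LeRobot. Converting it by hand is tedious and error-prone.

To address this, we wrote an automated conversion script called `convert.py`. This post covers its core features, how it works, and how to use it.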

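Core features

1. Structured scan: automatically scans the given source root and identifies every independent "run" that matches the `date/run_id/event_log.jsonl` layout.
2. Mistake filtering: reads each run's `event_log.jsonl` and, based on the `payload.is_mistake` flag, drops every demonstration episode marked as a mistake, so that only high-quality demonstrations reach the final dataset.
3. Format conversion: converts the filtered raw data (typically multi-view images, joint states, gripper states, and commands) into the standardized format defined by LeRobot datasets. This includes reordering image dimensions to (H, W, C) and concatenating state and action data into fixed-size tensors (16 values each in this script: 7 joints plus 1 gripper per arm).
4. Separate storage: creates one LeRobot dataset per run under the source directory. The dataset name combines the base ID, the date, and the run ID, which keeps different batches from being mixed up.
5. Flexible configuration: all key parameters (paths, frame rate, task description, and so on) live in an external JSON config file, so the script adapts to different projects without code changes.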
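Features 1 and 4 can be tried out without any robot data. The snippet below is a minimal sketch using only the standard library: it builds a throwaway directory tree, finds runs with the same `*/*/event_log.jsonl` glob that `convert.py` uses, and derives the per-run dataset names. The dates, run IDs, and the helper names `discover_runs`, `dataset_name`, and `demo` are made up for illustration.

```python
from __future__ import annotations

import tempfile
from pathlib import Path


def discover_runs(source_root: Path) -> list[tuple[Path, str, str]]:
    """Return (run_dir, date, run_id) for every date/run_id/event_log.jsonl."""
    logs = sorted(source_root.glob("*/*/event_log.jsonl"))
    return [(log.parent, log.parent.parent.name, log.parent.name) for log in logs]


def dataset_name(base_id: str, date: str, run_id: str) -> str:
    """Combine base ID, date and run ID into one dataset name."""
    return f"{base_id}_{date}_{run_id}"


def demo() -> list[str]:
    """Build a temporary tree with two valid runs and one stray folder."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        # Hypothetical dates and run IDs, only for this demo.
        for date, run_id in [("20260120", "run_001"), ("20260121", "run_002")]:
            run_dir = root / date / run_id
            run_dir.mkdir(parents=True)
            (run_dir / "event_log.jsonl").write_text("", encoding="utf-8")
        # This folder has no event_log.jsonl, so it is not treated as a run.
        (root / "20260121" / "scratch").mkdir()
        return [dataset_name("my_robot_task", d, r) for _, d, r in discover_runs(root)]
```

Because the glob is anchored at exactly two directory levels, runs nested deeper or shallower than `date/run_id/` would not be picked up.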

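How it works

The entry point is the `__main__` block, which first loads a JSON config file from a hard-coded path (for example `~/convert_config.json`).

The main flow is driven by the `convert_folder` function:
1. `find_run_roots`: looks under the configured `source_root` for directories that contain an `event_log.jsonl` exactly two levels down (`date/run_id/`). Each such directory is one independent robot task run. If none are found, it raises `FileNotFoundError`.
2. For each run directory found:
   a. Build a unique dataset `repo_id` from the date and the run ID.
   b. `get_or_create_dataset`: create (or load) a LeRobot dataset under `output_root` that follows the `RECORDED_FEATURES` schema.
   c. `ensure_tasks_jsonl`: write the task description into the dataset metadata (`meta/tasks.jsonl`).
   d. `convert_single_run`: run the core conversion logic.
      - Load the raw run data as a temporary LeRobotDataset.
      - Call `filter_mistake_episode_ids` to parse `event_log.jsonl` and collect the set of mistake episode indices to skip.
      - Iterate over the raw data frame by frame with a DataLoader.
      - Build up an episode buffer in memory. When the episode index changes, check whether the previous episode was flagged as a mistake: if not, `dataset.save_episode()` writes it to disk; if so, `dataset.clear_episode_buffer()` discards it. The same check runs once more after the loop for the final episode.
      - Frames from mistake episodes are skipped. For every other frame, the script takes the `cam_high` (head view), `cam_left_wrist`, and `cam_right_wrist` images plus both arms' joint states, gripper states, and commands, regroups them under the key names defined in `RECORDED_FEATURES` (such as `observation.images.head`, `observation.state`, `action`), and appends them to the buffer with `dataset.add_frame`.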
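The filter-and-buffer logic of step d can be sketched in isolation. This is a simplified illustration, not the script itself: `FakeDataset` is a hypothetical stand-in that only mimics the three methods the loop relies on (the real `add_frame` takes a feature dict and a task), frames are reduced to `(episode_index, frame_id)` pairs, and the sample log lines contain only the two `payload` fields the script reads.

```python
from __future__ import annotations

import json


def mistake_episode_ids(lines: list[str]) -> set[int]:
    """Collect payload.episode_idx for records whose payload.is_mistake is true."""
    ids: set[int] = set()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            payload = json.loads(line).get("payload", {})
        except json.JSONDecodeError:
            continue  # malformed lines are skipped, as in convert.py
        if payload.get("is_mistake") is True and "episode_idx" in payload:
            ids.add(int(payload["episode_idx"]))
    return ids


class FakeDataset:
    """Hypothetical stand-in for the dataset methods used by the loop."""

    def __init__(self) -> None:
        self.buffer: list[int] = []
        self.saved: list[list[int]] = []

    def add_frame(self, frame: int) -> None:
        self.buffer.append(frame)

    def save_episode(self) -> None:
        self.saved.append(self.buffer)
        self.buffer = []

    def clear_episode_buffer(self) -> None:
        self.buffer = []


def copy_good_episodes(frames: list[tuple[int, int]], bad: set[int]) -> list[list[int]]:
    """Replay (episode_index, frame_id) pairs and keep only unflagged episodes."""
    dataset = FakeDataset()
    current = None

    def flush(episode: int) -> None:
        if episode in bad:
            dataset.clear_episode_buffer()
        else:
            dataset.save_episode()

    for episode, frame_id in frames:
        if current is not None and episode != current:
            flush(current)  # episode boundary: save or discard the previous one
        current = episode
        if episode not in bad:
            dataset.add_frame(frame_id)
    if current is not None:
        flush(current)  # the last episode has no following boundary
    return dataset.saved


log_lines = [
    '{"payload": {"episode_idx": 1, "is_mistake": true}}',
    "not json",
    '{"payload": {"episode_idx": 2, "is_mistake": false}}',
]
bad_ids = mistake_episode_ids(log_lines)
kept = copy_good_episodes([(0, 10), (0, 11), (1, 20), (1, 21), (2, 30)], bad_ids)
```

Here episode 1 is flagged, so `kept` holds the frames of episodes 0 and 2 only. The flush after the loop matters: without it the final episode would stay in the buffer and never be written.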

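How to use

1. Prepare the environment: make sure the `torch` and `lerobot` libraries are installed.
2. Prepare the data: organize your raw data as `source_root/date/run_id/`. Each run ID folder should contain the video files, the state data, and the essential `event_log.jsonl` file.
3. Write the config file: create a JSON file (for example `convert_config.json`) like the example below. The `//` comments are explanatory only. Standard JSON does not allow comments and `json.load` will reject them, so remove them from the real file.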

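{
  "source_root": "/path/to/your/raw_demos",
  // [Required] Root directory of the raw data
  "output_root": "/path/to/output/datasets",
  // [Required] Root directory for the output datasets
  "repo_id": "my_robot_task",
  // [Required] Base name of the dataset
  "fps": 30, // [Optional] Frame rate. Defaults to 30
  "robot_type": "moz1", // [Optional] Robot type. Defaults to "moz1"
  "task_text": "Pick up the blue block and place it in the basket.",
  // [Optional] Task description text
  "num_workers": 4, // [Optional] Number of data-loading worker processes. Defaults to 4
  "append": false,
  // [Optional] Whether to append to an existing dataset. Defaults to false. If true and a dataset with the same name already exists in the output directory, new episodes are appended to it; if false, an existing dataset causes an error.
  "video_backend": "pyav"
  // [Optional] Video decoding backend. Defaults to "pyav". Another commonly used option is "decord" (check that your lerobot version supports it). Pick whichever backend performs best in your environment.
}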

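4. Set the config path in the script: in the `__main__` block at the bottom of `convert.py`, change the hard-coded `config_path` to the actual path of your config file.
5. Run the script: execute `python convert.py` on the command line. The script walks through, filters, and converts every detected run, printing progress to the terminal.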
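If you would rather keep the explanatory `//` comments in the config file from step 3, one option is to strip them before parsing. The following is a minimal sketch, not part of `convert.py`: `strip_line_comments` and `load_config` are hypothetical helpers, while the required keys and default values are the ones the script's `__main__` block uses.

```python
from __future__ import annotations

import json

REQUIRED = ("source_root", "output_root", "repo_id")
DEFAULTS = {
    "fps": 30,
    "robot_type": "moz1",
    "task_text": "Put all items on the table into the basket.",
    "num_workers": 4,
    "append": False,
    "video_backend": "pyav",
}


def strip_line_comments(text: str) -> str:
    """Remove // comments that sit outside JSON string literals."""
    cleaned = []
    for line in text.splitlines():
        in_string = False
        escaped = False
        cut = len(line)
        for i, ch in enumerate(line):
            if escaped:
                escaped = False  # this character was escaped by a backslash
            elif ch == "\\" and in_string:
                escaped = True
            elif ch == '"':
                in_string = not in_string
            elif not in_string and line.startswith("//", i):
                cut = i  # comment starts here; drop the rest of the line
                break
        cleaned.append(line[:cut])
    return "\n".join(cleaned)


def load_config(text: str) -> dict:
    """Parse commented JSON, check required keys, and fill in defaults."""
    config = json.loads(strip_line_comments(text))
    missing = [key for key in REQUIRED if key not in config]
    if missing:
        raise ValueError(f"Config is missing required keys: {missing}")
    return {**DEFAULTS, **config}


SAMPLE = """
{
  "source_root": "/path/to/your/raw_demos",  // required
  "output_root": "/path/to/output/datasets",
  // a full-line comment
  "repo_id": "my_robot_task",
  "task_text": "Pick up the block // then place it",
  "fps": 15
}
"""

config = load_config(SAMPLE)
```

The scanner tracks whether it is inside a string, so a `//` that appears inside a value (as in `task_text` above) is left alone. In the script, the text would come from reading the file at `config_path` instead of the `SAMPLE` constant.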

***Note: the script must be placed in the src folder of the lerobot project to run correctly.

*The full content of convert.py follows:

from __future__ import annotations

import json
from pathlib import Path
from typing import List, Set

import torch
from lerobot.datasets.lerobot_dataset import LeRobotDataset

# Feature schema expected by the output dataset
RECORDED_FEATURES = {
    "observation.images.head": {
        "dtype": "video",
        "shape": (240, 320, 3),
        "names": ("height", "width", "channels"),
        "info": None,
    },
    "observation.images.left_hand": {
        "dtype": "video",
        "shape": (240, 320, 3),
        "names": ("height", "width", "channels"),
        "info": None,
    },
    "observation.images.right_hand": {
        "dtype": "video",
        "shape": (240, 320, 3),
        "names": ("height", "width", "channels"),
        "info": None,
    },
    "observation.state": {
        "dtype": "float32",
        "shape": (16,),
        "names": {
            "motors": (
                "left_joint_0",
                "left_joint_1",
                "left_joint_2",
                "left_joint_3",
                "left_joint_4",
                "left_joint_5",
                "left_joint_6",
                "left_gripper",
                "right_joint_0",
                "right_joint_1",
                "right_joint_2",
                "right_joint_3",
                "right_joint_4",
                "right_joint_5",
                "right_joint_6",
                "right_gripper",
            )
        },
    },
    "action": {
        "dtype": "float32",
        "shape": (16,),
        "names": {
            "motors": (
                "left_joint_0",
                "left_joint_1",
                "left_joint_2",
                "left_joint_3",
                "left_joint_4",
                "left_joint_5",
                "left_joint_6",
                "left_gripper",
                "right_joint_0",
                "right_joint_1",
                "right_joint_2",
                "right_joint_3",
                "right_joint_4",
                "right_joint_5",
                "right_joint_6",
                "right_gripper",
            )
        },
    },
}


def find_run_roots(source_root: Path) -> List[Path]:
    """Return run directories that contain event_log.jsonl.

    Expected structure under source_root: date/run_id/event_log.jsonl.
    """
    runs = sorted(p.parent for p in source_root.glob("*/*/event_log.jsonl"))
    if not runs:
        raise FileNotFoundError(f"No runs found under {source_root}")
    return runs


def filter_mistake_episode_ids(event_log_path: Path) -> Set[int]:
    """Read event_log.jsonl and collect episode_idx where payload.is_mistake is true."""
    episode_ids: Set[int] = set()
    with event_log_path.open("r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            try:
                payload = json.loads(line).get("payload", {})
            except json.JSONDecodeError:
                continue
            if payload.get("is_mistake") is True and "episode_idx" in payload:
                episode_ids.add(int(payload["episode_idx"]))
    return episode_ids


def ensure_tasks_jsonl(dataset_root: Path, task_text: str) -> None:
    meta_dir = dataset_root / "meta"
    meta_dir.mkdir(parents=True, exist_ok=True)
    tasks_path = meta_dir / "tasks.jsonl"
    tasks_path.write_text(
        json.dumps({"task_index": 0, "task": task_text}, ensure_ascii=False) + "\n",
        encoding="utf-8",
    )


def get_or_create_dataset(
    repo_id: str,
    output_root: Path,
    fps: int,
    robot_type: str,
    allow_append: bool,
) -> LeRobotDataset:
    dataset_root = output_root / repo_id
    if dataset_root.exists():
        if not allow_append:
            raise FileExistsError(
                f"Dataset root {dataset_root} exists. Remove it or set 'append' to true in the config to add episodes."
            )
        return LeRobotDataset(repo_id, root=dataset_root)
    return LeRobotDataset.create(
        repo_id,
        fps,
        root=dataset_root,
        robot_type=robot_type,
        features=RECORDED_FEATURES,
        image_writer_processes=0,
        image_writer_threads=8,
    )


def convert_single_run(
    dataset: LeRobotDataset,
    run_root: Path,
    task_text: str,
    num_workers: int,
    video_backend: str,
) -> None:
    source_dataset = LeRobotDataset("", root=run_root, video_backend=video_backend)
    invalid_episode_ids = filter_mistake_episode_ids(run_root / "event_log.jsonl")

    dataloader = torch.utils.data.DataLoader(
        source_dataset,
        num_workers=num_workers,
        batch_size=1,
        shuffle=False,
        sampler=None,
        pin_memory=False,
        drop_last=False,
    )

    current_episode = None
    for frame in dataloader:
        episode_index = int(frame["episode_index"].item())
        if current_episode is None:
            current_episode = episode_index
            print(f"=== Processing episode {current_episode}/{source_dataset.num_episodes - 1} in {run_root} ===")
        elif episode_index != current_episode:
            if current_episode not in invalid_episode_ids:
                dataset.save_episode()
            else:
                dataset.clear_episode_buffer()
            current_episode = episode_index
            print(f"=== Processing episode {current_episode}/{source_dataset.num_episodes - 1} in {run_root} ===")

        if episode_index in invalid_episode_ids:
            continue

        # torch.as_tensor returns an existing tensor unchanged, so no isinstance check is needed.
        left_hand_image = torch.as_tensor(frame["cam_left_wrist"])
        right_hand_image = torch.as_tensor(frame["cam_right_wrist"])
        head_image = torch.as_tensor(frame["cam_high"])

        left_joint_states = torch.as_tensor(frame["leftarm_state_joint_pos"])
        left_gripper_state = torch.as_tensor(frame["leftarm_gripper_state_pos"])
        right_joint_states = torch.as_tensor(frame["rightarm_state_joint_pos"])
        right_gripper_state = torch.as_tensor(frame["rightarm_gripper_state_pos"])

        left_joint_action = torch.as_tensor(frame["leftarm_cmd_joint_pos"])
        left_gripper_action = torch.as_tensor(frame["leftarm_gripper_cmd_pos"])
        right_joint_action = torch.as_tensor(frame["rightarm_cmd_joint_pos"])
        right_gripper_action = torch.as_tensor(frame["rightarm_gripper_cmd_pos"])

        observation = {
            "observation.state": torch.cat(
                [left_joint_states[0], left_gripper_state, right_joint_states[0], right_gripper_state], dim=0
            ).to(dtype=torch.float32),
            "observation.images.head": head_image[0].permute(1, 2, 0).to(dtype=torch.float32),
            "observation.images.left_hand": left_hand_image[0].permute(1, 2, 0).to(dtype=torch.float32),
            "observation.images.right_hand": right_hand_image[0].permute(1, 2, 0).to(dtype=torch.float32),
        }

        action = {
            "action": torch.cat(
                [left_joint_action[0], left_gripper_action, right_joint_action[0], right_gripper_action], dim=0
            ).to(dtype=torch.float32)
        }

        dataset.add_frame(frame={**observation, **action}, task=task_text)

    if current_episode is not None:
        if current_episode not in invalid_episode_ids:
            dataset.save_episode()
        else:
            dataset.clear_episode_buffer()


def convert_folder(
    source_root: Path,
    output_root: Path,
    repo_id: str,
    fps: int,
    robot_type: str,
    task_text: str,
    num_workers: int,
    append: bool,
    video_backend: str,
) -> None:
    runs = find_run_roots(source_root)
    print(f"Found {len(runs)} runs. Generating separate datasets for each run.")

    for i, run_root in enumerate(runs):
        # Generate a unique repo_id for each run to store them separately
        # Structure is usually: source_root / date_string / run_id / event_log.jsonl
        # We use the date_string and run_id to create a unique dataset name.
        date_part = run_root.parent.name
        run_id_part = run_root.name
        
        # New repo_id format: {base_repo_id}_{date}_{run_id}
        current_repo_id = f"{repo_id}_{date_part}_{run_id_part}"
        
        print(f"\n[{i+1}/{len(runs)}] Converting run '{run_root.parent.name}/{run_root.name}' -> Dataset: '{current_repo_id}'")

        dataset = get_or_create_dataset(current_repo_id, output_root, fps, robot_type, allow_append=append)
        ensure_tasks_jsonl(output_root / current_repo_id, task_text)

        convert_single_run(dataset, run_root, task_text, num_workers=num_workers, video_backend=video_backend)
        
    print("=== Conversion complete ===")


if __name__ == "__main__":
    # Hardcoded config path as requested
    config_path = Path("PATH TO convert_config.json")  # Replace this with your actual config path!
    
    if not config_path.exists():
         raise FileNotFoundError(f"Config file not found at {config_path.absolute()}")

    print(f"Loading config from {config_path.absolute()}")
    with open(config_path, "r", encoding="utf-8") as f:
        config = json.load(f)

    # Validate required keys
    required_keys = ["source_root", "output_root", "repo_id"]
    for key in required_keys:
        if key not in config:
            raise ValueError(f"Config file must contain '{key}'")

    convert_folder(
        source_root=Path(config["source_root"]),
        output_root=Path(config["output_root"]),
        repo_id=config["repo_id"],
        fps=config.get("fps", 30),
        robot_type=config.get("robot_type", "moz1"),
        task_text=config.get("task_text", "Put all items on the table into the basket."),
        num_workers=config.get("num_workers", 4),
        append=config.get("append", False),
        video_backend=config.get("video_backend", "pyav"),
    )
