Tool use is critical for enabling robots to perform complex real-world tasks, and leveraging human tool-use data can be instrumental for teaching robots. However, existing data collection methods such as teleoperation are slow, prone to control delays, and unsuitable for dynamic tasks. In contrast, human play, where humans directly perform tasks with tools, offers natural, unstructured interactions that are both efficient and easy to collect. Building on the insight that humans and robots can share the same tools, we propose a framework that transfers tool-use knowledge from human play to robots. Using two RGB cameras, our method generates a 3D reconstruction of the scene, applies Gaussian splatting for novel-view augmentation, employs segmentation models to extract embodiment-agnostic observations, and leverages task-space tool-action representations to train visuomotor policies. We validate our approach on diverse real-world tasks, including meatball scooping, pan flipping, wine-bottle balancing, and other complex tasks. Our method achieves a 71% higher average success rate than diffusion policies trained on teleoperation data and reduces data collection time by 77%, with some tasks solvable only by our framework. It also bridges the embodiment gap, improves robustness to variations in camera viewpoints and robot configurations, and generalizes effectively across objects and spatial setups.
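To make the task-space tool-action representation concrete, the sketch below converts a sequence of estimated 6-DoF tool poses into relative tool-frame actions (a translation plus a rotation vector per step). It is a minimal illustration with hypothetical pose values and helper names; the exact parameterization used by our trained policies may differ.

```python
import numpy as np
from scipy.spatial.transform import Rotation as R


def pose_to_matrix(position, quat_xyzw):
    """Build a 4x4 homogeneous transform from a position and an (x, y, z, w) quaternion."""
    T = np.eye(4)
    T[:3, :3] = R.from_quat(quat_xyzw).as_matrix()
    T[:3, 3] = position
    return T


def relative_tool_actions(tool_poses):
    """Turn absolute tool poses (4x4 transforms in the task frame) into relative
    6-DoF actions: the motion from step t to step t+1, expressed in the tool frame."""
    actions = []
    for T_prev, T_next in zip(tool_poses[:-1], tool_poses[1:]):
        delta = np.linalg.inv(T_prev) @ T_next          # motion in the previous tool frame
        d_pos = delta[:3, 3]                            # 3D translation
        d_rot = R.from_matrix(delta[:3, :3]).as_rotvec()  # 3D rotation vector
        actions.append(np.concatenate([d_pos, d_rot]))
    return np.stack(actions)


# Example: a short synthetic tool trajectory (hypothetical values).
poses = [
    pose_to_matrix([0.40, 0.00, 0.20], [0, 0, 0, 1]),
    pose_to_matrix([0.42, 0.01, 0.22], R.from_euler("z", 5, degrees=True).as_quat()),
    pose_to_matrix([0.45, 0.02, 0.25], R.from_euler("z", 12, degrees=True).as_quat()),
]
print(relative_tool_actions(poses).shape)  # (2, 6)
```

Because the human and the robot hold the same tool, actions expressed in the tool frame can be executed by the robot regardless of who produced the demonstration, which is what makes the observations and actions embodiment-agnostic.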
Pan flipping involves dynamically flipping various objects, such as eggs, burger buns, and meat patties, demanding precision, agility, and the ability to adapt motion control to different objects. Our framework enables robots to learn these highly dynamic movements.
Our framework is robust to various perturbations, including a moving camera, moving base, and human perturbations.
This section highlights the precision capabilities of our framework in tasks that require high accuracy, such as hammering a nail and balancing a wine bottle.
This example shows a 3D reconstruction generated with MASt3R from the two input images shown below. You can interact with the 3D model by rotating it and zooming in and out.
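For reference, the snippet below sketches two-view inference in the style of the public naver/mast3r repository. The import paths, checkpoint name, and output keys are taken from that repository's examples and may have changed, so treat this as an assumption rather than our exact reconstruction code.

```python
# Sketch of two-view 3D reconstruction with MASt3R (names follow the public
# naver/mast3r repo and are assumptions, not necessarily our exact pipeline).
import torch
from mast3r.model import AsymmetricMASt3R
import mast3r.utils.path_to_dust3r  # noqa: F401  (makes the bundled dust3r importable)
from dust3r.inference import inference
from dust3r.utils.image import load_images

device = "cuda" if torch.cuda.is_available() else "cpu"
model = AsymmetricMASt3R.from_pretrained(
    "naver/MASt3R_ViTLarge_BaseDecoder_512_catmlpdpt_metric"
).to(device)

# Two RGB views of the same scene (hypothetical file names).
images = load_images(["cam_left.png", "cam_right.png"], size=512)
output = inference([tuple(images)], model, device, batch_size=1)

# Per-pixel 3D point maps: view 1 in its own frame, view 2 expressed in view 1's frame.
pts3d_view1 = output["pred1"]["pts3d"]
pts3d_view2 = output["pred2"]["pts3d_in_other_view"]
print(pts3d_view1.shape, pts3d_view2.shape)
```

The resulting point maps share the first camera's frame, providing the geometry on which later steps, such as novel-view augmentation, can build.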
Human manipulation is inherently more natural and intuitive than teleoperation. Consequently, our policy generates significantly faster and smoother action trajectories than policies trained on data collected with teleoperation devices.
Human manipulation showcases unparalleled versatility, ranging from delicate precision tasks to intense, contact-rich interactions and dynamic, high-speed maneuvers, none of which are well supported by traditional teleoperation systems.
Our framework may fail to complete the task under certain conditions. If the camera moves too quickly, the system can lose track of key visual cues, leading to task failure. In pan flipping, a burger bun might bounce out of the pan when the robot applies excessive upward force.
Our trained pan-flipping policy assists in food preparation, collaborating with humans to flip meat patties or eggs for burgers and sandwiches.