Tool-as-Interface: Learning Robot Tool Use
from Human Play through Imitation Learning

1 University of Illinois Urbana-Champaign, 2 Columbia University

Autonomous Pan Flipping

Abstract

Tool use is critical for enabling robots to perform complex real-world tasks, and leveraging human tool-use data can be instrumental for teaching robots. However, existing data collection methods like teleoperation are slow, prone to control delays, and unsuitable for dynamic tasks. In contrast, human play, where humans directly perform tasks with tools, offers natural, unstructured interactions that are both efficient and easy to collect. Building on the insight that humans and robots can share the same tools, we propose a framework to transfer tool-use knowledge from human play to robots. Using two RGB cameras, our method builds a 3D reconstruction, applies Gaussian splatting for novel view augmentation, employs segmentation models to extract embodiment-agnostic observations, and leverages task-space tool-action representations to train visuomotor policies. We validate our approach on diverse real-world tasks, including meatball scooping, pan flipping, wine bottle balancing, and other complex tasks. Our method achieves a 71% higher average success rate compared to diffusion policies trained with teleoperation data and reduces data collection time by 77%, with some tasks solvable only by our framework. Additionally, our method bridges the embodiment gap, improves robustness to variations in camera viewpoints and robot configurations, and generalizes effectively across objects and spatial setups.
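To make the idea of a task-space tool-action representation concrete, the sketch below expresses a tool pose observed in a camera frame in the robot base frame, then defines an action as the relative motion between two consecutive tool poses. This is a minimal illustration under assumptions, not the authors' implementation: the function names, the 4x4 homogeneous-matrix convention, and the example poses are all hypothetical.

```python
import math

def matmul(a, b):
    """Multiply two 4x4 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]

def invert_rigid(t):
    """Invert a rigid transform [R | p] as [R^T | -R^T p]."""
    r_t = [[t[j][i] for j in range(3)] for i in range(3)]
    p = [t[i][3] for i in range(3)]
    new_p = [-sum(r_t[i][k] * p[k] for k in range(3)) for i in range(3)]
    return [r_t[i] + [new_p[i]] for i in range(3)] + [[0.0, 0.0, 0.0, 1.0]]

def rot_z(theta, x=0.0, y=0.0, z=0.0):
    """Rigid transform: rotation about z by theta, then translation (x, y, z)."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s, 0.0, x],
            [s, c, 0.0, y],
            [0.0, 0.0, 1.0, z],
            [0.0, 0.0, 0.0, 1.0]]

def tool_pose_in_base(base_from_camera, camera_from_tool):
    """Chain transforms to express the tool pose in the robot base frame."""
    return matmul(base_from_camera, camera_from_tool)

def relative_action(pose_now, pose_next):
    """Tool motion between two time steps, expressed in the current tool frame."""
    return matmul(invert_rigid(pose_now), pose_next)

# Hypothetical camera extrinsics and two consecutive tool observations.
base_from_camera = rot_z(math.pi / 2, x=1.0, z=0.5)
tool_t0 = tool_pose_in_base(base_from_camera, rot_z(0.0, x=0.2))
tool_t1 = tool_pose_in_base(base_from_camera, rot_z(0.0, x=0.3))
action = relative_action(tool_t0, tool_t1)
```

Because the action is a relative transform between tool poses, the camera extrinsics cancel out in this sketch, and nothing in it refers to a hand or a gripper. That is one plausible reading of why a tool-centric action space helps with viewpoint changes and the embodiment gap.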

Pan Flipping with Different Objects

Pan flipping involves dynamically flipping various objects, such as eggs, burger buns, and meat patties. This demonstrates precision, agility, and the ability to adapt to different challenges in motion control. Our framework enables robots to learn highly dynamic movements.

Flipping an Egg
Flipping a Burger Bun
Flipping a Meat Patty

Robustness Testing

Our framework is robust to various perturbations, including a moving camera, moving base, and human perturbations.

Robustness to Shaking Camera

Nail Hammering
Meatball Scooping
Pan Flipping - Egg

Base Robustness, Trajectory Memorization, and End Effector Stabilization Testing

Testing Robustness to a Shaking Base
Trajectory Memorization Test: Camera Shaking Followed by Black Input
Testing Whether the Policy Can Keep the End Effector Stable, Like a Chicken's Head, While the Base Is Moving
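The geometry behind the end-effector stabilization test can be sketched in a planar setting: if the end effector should stay at a fixed point in the world while the base moves, its target expressed in the base frame must change to compensate. The snippet below only illustrates that geometric relationship. It is not how the learned policy works internally, and the function name and the example numbers are assumptions.

```python
import math

def world_to_base(target_xy, base_x, base_y, base_yaw):
    """Express a world-frame point in the frame of a base at (x, y, yaw)."""
    dx = target_xy[0] - base_x
    dy = target_xy[1] - base_y
    c, s = math.cos(base_yaw), math.sin(base_yaw)
    # Apply the inverse rotation R(yaw)^T to the offset from the base origin.
    return (c * dx + s * dy, -s * dx + c * dy)

# Hypothetical fixed world-frame goal for the end effector.
goal = (1.0, 0.0)

at_rest = world_to_base(goal, 0.0, 0.0, 0.0)
shifted = world_to_base(goal, 0.2, 0.0, 0.0)          # base moved forward
rotated = world_to_base(goal, 0.0, 0.0, math.pi / 2)  # base turned left
```

When the base moves forward by 0.2, the base-frame target moves back by the same amount, so the end effector stays put in the world.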

Robustness to Human Perturbation

Nail Tracking
Human Throwing Meatball
Human Flipping Back Multiple Times

Precision Tests

This section highlights the precision capabilities of our framework in tasks that require high accuracy, such as hammering a nail and wine balancing.

Hammering a Nail
Wine Balancing

Example 3D Reconstruction

This example demonstrates a 3D reconstruction generated using MASt3R. The model is created from the two input images shown below. You can interact with the 3D model by rotating and zooming in/out.

Input Image 1
Input Image 2

3D Reconstruction

Trajectory Smoothness Comparison

Human manipulation is inherently more natural and intuitive than teleoperation. Consequently, our policy generates significantly faster and smoother action trajectories than policies trained on teleoperated demonstrations.
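One common way to quantify trajectory smoothness is the mean squared jerk, the third time derivative of position. The sketch below estimates it with finite differences on a uniformly sampled one-dimensional trajectory. It is an illustrative metric rather than the evaluation used on this page, and the two example trajectories are made-up numbers, not measured data.

```python
def finite_difference(values, dt):
    """First-order finite difference of a uniformly sampled signal."""
    return [(b - a) / dt for a, b in zip(values, values[1:])]

def mean_squared_jerk(positions, dt):
    """Mean squared third derivative of a uniformly sampled 1-D trajectory."""
    if len(positions) < 4:
        raise ValueError("need at least 4 samples to estimate jerk")
    jerk = positions
    for _ in range(3):  # position -> velocity -> acceleration -> jerk
        jerk = finite_difference(jerk, dt)
    return sum(j * j for j in jerk) / len(jerk)

# Hypothetical trajectories sampled every 0.1 s (illustrative values only).
steady = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5]
stop_and_go = [0.0, 0.15, 0.15, 0.35, 0.35, 0.5]

steady_score = mean_squared_jerk(steady, dt=0.1)
stop_and_go_score = mean_squared_jerk(stop_and_go, dt=0.1)
```

Lower values indicate smoother motion: the constant-velocity trajectory scores near zero, while the stop-and-go trajectory that covers the same distance scores far higher.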

Ours
Policy Trained on SpaceMouse-Collected Demonstrations

Data Collection Methods

Human manipulation showcases unparalleled versatility, ranging from delicate precision tasks to intense, contact-rich interactions and dynamic, high-speed maneuvers. Traditional teleoperation systems capture none of these effectively.

Meatball Scooping

Hammering a Nail

Pan Flipping

Wine Balancing

Playing Soccer

Failure Modes

Our framework may fail to complete the task under certain conditions. If the camera moves too quickly, the system can lose track of key visual cues, leading to task failure. In pan flipping, a burger bun might bounce out of the pan when the robot applies excessive upward force.

Fast Camera Failure
Burger Bun Out of Pan

Food Preparation

Our trained pan-flipping policy assists in food preparation, collaborating with humans to flip meat patties or eggs for burgers and sandwiches.

Burger Preparation
Sandwich Preparation