Verifiably Following Complex Robot Instructions with Foundation Models

Supplementary Videos and Demonstrations
LIMP teaser figure.

This figure visualizes our approach, Language Instruction grounding for Motion Planning (LIMP), executing the instruction: "Bring the green plush toy to the whiteboard in front of it, watch out for the robot in front of the toy". LIMP has no prior semantic information about the environment; instead, at runtime, our approach leverages VLMs and spatial reasoning to detect and ground open-vocabulary instruction referents. LIMP then generates a verifiably correct task and motion plan that enables the robot to navigate from its start location (yellow, A) to the green plush toy (green, B), execute a pick skill that searches for and grasps the object, navigate to the whiteboard (blue, C) while avoiding the robot in the space (red circles), and finally execute a place skill that sets the object down.
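For concreteness, the Python sketch below illustrates the high-level flow the figure depicts: translate the instruction, ground its open-vocabulary referents, then plan and execute navigation and manipulation skills. It is only an illustrative skeleton; all function names, data structures, and coordinates are hypothetical placeholders and do not reflect the actual LIMP implementation.

# A minimal, hypothetical sketch of the pipeline shown in the figure:
# instruction translation, open-vocabulary referent grounding, and
# task-and-motion planning. Names and coordinates are placeholders.
from __future__ import annotations
from dataclasses import dataclass


@dataclass
class Referent:
    phrase: str                    # open-vocabulary phrase, e.g. "green plush toy"
    position: tuple[float, float]  # grounded (x, y) location in the map frame


def translate_instruction(instruction: str) -> list[str]:
    # Placeholder: an LLM would translate the instruction into a temporal-logic
    # task specification; here we only extract the ordered referent phrases.
    return ["green plush toy", "whiteboard", "robot"]


def ground_referents(phrases: list[str]) -> dict[str, Referent]:
    # Placeholder: a VLM detector plus spatial reasoning would localize each
    # phrase in the robot's map at runtime; here we return fixed coordinates.
    fake_positions = {
        "green plush toy": (2.0, 1.0),
        "whiteboard": (5.0, 3.0),
        "robot": (3.5, 2.0),
    }
    return {p: Referent(p, fake_positions[p]) for p in phrases}


def plan_and_execute(instruction: str) -> None:
    phrases = translate_instruction(instruction)
    grounded = ground_referents(phrases)
    # Placeholder: a task-and-motion planner would verify the skill sequence
    # against the task specification; here we simply print the plan skeleton.
    print("navigate to", grounded["green plush toy"].position)
    print("pick: green plush toy")
    print("navigate to", grounded["whiteboard"].position,
          "while avoiding", grounded["robot"].position)
    print("place: green plush toy")


if __name__ == "__main__":
    plan_and_execute(
        "Bring the green plush toy to the whiteboard in front of it, "
        "watch out for the robot in front of the toy"
    )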

Real-World Demonstration
(All robot videos are shown at 1x speed.)

Below are additional real-world examples of LIMP generating task and motion plans to follow expressive instructions with complex spatiotemporal constraints. Each demonstration has two videos: the top video visualizes instruction translation, referent grounding, task-progression semantic maps, and the computed motion plan for that example; the bottom video shows a robot executing the generated plan in the real world. Please see our paper for more details on our approach.


Demo 1

Demo 2

Demo 3