
Dynamic Movement Primitives (DMPs) provide a flexible framework for encoding smooth robotic movements; however, they struggle to integrate the multimodal inputs, such as vision and language, that are increasingly common in robotics.
To realize DMPs' full potential, it is crucial to enable them to respond to such multimodal inputs and to broaden their capability to object-focused tasks whose complex motions must be planned in one shot, since observation occlusion can easily occur mid-execution in such tasks (e.g., knife occlusion when cutting ingredients, piping-bag occlusion when icing a cake, hand occlusion when kneading dough).
A promising approach is to leverage Vision-Language Models (VLMs), which process multimodal data and grasp high-level concepts.
However, VLMs typically lack the knowledge and capability to infer low-level motion details directly, and instead serve as a bridge between high-level instructions and low-level control.
To address this limitation, we propose Keyword Labeled Primitive Selection and Keypoint Pairs Generation Guided Movement Primitives (KeyMPs), a framework that combines VLMs with the sequencing of DMPs.
KeyMPs uses the VLM's high-level reasoning to select a reference primitive through Keyword Labeled Primitive Selection, and its spatial awareness to generate, through Keypoint Pairs Generation, the spatial scaling parameters that generalize the overall motion, ultimately enabling one-shot vision-language-guided motion generation that aligns with the intent expressed in the multimodal input.
We validate our approach on an occlusion-rich manipulation task, object cutting, in both simulated and real-world experiments, demonstrating superior performance over other DMP-based methods that integrate VLM support.
Method
Our proposed framework, KeyMPs, integrates language and vision inputs to generate executable motions by leveraging VLMs and DMPs. It operates through three stages:
- Pre-Processing:
  The framework begins by collecting two primary types of input:
  - Language input: User-provided instructions.
  - Vision input: Environment or object representations from a camera.

  An object detector identifies the object's global coordinates and produces a cropped object image, while the object's height measurement is supplied to the framework separately. The global coordinates are determined with a pixel-based object detection approach, although alternative object detection methods are also supported.
- Contextual processing:
  In keyword labeled primitive selection, the VLM employs high-level reasoning over the combined language and vision inputs to select a reference primitive from a predefined primitive dictionary.
  In keypoint pairs generation, the VLM uses the same combined inputs to generate 2D keypoint pairs that, together with the object's height, define the spatial scaling parameters (\( y_0 \) and \( y_{\text{goal}} \)); a sketch of this conversion follows the overview.
- DMPs motion generation:
  The reference primitive and the spatial scaling parameters are combined to produce the final DMP motion: the motion is generated by iteratively applying each spatial scaling parameter to scale the reference primitive (see the rollout sketch below).
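To make the contextual-processing output concrete, the sketch below shows one way the VLM's 2D keypoint pairs could be combined with the detector's global coordinates and the measured object height to form the spatial scaling parameters \( (y_0, y_{\text{goal}}) \) of each segment. It is a minimal illustration under stated assumptions, not our exact implementation: the pixel-to-meter scale, the top-down camera, the zero goal height, and all function names are hypothetical.

```python
# Minimal sketch (not the paper's exact implementation) of converting VLM
# keypoint pairs into spatial scaling parameters for the DMP segments.
# Assumptions: a calibrated top-down camera with a fixed pixel-to-meter scale,
# and a goal height of zero (e.g. the cutting-board surface).
import numpy as np

PIXELS_PER_METER = 1000.0  # assumed camera calibration constant

def keypoint_pairs_to_scaling(keypoint_pairs, object_xy, object_height):
    """Convert 2D keypoint pairs (pixels) into Cartesian (y0, y_goal) pairs.

    keypoint_pairs: list of ((u0, v0), (u1, v1)) pixel coordinates in the
                    cropped object image, as generated by the VLM.
    object_xy:      (x, y) global position of the object from the detector.
    object_height:  measured object height in meters.
    """
    scaling_params = []
    for (u0, v0), (u1, v1) in keypoint_pairs:
        # Pixel offsets converted to meters and shifted into the global frame
        # reported by the object detector.
        start_xy = np.asarray(object_xy) + np.array([u0, v0]) / PIXELS_PER_METER
        end_xy = np.asarray(object_xy) + np.array([u1, v1]) / PIXELS_PER_METER
        # The start pose sits at the object's height; the goal reaches z = 0.
        y0 = np.array([start_xy[0], start_xy[1], object_height])
        y_goal = np.array([end_xy[0], end_xy[1], 0.0])
        scaling_params.append((y0, y_goal))
    return scaling_params
```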
The rest of the details can be found in our research paper.
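As an illustration of the DMPs motion generation stage, the following sketch replays a single reference primitive once for every \( (y_0, y_{\text{goal}}) \) pair and concatenates the resulting segments. It follows the standard discrete DMP formulation rather than our exact parameterization; the gains, basis-function settings, and helper names are assumptions for this example.

```python
# Minimal discrete DMP sketch (standard Ijspeert-style formulation, not
# necessarily the paper's exact parameterization) showing how one reference
# primitive is scaled and replayed for every (y0, y_goal) pair.
import numpy as np

def rollout_dmp(w, y0, goal, n_basis=20, alpha_z=25.0, beta_z=6.25,
                alpha_x=8.0, tau=1.0, dt=0.01, n_steps=100):
    """Integrate one DMP segment from y0 to goal using forcing-term weights w.

    w: (dims, n_basis) forcing-term weights of the reference primitive.
    """
    dims = len(y0)
    c = np.exp(-alpha_x * np.linspace(0, 1, n_basis))   # basis centers
    h = n_basis / c                                      # basis widths
    y, z, x = np.array(y0, float), np.zeros(dims), 1.0
    traj = [y.copy()]
    for _ in range(n_steps):
        psi = np.exp(-h * (x - c) ** 2)
        # Forcing term scaled by the canonical state and by (goal - y0),
        # which is what lets one primitive generalize to new keypoint pairs.
        f = (psi @ w.T) / (psi.sum() + 1e-10) * x * (np.array(goal) - np.array(y0))
        dz = (alpha_z * (beta_z * (np.array(goal) - y) - z) + f) / tau
        dy = z / tau
        z, y = z + dz * dt, y + dy * dt
        x += (-alpha_x * x / tau) * dt
        traj.append(y.copy())
    return np.array(traj)

def generate_motion(reference_w, scaling_params):
    """Replay the reference primitive for every (y0, y_goal) pair in sequence."""
    return np.vstack([rollout_dmp(reference_w, y0, y_goal)
                      for y0, y_goal in scaling_params])
```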
Implementation
System Prompt
For this study, we employed GPT-4o, developed by OpenAI, as the foundational vision-language model (VLM). In our object-cutting experiment, we used the following system prompt to initialize the VLM.
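For reference, the sketch below shows one way a system prompt and the cropped object image can be supplied to GPT-4o through OpenAI's Python SDK; the SYSTEM_PROMPT placeholder, the file name, and the example instruction are illustrative and not the exact values used in our experiments.

```python
# Hedged sketch of calling GPT-4o with a system prompt plus a cropped object
# image via the OpenAI Python SDK; prompt text and file names are placeholders.
import base64
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

SYSTEM_PROMPT = "..."  # placeholder for the object-cutting system prompt

with open("cropped_object.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": [
            {"type": "text", "text": "Cut the cucumber into four equal pieces."},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
        ]},
    ],
)
print(response.choices[0].message.content)  # selected keyword and keypoint pairs
```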