
DAHLIA

Official implementation of paper "Data-Agnostic Robotic Long-Horizon Manipulation with Vision-Language-Guided Closed-Loop Feedback"

Yuan Meng¹, Xiangtong Yao¹, Haihui Ye¹, Yirui Zhou¹, Shengqiang Zhang²,

Zhenshan Bing³ †, Alois Knoll¹

¹ School of Computation, Information and Technology, Technical University of Munich, Germany
² Center for Information and Language Processing, Ludwig Maximilian University of Munich, Germany
³ State Key Laboratory for Novel Software Technology, Nanjing University, China
† Corresponding author: zhenshan.bing@tum.de

Abstract

Recent advances in language-conditioned robotic manipulation have leveraged imitation and reinforcement learning to enable robots to execute tasks from human commands. However, these approaches often suffer from limited generalization, adaptability, and the scarcity of large-scale specialized datasets—unlike data-rich fields such as computer vision—leading to challenges in handling complex long-horizon tasks. In this work, we introduce DAHLIA, a data-agnostic framework for language-conditioned long-horizon robotic manipulation that leverages large language models (LLMs) for real-time task planning and execution. Our framework features a dual-tunnel architecture, where a planner LLM decomposes tasks and generates executable plans, while a reporter LLM provides closed-loop feedback, ensuring adaptive re-planning and robust execution. Additionally, we incorporate temporal abstraction and chain-of-thought (CoT) reasoning to enhance inference efficiency and traceability. DAHLIA achieves superior generalization and adaptability across diverse, unstructured environments, demonstrating state-of-the-art performance in both simulated and real-world long-horizon tasks.

(Figure: DAHLIA framework overview)

1. Installation

1.1 Cloning

Go to the GitHub repository website and select 'Code' to get an HTTPS or SSH link to the repository. Clone the repository to your device, e.g.

git clone https://github.com/Ghiara/DAHLIA

Enter the root directory of this project on your device. The root directory contains this README file.

1.2 Build Environment

We recommend managing the Python environment with conda and suggest Miniconda as a lightweight installation.

You may also use other environment tools such as Python's venv; in that case, please refer to requirements.txt. In the following, we proceed with conda.

Create the environment using the conda command:

conda create -n dahlia python=3.9

This might take some time because it needs to download and install all required packages.

Activate the new environment (make sure no other environment is active) by running:

conda activate dahlia

1.3 Install Other Packages (optional)

NOTE: DAHLIA does not require training in principle. Install PyTorch only if you want to fine-tune the LLM or use CLIPort.

pip install -r requirements.txt
pip install torch==1.12.1+cu113 torchvision==0.13.1+cu113 torchaudio==0.12.1 --extra-index-url https://download.pytorch.org/whl/cu113
pip install pytorch-lightning==1.9.5
python setup.py develop

2. Run Code

2.1 Generate Task File (optional)

Generate new tasks if needed.

[reference number] sets how many existing task candidates are included in the prompt to the LLM.

For top-down generation, fill in [task name] and, optionally, [task instruction] to specify the desired task.

# bottom-up template generation
python gensim/run_simulation.py \
    prompt_folder=bottomup_task_generation_prompt \
    save_memory=True load_memory=True \
    task_description_candidate_num=[reference number] \
    use_template=True

# top-down task generation
python gensim/run_simulation.py \
    prompt_folder=topdown_task_generation_prompt \
    save_memory=True load_memory=True \
    task_description_candidate_num=[reference number] \
    use_template=True target_task_name=[task name] \
    target_task_description=[task instruction]

# task-conditioned chain-of-thought generation
python gensim/run_simulation.py \
    prompt_folder=topdown_chain_of_thought_prompt \
    save_memory=True load_memory=True \
    task_description_candidate_num=[reference number] \
    use_template=True target_task_name=[task name] \
    target_task_description=[task instruction]

2.2 Execute a Task

To directly execute a task (based on an existing task file):

Fill in [task name] to locate the task file.

python cliport/cap.py task=[task name] mode=cap check=False

2.3 Generate Test Task Dataset

DAHLIA uses the same dataset format as CLIPort.

Fill in [number of samples] to set how many episodes the task dataset should contain.

python cliport/demo.py n=[number of samples] \
    task=[task name] mode=test all_result=False

This will save the test dataset in the data folder.

2.4 Test the Execution

Randomly pick n episodes from the test dataset, execute and evaluate them, and finally report the success rate.

python cliport/dahlia_run.py task=[task name] mode=test check=False n=1

3. Prompting

The prompts for task generation can be found at /prompts/bottomup_task_generation_prompt_new/* and /prompts/topdown_task_generation_prompt/*.

The prompts for the DAHLIA role definitions can be found at /prompts/dahlia/*.

As described in the paper, the prompt of each LMP planner basically consists of the following parts:

  1. Library imports, including our predefined APIs and widely used third-party libraries (e.g., numpy), for example:
import numpy as np
from env_utils import get_obj_pos, get_obj_rot, parse_position
from utils import get_obj_positions_np, get_obj_rotations_np
from cliport.utils import utils
  2. Method explanations, which briefly introduce how our customized APIs can be used, for example:
# ---------------------------------------------------------------------------
# Existing Method Explanations
# ---------------------------------------------------------------------------
'''
get_obj_pos(obj) -> [list] # return a list of len(obj) of 3d position-vectors of obj, even when obj is just one object not a list of objects
get_obj_rot(obj) -> [list] # return a list of len(obj) of 4d quaternion orientation-vectors of obj, even when obj is just one object not a list of objects
get_obj_positions_np([obj]) -> [list] # return a list of len([obj]) of 3d position-vectors of obj in [obj]
...etc.
'''
  3. The coordinate system definition, which specifies how directions relate to the robot's view (a small offset sketch follows after this list), for example:
# ---------------------------------------------------------------------------
# Orientations in Coordinate System
# ---------------------------------------------------------------------------
'''
left: y-
right: y+
front: x+
rear: x-
top: z+
bottom: z-
top left: x-y-
bottom left: x+y-
top right: x-y+
bottom right: x+y+
'''
  4. The general requirements, which define how the LLM agents should act and respond, for example:
# ---------------------------------------------------------------------------
# General Requirements
# ---------------------------------------------------------------------------
'''
You are writing python code for object parsing, refer to the code style in examples below.
You can use the existing APIs above, you must NOT import other packages.
Our coordinate system is 3D cartesian system, but still pay attention to the orientations. 
Also pay attention to the return format requirements in descriptions for some tasks.
When you are not sure about positions, you had better use parse_position(), and clarify your return format demand.
'''
  5. Task plan examples, which help the agent adapt to task planning in a few-shot manner, for example:
# ---------------------------------------------------------------------------
# Task Examples
# ---------------------------------------------------------------------------

objects = ['blue block', 'cyan block', 'purple bowl', 'gray bowl', 'brown bowl', 'pink block', 'purple block']
# the block closest to the purple bowl.
block_names = ['blue block', 'cyan block', 'purple block']
block_positions = get_obj_positions_np(block_names)
closest_block_idx = get_closest_idx(points=block_positions, point=get_obj_pos('purple bowl')[0])
closest_block_name = block_names[closest_block_idx]
ret_val = closest_block_name

objects = ['brown bowl', 'banana with obj_id 1', 'brown block with obj_id 9', 'apple', 'blue bowl with obj_id 8', 'blue block with obj_id 3']
# the block, return result as list.
ret_val = ['brown block with obj_id 9', 'blue block with obj_id 3']

objects = ['brown bowl', 'banana with obj_id 1', 'brown block with obj_id 9', 'apple', 'blue bowl with obj_id 8', 'blue block with obj_id 3']
# the block color, return result as tuple.
ret_val = ('brown', 'blue')
...
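
As a rough illustration of the coordinate conventions in part 3, the basic direction names could be mapped to signed axis offsets as in the following sketch; the direction_to_offset helper is hypothetical and not part of the provided APIs:

import numpy as np

# Hypothetical helper (not part of the provided APIs): map a basic direction name
# from the coordinate convention above to a signed unit axis vector.
DIRECTION_TO_AXIS = {
    'left':   np.array([0., -1., 0.]),   # y-
    'right':  np.array([0.,  1., 0.]),   # y+
    'front':  np.array([1.,  0., 0.]),   # x+
    'rear':   np.array([-1., 0., 0.]),   # x-
    'top':    np.array([0.,  0., 1.]),   # z+
    'bottom': np.array([0.,  0., -1.]),  # z-
}

def direction_to_offset(position, direction, distance=0.1):
    # Shift a 3d position by `distance` along the named direction.
    return np.asarray(position) + distance * DIRECTION_TO_AXIS[direction]

The compound directions in the prompt (e.g., 'top left': x-y-) describe locations on the table plane and combine the listed x/y signs.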

Based on the prompts above, the agent builds a systematic planning mechanism following the idea of chain-of-thought.
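
For intuition, the parts above could be concatenated into a single planner query roughly as follows; the file names, prompt directory, and query_llm helper are placeholders, not the actual DAHLIA interface:

# Hypothetical sketch of assembling an LMP planner prompt from the parts described above.
# File names and `query_llm` are placeholders, not the actual DAHLIA interface.
from pathlib import Path

PROMPT_PARTS = [
    'imports.txt',               # 1. library and API imports
    'method_explanations.txt',   # 2. brief API documentation
    'coordinate_system.txt',     # 3. orientation conventions
    'general_requirements.txt',  # 4. role and response rules
    'task_examples.txt',         # 5. few-shot task plan examples
]

def build_planner_prompt(prompt_dir, instruction, objects):
    # Concatenate the static prompt sections, then append the current scene and instruction.
    sections = [Path(prompt_dir, name).read_text() for name in PROMPT_PARTS]
    context = f"objects = {objects}\n# {instruction}\n"
    return '\n'.join(sections) + '\n' + context

# Example usage (hypothetical):
# prompt = build_planner_prompt('prompts/dahlia', 'stack all blocks', ['blue block', 'red bowl'])
# plan_code = query_llm(prompt)  # the planner LLM returns executable Python; the reporter LLM
#                                # later checks the outcome and triggers re-planning if needed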


4. Acknowledgements

This project uses code or ideas from the following open-source projects and datasets:

GenSim

Origin: https://github.com/liruiw/GenSim
License: MIT

LoHoRavens

Origin: https://github.com/Shengqiang-Zhang/lohoravens
License: Apache 2.0

Code as Policies

Origin: https://github.com/google-research/google-research/tree/master/code_as_policies

CLIPort-batchify

Origin: https://github.com/ChenWu98/cliport-batchify

Google Ravens (TransporterNets)

Origin: https://github.com/google-research/ravens
License: Apache 2.0

OpenAI CLIP

Origin: https://github.com/openai/CLIP
License: MIT

Google Scanned Objects

Origin: Dataset
License: Creative Commons BY 4.0
