Towards Autonomous Reinforcement Learning for Real-World Robotic Manipulation with Large Language Models

1Department of Information Engineering, Università degli Studi di Padova, Italy; 2ABB Corporate Research, Västerås, Sweden.

Abstract

Recent advancements in Large Language Models (LLMs) and Visual Language Models (VLMs) have significantly impacted robotics, enabling high-level semantic motion planning applications. Reinforcement Learning (RL), a complementary paradigm, enables agents to autonomously optimize complex behaviors through interaction and reward signals. However, designing effective reward functions for RL remains challenging, especially in real-world tasks where sparse rewards are insufficient and dense rewards require elaborate design. In this work, we propose Autonomous Reinforcement learning for Complex Human-Informed Environments (ARCHIE), an unsupervised pipeline leveraging GPT-4, a pre-trained LLM, to generate reward functions directly from natural language task descriptions. The rewards are used to train RL agents in simulated environments, where we formalize the reward generation process to enhance feasibility. Additionally, GPT-4 automates the coding of task success criteria, creating a fully automated, one-shot procedure for translating human-readable text into deployable robot skills. Our approach is validated through extensive simulated experiments on single-arm and bi-manual manipulation tasks using an ABB YuMi collaborative robot, highlighting its practicality and effectiveness. The learned tasks are also demonstrated on the real robot setup.

Video

ARCHIE

In this work, we propose Autonomous Reinforcement learning for Complex Human-Informed Environments (ARCHIE). ARCHIE is a practical automatic RL pipeline for training autonomous agents on robotic manipulation tasks in an unsupervised manner. ARCHIE employs GPT-4, a popular pretrained LLM, for reward generation from human prompts. We leverage natural language descriptions to generate reward functions via GPT-4, which are then used to train an RL agent in a simulated environment. Our approach introduces a formalization of the reward function that constrains the language model's code generation, improving the feasibility of learning the task on the first attempt. Unlike previous methods, we also use the language model to define the success criteria for each task, further automating the learning pipeline. Moreover, by properly formalizing the reward functions into shaping and terminal terms, we avoid the need for reward reflection and multiple stages of RL training. The result is a streamlined, one-shot process that translates the user's text descriptions into deployable skills.
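To make this structure concrete, the sketch below shows one plausible shape of a generated reward, split into a dense shaping term and a sparse terminal term tied to an LLM-generated success predicate. The state layout, function names, and tolerance are illustrative assumptions, not the code actually produced by the pipeline.

import numpy as np

# Illustrative state layout (an assumption): [gripper_xyz, object_xyz, target_xyz].
def shaping_reward(state: np.ndarray) -> float:
    """Dense shaping term: stay close to the object and bring it to the target."""
    gripper, obj, target = state[0:3], state[3:6], state[6:9]
    return -float(np.linalg.norm(gripper - obj)) - float(np.linalg.norm(obj - target))

def task_success(state: np.ndarray, tol: float = 0.02) -> bool:
    """Success predicate, also generated by the LLM from the task description."""
    obj, target = state[3:6], state[6:9]
    return bool(np.linalg.norm(obj - target) < tol)

def reward(state: np.ndarray, terminal_bonus: float = 1.0) -> float:
    """Shaping term plus a sparse terminal term granted only when the task succeeds."""
    r = shaping_reward(state)
    if task_success(state):
        r += terminal_bonus
    return r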

Tuning Rewards

Designing reward functions for Reinforcement Learning (RL) agents is challenging due to numerical instabilities and misalignment between the reward and the intended task. For example, in a pushing task where an agent must move an object to a target position, a simple reward function may combine a penalty on the distance d to the target with a bonus b for touching the object: $$r(s_t, a_t) = -d + \begin{cases} b & \text{if the agent is touching the object}\\ 0 & \text{otherwise} \end{cases}$$ However, tuning is crucial. In a 2D environment where the agent moves toward the origin, experiments with b=10 and b=1 show that a high b creates a flat reward landscape, leading to poor learning, whereas b=1 results in a well-defined goal and better policy learning. Therefore, even when the correct reward terms are present, poorly tuned weights severely degrade the performance of the learned policy.
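The toy computation below reproduces this effect for the 2D setup described above; the specific positions are arbitrary and chosen only to show how a large bonus can drown out the distance term.

import numpy as np

def reward(pos: np.ndarray, touching: bool, b: float) -> float:
    """Distance penalty to the origin plus a contact bonus b, as in the example above."""
    return -float(np.linalg.norm(pos)) + (b if touching else 0.0)

# With b = 10, touching the object far from the goal already outscores being
# near the goal without contact, so the landscape is nearly flat with respect
# to the goal. With b = 1, the distance term still shapes the behaviour.
for b in (10.0, 1.0):
    touch_far = reward(np.array([5.0, 0.0]), touching=True, b=b)
    near_no_touch = reward(np.array([0.5, 0.0]), touching=False, b=b)
    print(f"b={b:>4}: touching far from goal -> {touch_far:6.2f}, "
          f"near goal without contact -> {near_no_touch:6.2f}")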

Formalizing Rewards

We evaluated our reward formalization in a 2D simulation that mimics robotic manipulation. The agent controls a point on a vertical plane and can grasp and move a rectangular object. We tested three tasks: grasp and lift, grasp and slide, and placing. To compare ARCHIE with unconstrained GPT-4-generated rewards, we trained agents using 10 different generated rewards and monitored their success rates. Results show that agents trained with ARCHIE consistently completed the tasks, while those trained with GPT-4's unrestricted rewards were less reliable. This highlights the effectiveness of our reward formalization in stabilizing policy learning.
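A minimal sketch of the success-rate bookkeeping used in such a comparison might look as follows; the predicate and final states are placeholders, not the actual evaluation code.

from typing import Callable, Sequence
import numpy as np

def success_rate(final_states: Sequence[np.ndarray],
                 success_fn: Callable[[np.ndarray], bool]) -> float:
    """Fraction of evaluation rollouts whose final state satisfies the success predicate."""
    return float(np.mean([success_fn(s) for s in final_states]))

# Placeholder usage: the object counts as placed if within 2 cm of the target.
states = [np.array([0.01, 0.0, 0.0]), np.array([0.10, 0.0, 0.0])]
print(success_rate(states, lambda s: bool(np.linalg.norm(s) < 0.02)))  # 0.5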

Robotics Tasks

We evaluated ARCHIE on 10 robotic manipulation tasks, including pushing, picking, insertion, and dual-arm operations, all shown in the videos. Using natural language descriptions, we generated rewards with ARCHIE and compared them to a baseline distance-based reward function. Each reward was used to train three SAC (Soft Actor-Critic) agents per task, and performance was measured by success rate. Our findings show that ARCHIE consistently guided agents to complete the tasks with high success rates, while distance-based rewards only succeeded in a few cases. In addition, policies coded directly by GPT-4 failed all tasks. These results highlight the importance of structured reward formalization in leveraging LLM capabilities for RL.
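As a rough sketch of how a generated reward and success criterion could plug into an off-the-shelf SAC training loop, the snippet below uses Gymnasium and Stable-Baselines3 as stand-ins for our YuMi simulation; the environment, reward, and success predicate are placeholders, not the paper's setup.

import gymnasium as gym
import numpy as np
from stable_baselines3 import SAC

class GeneratedRewardWrapper(gym.Wrapper):
    """Replaces the environment reward with an LLM-generated reward/success pair."""
    def __init__(self, env, reward_fn, success_fn):
        super().__init__(env)
        self.reward_fn, self.success_fn = reward_fn, success_fn

    def step(self, action):
        obs, _, terminated, truncated, info = self.env.step(action)
        reward = self.reward_fn(obs)
        if self.success_fn(obs):  # generated success criterion ends the episode
            terminated, info["is_success"] = True, True
        return obs, reward, terminated, truncated, info

# Stand-in task: swing the pendulum upright; reward and success are placeholders.
env = GeneratedRewardWrapper(gym.make("Pendulum-v1"),
                             reward_fn=lambda o: -abs(float(np.arctan2(o[1], o[0]))),
                             success_fn=lambda o: abs(float(np.arctan2(o[1], o[0]))) < 0.05)
model = SAC("MlpPolicy", env, verbose=0)
model.learn(total_timesteps=10_000)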

BibTeX

@article{turcato2025towardsARL,
      title={Towards Autonomous Reinforcement Learning for Real-World Robotic Manipulation with Large Language Models}, 
      author={Niccolò Turcato and Matteo Iovino and Aris Synodinos and Alberto Dalla Libera and Ruggero Carli and Pietro Falco},
      year={2025},
      journal={arXiv preprint arXiv:2503.04280},
      url={https://arxiv.org/abs/2503.04280}, 
}