Towards Autonomous Reinforcement Learning for Real-World Robotic Manipulation with Large Language Models

¹Department of Information Engineering, Università degli Studi di Padova, Italy; ²ABB Corporate Research, Västerås, Sweden.

Abstract

Recent advancements in Large Language Models (LLMs) and Visual Language Models (VLMs) have significantly impacted robotics, enabling high-level semantic motion planning applications. Reinforcement Learning (RL), a complementary paradigm, enables agents to autonomously optimize complex behaviors through interaction and reward signals. However, designing effective reward functions for RL remains challenging, especially in real-world tasks where sparse rewards are insufficient and dense rewards require elaborate design. In this work, we propose Autonomous Reinforcement learning for Complex Human-Informed Environments (ARCHIE), an unsupervised pipeline that leverages GPT-4, a pre-trained LLM, to generate reward functions directly from natural language task descriptions. The rewards are used to train RL agents in simulated environments, and we formalize the reward generation process to enhance the feasibility of learning. Additionally, GPT-4 automates the coding of task success criteria, creating a fully automated, one-shot procedure for translating human-readable text into deployable robot skills. Our approach is validated through extensive simulated experiments on single-arm and bi-manual manipulation tasks using an ABB YuMi collaborative robot, highlighting its practicality and effectiveness. The learned tasks are also demonstrated on the real robot setup.

Video

ARCHIE

In this work, we propose Autonomous Reinforcement learning for Complex Human-Informed Environments (ARCHIE). ARCHIE is a practical, automatic RL pipeline for training autonomous agents on robotic manipulation tasks in an unsupervised manner. ARCHIE employs GPT-4, a popular pre-trained LLM, for reward generation from human prompts: natural language task descriptions are translated into reward functions by GPT-4, which are then used to train an RL agent in a simulated environment. Our approach introduces a formalization of the reward function that constrains the language model's code generation, enhancing the feasibility of learning the task on the first attempt. Unlike previous methods, we also use the language model to define the success criterion for each task, further automating the learning pipeline. Moreover, by formalizing the reward function into shaping and terminal terms, we avoid the need for reward reflection and multiple stages of RL training. The result is a streamlined, one-shot process that translates the user's text description into a deployable skill. A minimal sketch of this flow is given below.
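The sketch below illustrates the one-shot text-to-skill flow under simplifying assumptions: the prompt wording, helper names, and returned code format are illustrative placeholders rather than ARCHIE's actual prompts or interfaces; only the OpenAI chat completion call is a real API.

```python
# Illustrative sketch of a one-shot text-to-skill pipeline in the spirit of ARCHIE.
# Prompt wording and helper names are assumptions, not the paper's implementation.
from openai import OpenAI

client = OpenAI()

def generate_reward_and_success(task_description: str) -> str:
    """Ask the LLM for a reward function and a success predicate as Python code."""
    prompt = (
        "You are designing a reward for a robotic manipulation RL agent.\n"
        f"Task: {task_description}\n"
        "Return Python code defining two functions:\n"
        "  reward(state, action, next_state) -> float, split into a shaping term\n"
        "  and a terminal term;\n"
        "  success(state) -> bool, the task success criterion."
    )
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content  # generated code, to be sandboxed

# The generated reward and success check are compiled into the simulated environment,
# and a single RL agent (e.g. SAC) is trained once, with no reward reflection loop.
```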

Tuning Rewards

Designing reward functions for Reinforcement Learning (RL) agents is challenging due to numerical instabilities and misalignments. For example, in a pushing task where an agent moves an object to a target position, a simple reward function may combine a distance-based penalty, where $d$ is the distance between the object and the target, with a bonus $b$ for touching the object: $$r(s_t, a_t) = -d + \begin{cases} b & \text{if the agent is touching the object}\\ 0 & \text{otherwise} \end{cases}$$ However, tuning is crucial. In a 2D toy environment where the agent must move toward the origin, experiments with $b=10$ and $b=1$ show that a high $b$ creates a flat reward landscape, leading to poor learning, whereas $b=1$ results in a well-defined goal and better policy learning. Therefore, even when the correct reward terms are present, poorly tuned weights severely degrade the performance of the learned policy. The sketch below makes this effect concrete.
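As an illustration, the following sketch evaluates this reward over a grid of 2D positions. The details are assumed (the bonus is granted within a 0.5-radius "contact" region around the origin, and the grid extent is arbitrary); it is a toy reproduction of the effect, not the paper's experiment.

```python
# Toy sketch of the 2D example: reward = -distance to the origin, plus a bonus b
# inside an assumed 0.5-radius "contact" region. Radius and grid extent are arbitrary.
import numpy as np

def reward(pos: np.ndarray, b: float, contact_radius: float = 0.5) -> np.ndarray:
    """Distance penalty plus bonus b when within contact_radius of the origin."""
    d = np.linalg.norm(pos, axis=-1)
    return -d + b * (d < contact_radius)

xs = np.linspace(-2.0, 2.0, 201)
grid = np.stack(np.meshgrid(xs, xs), axis=-1)   # shape (201, 201, 2)

# With b = 10, the reward inside the contact region spans only [9.5, 10.0]: the jump
# at the region boundary dwarfs the remaining distance gradient, so the landscape
# near the goal is nearly flat relative to the bonus. With b = 1, the distance term
# remains a comparable signal that keeps pulling the agent toward the origin.
for b in (10.0, 1.0):
    r = reward(grid, b)
    inside = r[np.linalg.norm(grid, axis=-1) < 0.5]
    print(f"b={b}: reward inside contact region in [{inside.min():.2f}, {inside.max():.2f}]")
```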

Formalizing Rewards

We evaluated our reward formalization in simulated environments that mimic the dynamics, observation, and action spaces of robotic manipulation. We tested four tasks: grasp and lift, grasp and slide, placing, and pushing. We compared our method, ARCHIE, with rewards generated by the first step of Eureka (with GPT-4). We trained agents using 10 different generated rewards and monitored their success rates. The results show that agents trained with ARCHIE consistently completed the tasks, while those trained with GPT-4's unrestricted rewards were less reliable. This highlights the effectiveness of our reward formalization in stabilizing policy learning.

We report one example of a reward generated by ARCHIE for the Cube Push task in the 3D environment: $$ r(s_t, a_t, s_{t+1}) = -d + 10 \cdot \mathrm{contact} + x_t + \underbrace{R_F(s_t, a_t)\, \Phi(s_{t+1})}_{\text{terminal term}}$$ This is a typical example of a reward with the correct structure but unbalanced weights. The Python code that implements this reward can be found in the PDF appendix. In the figure on the right we show the learning curves of SAC agents trained under this reward, as well as under the same reward without the terminal term. All agents trained with the terminal term converge to successful solutions, while the others fail. Again, this highlights the benefits of the proposed formulation. A hedged sketch of how such a reward might look in code is given below.
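For readers without the appendix at hand, the following sketch shows how a reward with this structure might be written. The observation keys, the 2 cm success threshold, and the constant terminal bonus are assumptions, not the code generated by ARCHIE.

```python
# Hedged sketch of a reward with the Cube Push structure above: dense shaping terms
# (-d, contact bonus, x-progress) plus a sparse terminal term gated by success.
# Keys, threshold, and bonus magnitude are assumptions, not the generated code.
import numpy as np

def reward(state: dict, action: np.ndarray, next_state: dict) -> float:
    cube_pos = state["cube_position"]            # (x, y, z) of the cube
    target_pos = state["target_position"]        # desired cube position
    d = float(np.linalg.norm(cube_pos - target_pos))    # distance to target
    contact = float(state["gripper_touching_cube"])     # 1.0 if touching, else 0.0
    x_t = float(cube_pos[0])                     # progress of the cube along x

    shaping = -d + 10.0 * contact + x_t

    # Terminal term: a large bonus R_F granted only when the success predicate Phi
    # holds in the next state (here: cube within an assumed 2 cm of the target).
    next_d = float(np.linalg.norm(next_state["cube_position"] - target_pos))
    terminal = 100.0 if next_d < 0.02 else 0.0

    return shaping + terminal
```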

Robotics Tasks

We evaluated ARCHIE on 10 robotic manipulation tasks, including pushing, picking, insertion, and dual-arm operations, all shown in the videos. Using natural language descriptions, we generated rewards with ARCHIE and compared them to a baseline distance-based reward function. Each reward was used to train three SAC agents per task, and results were measured by success rate. Our findings show that ARCHIE consistently guided agents to complete the tasks with high success rates, while distance-based rewards only succeeded in a few cases. Additionally, GPT-4-generated code policies failed on all tasks. These results highlight the importance of structured reward formalization in leveraging LLM capabilities for RL. A sketch of the baseline reward is shown below for reference.
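For context, the baseline reward reduces to a plain distance penalty; the sketch below shows one plausible form, with observation keys assumed for illustration.

```python
# Plausible form of the distance-based baseline reward used for comparison.
# Observation keys are assumptions; the baseline simply penalizes the Euclidean
# distance between the manipulated object and its goal position.
import numpy as np

def baseline_distance_reward(state: dict) -> float:
    """Negative Euclidean distance between the object and its target position."""
    return -float(np.linalg.norm(state["object_position"] - state["target_position"]))
```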

BibTeX

@ARTICLE{turcato2025towardsARL,
  author={Turcato, Niccolò and Iovino, Matteo and Synodinos, Aris and Dalla Libera, Alberto and Carli, Ruggero and Falco, Pietro},
  journal={IEEE Robotics and Automation Letters}, 
  title={Towards Autonomous Reinforcement Learning for Real-World Robotic Manipulation with Large Language Models}, 
  year={2025},
  volume={},
  number={},
  pages={1-8},
  keywords={Robots;Training;Reinforcement learning;Codes;Visualization;Pipelines;Service robots;Semantics;Python;Planning;Reinforcement Learning;Deep Learning Methods;AI-Based Methods},
  doi={10.1109/LRA.2025.3589162}
}