Primitive Skill-Based Robot Learning from Human Evaluative Feedback

IROS 2023

Department of Mechanical Engineering, Department of Computer Science, Institute for Human-Centered Artificial Intelligence (HAI)
Stanford University
Equal contribution, alphabetically ordered

Evaluation without Execution: SEED enables safe and sample-efficient learning in the real world by leveraging a skill-based action space. With this intuitive representation, humans can evaluate robot actions even before they are executed. Here, the next goal is to pick up the broom; the human evaluates the robot's action choice purely from the skill and parameter visualizations, without needing to watch the robot act.
Sample rollouts on long-horizon, real-world tasks. The trained SEED agent successfully completes the task without human guidance and is able to recover from failed subgoals (e.g., an unsuccessful pick of the sausage or push of the drawer).

Abstract

Reinforcement learning (RL) algorithms face significant challenges when dealing with long-horizon robot manipulation tasks in real-world environments due to sample inefficiency and safety issues. To overcome these challenges, we propose a novel framework, SEED, which leverages two approaches: reinforcement learning from human feedback (RLHF) and primitive skill-based reinforcement learning. Both approaches are particularly effective in addressing sparse-reward issues and the complexities of long-horizon tasks. By combining them, SEED reduces the human effort required in RLHF and increases the safety of training robot manipulation with RL in real-world settings. Additionally, parameterized skills provide a clear view of the agent's high-level intentions, allowing humans to evaluate skill choices before they are executed. This makes the training process even safer and more efficient. To evaluate SEED, we conducted extensive experiments on five manipulation tasks with varying levels of complexity. Our results show that SEED significantly outperforms state-of-the-art RL algorithms in sample efficiency and safety, and also requires substantially less human effort than other RLHF methods.

Overview


SEED integrates two approaches: (1) learning from human evaluative feedback and (2) primitive skill-based motion control. By breaking down long-horizon tasks into sequences of primitive skills, evaluative feedback can provide dense training signals, making long-horizon tasks with sparse rewards more tractable. During training, the robot proposes an action in the form of a skill and its parameters. The human trainer evaluates the proposed action, and the robot learns to maximize the positive evaluation it receives. Once the human is confident that the robot makes good choices, they let the robot execute the action.
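The interaction loop described above can be summarized in a short sketch. This is a hedged illustration, not the authors' implementation; names such as propose_action, evaluate, and allows_execution are hypothetical placeholders.

```python
# Minimal sketch of SEED's evaluate-before-execute loop (illustrative names).
def training_step(agent, env, obs, human):
    # The agent proposes a skill and its parameters from the current observation.
    skill, params = agent.propose_action(obs)

    # The human inspects a visualization of the proposed skill and parameters
    # and returns +1 ("good") or -1 ("bad") before anything is executed.
    label = human.evaluate(obs, skill, params)

    # The evaluation itself serves as the reward; no environment reward is used.
    agent.replay_buffer.add((obs, skill, params), label)
    agent.update()

    # Execution is gated: the action runs on the robot only if the human
    # approves it and is confident enough to allow execution.
    if label > 0 and human.allows_execution():
        obs = env.execute(skill, params)
    return obs
```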

Method

Parameterized Primitive Skills

Pick (x, y, z) · Place (x, y, z) · Push (x, y, z, d)


We equip the robot with a set of parameterized primitive skills. This allows the control policy to focus on learning skill and parameter selection without the burden of learning low-level motor control.
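As an illustration, each skill can be represented as a discrete choice paired with a small set of continuous parameters, matching the Pick/Place/Push signatures above. The sketch below is an assumption about one possible encoding; the low-level controllers behind each skill are not shown.

```python
from dataclasses import dataclass

# Illustrative parameterization of the three primitive skills (Pick, Place, Push).
@dataclass
class Pick:
    x: float  # grasp position in the robot's workspace
    y: float
    z: float

@dataclass
class Place:
    x: float  # release position
    y: float
    z: float

@dataclass
class Push:
    x: float  # contact position
    y: float
    z: float
    d: float  # push distance along the pushing direction

# The agent's action space: a discrete skill choice plus its continuous parameters.
SKILLS = (Pick, Place, Push)
```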

SEED Model


  • Hierarchical framework: a skill policy first selects the skill, then a parameter policy selects its parameters.
  • One parameter policy per skill: only the parameter policy corresponding to the selected skill is invoked.
  • Human evaluation as the reward signal: discrete human evaluation replaces the environment reward.
  • Balanced replay buffer: each training batch samples an equal number of "good" and "bad" transitions to compensate for the "bad" actions that dominate the replay buffer early in training (see the sketch after this list).
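Below is a minimal sketch of the balanced replay buffer from the last bullet, assuming binary +1/-1 human labels; the class and method names are illustrative, not the authors' code.

```python
import random

class BalancedReplayBuffer:
    """Stores transitions by human label and samples balanced batches."""

    def __init__(self):
        self.good, self.bad = [], []

    def add(self, transition, label):
        # Route each transition by its human evaluation (+1 good, -1 bad).
        (self.good if label > 0 else self.bad).append(transition)

    def sample(self, batch_size):
        # Draw (roughly) half the batch from each pool so that early "bad"
        # experience does not dominate the training signal.
        half = batch_size // 2
        k_good = min(half, len(self.good))
        k_bad = min(batch_size - k_good, len(self.bad))
        return random.sample(self.good, k_good) + random.sample(self.bad, k_bad)
```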

Results


The SEED agent learns to complete real-world tasks with less human feedback. SEED and TAMER are compared in terms of the amount of human effort required. On all tasks, SEED efficiently learns to complete the task within the given feedback budget, while TAMER fails to do so. Additionally, the higher performance in the second run of each experiment demonstrates that human trainers quickly learn to provide better feedback.