Reinforcement learning (RL) algorithms face significant challenges when dealing with long-horizon robot manipulation tasks in real-world environments due to sample inefficiency and safety issues. To overcome these challenges, we propose a novel framework, SEED, which leverages two approaches: reinforcement learning from human feedback (RLHF) and primitive skill-based reinforcement learning. Both approaches are particularly effective in addressing sparse-reward issues and the complexities of long-horizon tasks. By combining them, SEED reduces the human effort required in RLHF and increases safety when training robot manipulation policies with RL in real-world settings. Additionally, parameterized skills provide a clear view of the agent's high-level intentions, allowing humans to evaluate skill choices before they are executed, which makes the training process safer and more efficient. To evaluate the performance of SEED, we conducted extensive experiments on five manipulation tasks with varying levels of complexity. Our results show that SEED significantly outperforms state-of-the-art RL algorithms in sample efficiency and safety, and it also requires substantially less human effort than other RLHF methods.
SEED integrates two approaches: (1) learning from human evaluative feedback and (2) primitive skill-based motion control. By decomposing long-horizon tasks into sequences of primitive skills, SEED allows evaluative feedback to provide a dense training signal, making long-horizon tasks with sparse rewards more tractable. During training, the robot proposes an action by selecting a skill and its parameters. The human trainer then evaluates the proposed action, and the robot learns to maximize the positive evaluations it receives. Once the trainer is confident that the robot makes good choices, they let the robot execute the action.
The primitive skill library consists of three parameterized skills: Pick(x, y, z), Place(x, y, z), and Push(x, y, z, d).
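To make the propose-evaluate-execute loop concrete, the following is a minimal Python sketch of one SEED-style interaction step, assuming a discrete set of candidate skill instantiations and binary (+1/-1) human feedback. The names here (Skill, seed_training_step, human_feedback, etc.) are illustrative assumptions, not the actual implementation; in the full framework, skill parameters are continuous and chosen by a learned policy rather than enumerated.

```python
import random
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# A parameterized primitive skill, mirroring the library above
# (e.g. name="pick" with params=(x, y, z), or name="push" with (x, y, z, d)).
@dataclass(frozen=True)
class Skill:
    name: str
    params: Tuple[float, ...]

def seed_training_step(
    candidates: List[Skill],
    q_values: Dict[Skill, float],
    human_feedback: Callable[[Skill], int],        # +1 approve, -1 reject
    human_allows_execution: Callable[[Skill], bool],
    execute: Callable[[Skill], None],
    epsilon: float = 0.1,
    lr: float = 0.5,
) -> None:
    """One SEED-style interaction step (illustrative sketch only)."""
    # 1. The robot proposes a skill and its parameters
    #    (here: epsilon-greedy over learned feedback values).
    if random.random() < epsilon:
        proposal = random.choice(candidates)
    else:
        proposal = max(candidates, key=lambda s: q_values.get(s, 0.0))

    # 2. The human trainer evaluates the proposal *before* it is executed.
    feedback = human_feedback(proposal)

    # 3. The agent updates toward maximizing positive evaluation.
    old = q_values.get(proposal, 0.0)
    q_values[proposal] = old + lr * (feedback - old)

    # 4. The skill runs on the robot only when the trainer is confident in it.
    if feedback > 0 and human_allows_execution(proposal):
        execute(proposal)
```

Because the human evaluates the high-level skill choice before it reaches the robot, unsafe proposals can be rejected without ever being executed, which is the source of the safety benefit described above.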
The SEED agent learns to complete real-world tasks with less human feedback. We compare SEED and TAMER in terms of the amount of human effort required. Across all tasks, SEED efficiently learns to complete the task within the given feedback budget, while TAMER fails to do so. Additionally, the higher performance in the second run of each experiment shows that human trainers quickly learn to provide better feedback.