We train robots to solve general tasks using only images. We show the robot an image of the desired goal configuration, and it learns to reach that goal using only images of the environment.
We train an AI agent to control the robot using deep reinforcement learning (RL). To apply RL to this task, the agent needs a signal that tells it how close it is to the goal, and that signal must be computed from images alone. This requires a distance between the current image and the goal image. We demonstrate that simple methods are sufficient to solve a variety of robotic manipulation tasks directly from images, both in simulation and on a physical Kinova Gen 3 arm.
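
To make the idea of an image-based distance concrete, here is a minimal sketch of one possible choice: the negative mean squared pixel difference between the current image and the goal image, used as an RL reward. The text above does not say which distance is used, so the function name `image_distance_reward`, the pixel-space metric, and the 64x64 RGB image size are all assumptions made purely for illustration.

```python
import numpy as np


def image_distance_reward(observation: np.ndarray, goal: np.ndarray) -> float:
    """Hypothetical reward: negative mean squared pixel distance to the goal image.

    Assumes both images are uint8 arrays of identical shape (H, W, C).
    The reward is 0.0 when the images match and decreases as they differ.
    """
    if observation.shape != goal.shape:
        raise ValueError("observation and goal must have the same shape")
    # Scale pixels to [0, 1] so the reward range does not depend on bit depth.
    obs = observation.astype(np.float32) / 255.0
    tgt = goal.astype(np.float32) / 255.0
    return -float(np.mean((obs - tgt) ** 2))


# Toy images standing in for camera frames (assumed 64x64 RGB).
goal = np.zeros((64, 64, 3), dtype=np.uint8)
near = np.full((64, 64, 3), 51, dtype=np.uint8)   # slightly different from the goal
far = np.full((64, 64, 3), 255, dtype=np.uint8)   # maximally different from the goal

reward_at_goal = image_distance_reward(goal, goal)
reward_near = image_distance_reward(near, goal)
reward_far = image_distance_reward(far, goal)
```

In this sketch an image closer to the goal receives a higher reward, which is the property the agent needs to make progress. Raw pixel distance is only one candidate; a distance computed in a learned feature space would slot into the same interface.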