Reinforcement learning (RL) is the branch of machine learning concerned with training agents to take sequences of actions that maximize cumulative reward. It is behind some of the most impressive AI achievements: AlphaGo defeating Go world champion Lee Sedol, OpenAI Five beating the reigning Dota 2 world champions, and robots learning to walk and manipulate objects. Through reinforcement learning from human feedback, it is also behind ChatGPT's ability to be helpful rather than merely coherent. This guide demystifies RL for software developers without drowning you in mathematical notation.
The Core Idea
In supervised learning, you provide labeled examples (input, correct output) and the model learns to predict the correct output for new inputs. The signal is direct and immediate: the model was right or wrong on this specific example.
In reinforcement learning, there are no labeled examples. Instead:
- An agent (the learning system) takes actions in an environment
- The environment returns a reward (positive for good outcomes, negative for bad ones) and a new state
- The agent's goal is to learn a policy (a mapping from states to actions) that maximizes total reward over time
The signal is indirect and delayed: the agent does not know which specific action was responsible for a reward it received three steps later. This is the credit assignment problem, and it is what makes RL hard.
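The agent-environment loop described above can be sketched in a few lines of plain Python. This is a minimal illustration, not a real benchmark: `CorridorEnv`, `run_episode`, and `random_policy` are hypothetical names invented for this example, and the reward is deliberately paid only at the very end so the delayed-signal problem is visible.

```python
import random


class CorridorEnv:
    """Hypothetical toy environment: walk right along a corridor to reach the goal."""

    def __init__(self, length=5):
        self.length = length
        self.position = 0

    def reset(self):
        self.position = 0
        return self.position  # the state is just the cell index

    def step(self, action):
        # Action 0 moves left, action 1 moves right; walls clamp the position
        move = 1 if action == 1 else -1
        self.position = max(0, min(self.length - 1, self.position + move))
        done = self.position == self.length - 1
        reward = 1.0 if done else 0.0  # reward arrives only at the goal
        return self.position, reward, done


def run_episode(env, policy, max_steps=100):
    """Run one episode and return the list of (state, action, reward) steps."""
    state = env.reset()
    trajectory = []
    for _ in range(max_steps):
        action = policy(state)
        next_state, reward, done = env.step(action)
        trajectory.append((state, action, reward))
        state = next_state
        if done:
            break
    return trajectory


def random_policy(state):
    # An untrained agent: ignores the state and picks an action at random
    return random.choice([0, 1])


random.seed(0)
trajectory = run_episode(CorridorEnv(), random_policy)
total_reward = sum(reward for _, _, reward in trajectory)
print(f"steps: {len(trajectory)}, total reward: {total_reward}")
```

Every step before the last one receives a reward of zero, so nothing in the reward alone tells the agent which of its earlier moves helped. That is the credit assignment problem in miniature.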
A concrete example: training an RL agent to play Atari Breakout. The state is the raw pixel image of the screen. The actions are move left, move right, fire (which launches the ball), and do nothing. The reward is the score increase when a brick is broken. The agent does not know that bouncing the ball at a specific angle toward a cluster of bricks is a good strategy; it must discover this by playing through millions of frames and observing which sequences of actions tend to lead to higher scores.
Core RL Algorithms (Conceptually)
Q-Learning / DQN - the agent learns the Q-function: the expected total (usually discounted) future reward of taking action A in state S and acting well from then on. Once the Q-function is accurate, the best policy is simply to take the action with the highest Q-value in every state. DQN (Deep Q-Network) approximates the Q-function with a neural network and was the first algorithm to learn many Atari games at roughly human level directly from pixels.
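To make the Q-function concrete, here is a minimal sketch of tabular Q-learning, with a lookup table in place of DQN's neural network. The five-cell corridor environment, the `step` and `train` helpers, and the values of `ALPHA`, `GAMMA`, and `EPSILON` are all assumptions chosen for illustration.

```python
import random

N_STATES = 5          # corridor cells 0..4; cell 4 is the goal
ACTIONS = [0, 1]      # 0 = left, 1 = right
ALPHA = 0.5           # learning rate (assumed value)
GAMMA = 0.9           # discount factor (assumed value)
EPSILON = 0.2         # exploration rate (assumed value)


def step(state, action):
    """Hypothetical toy dynamics: reward 1.0 only on reaching the goal cell."""
    move = 1 if action == 1 else -1
    next_state = max(0, min(N_STATES - 1, state + move))
    done = next_state == N_STATES - 1
    return next_state, (1.0 if done else 0.0), done


def train(episodes=500, seed=0):
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]  # q[state][action]
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            # Epsilon-greedy: mostly exploit the table, sometimes explore
            if rng.random() < EPSILON:
                action = rng.choice(ACTIONS)
            else:
                best = max(q[state])
                action = rng.choice([a for a in ACTIONS if q[state][a] == best])
            next_state, reward, done = step(state, action)
            # Move Q(s, a) toward the reward plus the discounted best next value
            target = reward if done else reward + GAMMA * max(q[next_state])
            q[state][action] += ALPHA * (target - q[state][action])
            state = next_state
    return q


q_table = train()
# The learned policy: in each non-goal state, pick the action with the highest Q-value
greedy_policy = [max(ACTIONS, key=lambda a: q_table[s][a]) for s in range(N_STATES - 1)]
print(greedy_policy)
```

The reward is paid only at the goal, but the `max(q[next_state])` term lets that value flow backwards one cell at a time, so states far from the goal eventually learn that moving right is worthwhile.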
Policy Gradient (REINFORCE, PPO) - instead of learning a value function and deriving a policy from it, learn the policy (the mapping from states to actions) directly by gradient ascent on expected reward. PPO (Proximal Policy Optimization) is one of the most widely used policy gradient algorithms in practice because of its stability and relative simplicity. It is the RL algorithm at the core of many RLHF (reinforcement learning from human feedback) implementations.
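As a rough illustration of the policy gradient idea, the sketch below implements a bare-bones REINFORCE (not PPO, which adds clipping and other machinery) on a hypothetical five-cell corridor. The environment, the one-logit-per-state policy, and the values of `GAMMA` and `LEARNING_RATE` are assumptions made for this example.

```python
import math
import random

N_STATES = 5          # corridor cells 0..4; cell 4 is the goal
GAMMA = 0.9           # discount factor (assumed value)
LEARNING_RATE = 0.1   # step size (assumed value)


def step(state, action):
    """Hypothetical toy dynamics: reward 1.0 only on reaching the goal cell."""
    move = 1 if action == 1 else -1
    next_state = max(0, min(N_STATES - 1, state + move))
    done = next_state == N_STATES - 1
    return next_state, (1.0 if done else 0.0), done


def prob_right(theta, state):
    """The policy: probability of moving right, from one logit per state."""
    return 1.0 / (1.0 + math.exp(-theta[state]))


def train(episodes=5000, seed=0, max_steps=200):
    rng = random.Random(seed)
    theta = [0.0] * N_STATES  # policy parameters, initially a coin flip everywhere
    for _ in range(episodes):
        # 1. Sample one episode from the current policy
        state, done, episode = 0, False, []
        while not done and len(episode) < max_steps:
            action = 1 if rng.random() < prob_right(theta, state) else 0
            next_state, reward, done = step(state, action)
            episode.append((state, action, reward))
            state = next_state
        # 2. Work backwards, accumulating the discounted return from each step
        grads = [0.0] * N_STATES
        ret = 0.0
        for s, a, r in reversed(episode):
            ret = r + GAMMA * ret
            # Gradient of log pi(a | s) with respect to the logit is (a - p_right)
            grads[s] += ret * (a - prob_right(theta, s))
        # 3. Gradient ascent: make actions followed by high returns more likely
        for s in range(N_STATES):
            theta[s] += LEARNING_RATE * grads[s]
    return theta


theta = train()
print([round(prob_right(theta, s), 2) for s in range(N_STATES - 1)])
```

No value table is learned anywhere here. The policy parameters are adjusted directly, with each action's probability pushed up in proportion to the return that followed it.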
Actor-Critic (A3C, SAC) - combines policy learning (the actor) with value estimation (the critic). The critic estimates how good the current state, or a state-action pair, is, which gives the actor a lower-variance training signal than raw episode returns. SAC (Soft Actor-Critic) is particularly effective for continuous action spaces, such as robotic control.
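The actor-critic split can be sketched with a simple one-step, tabular variant. This is far simpler than A3C or SAC and is meant only to show the two roles. The corridor environment and the values of `GAMMA`, `ACTOR_LR`, and `CRITIC_LR` are assumptions for this example.

```python
import math
import random

N_STATES = 5      # corridor cells 0..4; cell 4 is the goal
GAMMA = 0.9       # discount factor (assumed value)
ACTOR_LR = 0.1    # actor step size (assumed value)
CRITIC_LR = 0.2   # critic step size (assumed value)


def step(state, action):
    """Hypothetical toy dynamics: reward 1.0 only on reaching the goal cell."""
    move = 1 if action == 1 else -1
    next_state = max(0, min(N_STATES - 1, state + move))
    done = next_state == N_STATES - 1
    return next_state, (1.0 if done else 0.0), done


def prob_right(theta, state):
    """The actor's policy: probability of moving right in a given state."""
    return 1.0 / (1.0 + math.exp(-theta[state]))


def train(episodes=3000, seed=0, max_steps=200):
    rng = random.Random(seed)
    theta = [0.0] * N_STATES  # actor: one logit per state
    value = [0.0] * N_STATES  # critic: estimated value of each state
    for _ in range(episodes):
        state, done, t = 0, False, 0
        while not done and t < max_steps:
            p = prob_right(theta, state)
            action = 1 if rng.random() < p else 0
            next_state, reward, done = step(state, action)
            # Critic: the TD error says whether the outcome beat expectations
            bootstrap = 0.0 if done else GAMMA * value[next_state]
            td_error = reward + bootstrap - value[state]
            value[state] += CRITIC_LR * td_error
            # Actor: make the action more likely if the TD error was positive
            theta[state] += ACTOR_LR * td_error * (action - p)
            state = next_state
            t += 1
    return theta, value


theta, value = train()
print([round(prob_right(theta, s), 2) for s in range(N_STATES - 1)])
print([round(v, 2) for v in value[:-1]])
```

Unlike a pure policy gradient method, the actor is updated after every step rather than at the end of an episode, using the critic's TD error in place of the full episode return.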
You do not need to implement these algorithms from scratch. Libraries like Stable Baselines 3 provide reliable, well-tested implementations:
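import gymnasium as gym
from stable_baselines3 import PPO

# Train a PPO agent with a small fully connected (MLP) policy on CartPole
env = gym.make("CartPole-v1")
model = PPO("MlpPolicy", env, verbose=1)
model.learn(total_timesteps=100_000)

# Run the trained policy for 1,000 steps, resetting whenever an episode ends
obs, _ = env.reset()
for _ in range(1000):
    action, _ = model.predict(obs, deterministic=True)
    obs, reward, terminated, truncated, info = env.step(action)
    if terminated or truncated:
        obs, _ = env.reset()
env.close()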