Reinforcement learning is the branch of machine learning concerned with training agents to take sequences of actions in order to maximize cumulative reward. It is behind some of the most impressive AI achievements — AlphaGo defeating the world Go champion, OpenAI Five beating the world's best Dota 2 players, robots learning to walk and manipulate objects. It is also behind ChatGPT's ability to be helpful rather than merely coherent. This guide demystifies RL for software developers without drowning you in mathematical notation.
The Core Idea
In supervised learning, you provide labeled examples (input, correct output) and the model learns to predict the correct output for new inputs. The signal is direct and immediate: the model was right or wrong on this specific example.
In reinforcement learning, there are no labeled examples. Instead:
- An agent (the learning system) takes actions in an environment
- The environment returns a reward (positive for good outcomes, negative for bad ones) and a new state
- The agent's goal is to learn a policy (a mapping from states to actions) that maximizes total reward over time
The signal is indirect and delayed: the agent does not know which specific action was responsible for a reward it received three steps later. This is the credit assignment problem, and it is what makes RL hard.
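The interaction loop and the delayed signal can be sketched in a few lines of plain Python. The `Corridor` environment, its reward values, and the random policy below are hypothetical, made up purely for illustration. The sketch only shows the shape of the loop and how a reward that arrives at the end of an episode is spread backward over earlier steps as a discounted return.

```python
import random


class Corridor:
    """Hypothetical toy environment: walk from cell 0 to the last cell."""

    def __init__(self, length=5):
        self.length = length
        self.position = 0

    def reset(self):
        self.position = 0
        return self.position

    def step(self, action):
        # action is -1 (left) or +1 (right); walls clamp the position.
        self.position = max(0, min(self.length - 1, self.position + action))
        done = self.position == self.length - 1
        reward = 1.0 if done else 0.0  # reward arrives only at the goal
        return self.position, reward, done


def run_episode(env, rng, max_steps=200):
    """Play one episode with a uniformly random policy; collect rewards."""
    env.reset()
    rewards = []
    for _ in range(max_steps):
        action = rng.choice([-1, 1])
        _, reward, done = env.step(action)
        rewards.append(reward)
        if done:
            break
    return rewards


def discounted_returns(rewards, gamma=0.9):
    """Compute G_t = r_t + gamma * G_{t+1} for every step of one episode."""
    returns = []
    running = 0.0
    for reward in reversed(rewards):
        running = reward + gamma * running
        returns.append(running)
    returns.reverse()
    return returns
```

Calling `discounted_returns(run_episode(Corridor(), random.Random(0)))` gives every step of the episode a share of the final reward, with earlier steps receiving less. This is one simple, imperfect answer to credit assignment: it cannot tell useful early moves from wasted ones within the same episode.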
A concrete example: training an RL agent to play Atari Breakout. The state is the raw pixel image of the screen. The actions are left, right, and fire. The reward is the score increase when a brick is broken. The agent does not know that bouncing the ball at a specific angle toward a cluster of bricks is a good strategy — it must discover this by playing millions of games and observing which sequences of actions tend to lead to higher scores.
Core RL Algorithms (Conceptually)
Q-Learning / DQN: the agent learns the Q-function, the expected discounted future reward of taking action A in state S and acting optimally afterward. Given the Q-function, the optimal policy is to always take the action with the highest Q-value. DQN (Deep Q-Network) approximates the Q-function with a neural network and was the first algorithm to reach human-level play on many Atari games directly from screen pixels.
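To make the Q-function concrete, here is a minimal sketch of tabular Q-learning (a lookup table, not the neural network DQN uses). The five-state chain task, the hyperparameters, and every name in the code are assumptions chosen for illustration.

```python
import random

N_STATES = 5          # states 0..4; state 4 is the goal (hypothetical toy task)
ACTIONS = (-1, +1)    # move left or right


def step(state, action):
    """Toy dynamics: reward 1.0 on reaching the last state, else 0.0."""
    next_state = max(0, min(N_STATES - 1, state + action))
    done = next_state == N_STATES - 1
    return next_state, (1.0 if done else 0.0), done


def train_q_table(episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]  # q[state][action_index]
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            # Epsilon-greedy: mostly exploit, sometimes explore.
            if rng.random() < epsilon:
                a = rng.randrange(2)
            else:
                a = 0 if q[state][0] > q[state][1] else 1
            next_state, reward, done = step(state, ACTIONS[a])
            # Q-learning target: reward plus discounted best next value.
            target = reward + (0.0 if done else gamma * max(q[next_state]))
            q[state][a] += alpha * (target - q[state][a])
            state = next_state
    return q
```

After training, the greedy action in every non-goal state is "right", and the learned values fall off by roughly a factor of `gamma` per step away from the goal.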
Policy Gradient (REINFORCE, PPO): instead of learning a value function, directly learn the policy (the mapping from states to actions) by gradient ascent on expected reward. PPO (Proximal Policy Optimization) is one of the most widely used policy gradient algorithms in practice because of its stability and relative simplicity. It is the RL algorithm in many RLHF implementations, including the pipeline described later in this guide.
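As a rough illustration of gradient ascent on expected reward, the sketch below runs REINFORCE (not PPO) with a running-mean baseline on a hypothetical two-armed bandit. The payout probabilities, learning rates, and function names are assumptions made for the example.

```python
import math
import random


def softmax(logits):
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def train_bandit_policy(steps=2000, learning_rate=0.1, seed=0):
    """REINFORCE on a 2-armed bandit where arm 1 pays more on average."""
    rng = random.Random(seed)
    win_probability = [0.2, 0.8]   # assumed payout rates, for illustration
    logits = [0.0, 0.0]            # policy parameters
    baseline = 0.0                 # running mean reward, reduces variance
    for _ in range(steps):
        probs = softmax(logits)
        action = 0 if rng.random() < probs[0] else 1
        reward = 1.0 if rng.random() < win_probability[action] else 0.0
        advantage = reward - baseline
        # Gradient of log pi(action) w.r.t. logit k is 1[k == action] - probs[k].
        for k in range(2):
            indicator = 1.0 if k == action else 0.0
            logits[k] += learning_rate * advantage * (indicator - probs[k])
        baseline += 0.05 * (reward - baseline)
    return softmax(logits)
```

The policy never estimates the value of each arm explicitly. It only shifts probability toward actions that did better than the baseline, which is the core of every policy gradient method.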
Actor-Critic (A3C, SAC) — combines policy learning (the actor) and value estimation (the critic). The critic evaluates how good the current state is, providing a lower-variance training signal for the actor. SAC (Soft Actor-Critic) is particularly effective for continuous action spaces (robotics).
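The actor/critic split can be sketched for a single transition. This is a simplified tabular version, not A3C or SAC: the critic is a table of state values, the actor is a table of action preferences, and the actor update is the simple "add the TD error to the chosen action's preference" rule. All names and step sizes are assumptions.

```python
def td_error(reward, value, next_value, done, gamma=0.99):
    """One-step TD error: how much better the step went than the critic expected."""
    bootstrap = 0.0 if done else gamma * next_value
    return reward + bootstrap - value


def actor_critic_step(values, preferences, state, action, reward, next_state,
                      done, actor_lr=0.1, critic_lr=0.1, gamma=0.99):
    """Update a tabular critic and tabular actor preferences from one transition."""
    delta = td_error(reward, values[state], values[next_state], done, gamma)
    values[state] += critic_lr * delta               # critic: move V(s) toward target
    preferences[state][action] += actor_lr * delta   # actor: reinforce if delta > 0
    return delta
```

Because the actor is trained on the TD error rather than on the raw episode return, its signal depends on one step of randomness instead of a whole trajectory, which is where the lower variance comes from.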
You do not need to implement these algorithms from scratch. Libraries like Stable Baselines 3 provide production-ready implementations:
```python
import gymnasium as gym
from stable_baselines3 import PPO

# Train a PPO agent on CartPole.
env = gym.make("CartPole-v1")
model = PPO("MlpPolicy", env, verbose=1)
model.learn(total_timesteps=100_000)

# Run the trained policy for 1000 steps, resetting whenever an episode ends.
obs, _ = env.reset()
for _ in range(1000):
    action, _ = model.predict(obs)
    obs, reward, terminated, truncated, info = env.step(action)
    if terminated or truncated:
        obs, _ = env.reset()
env.close()
```
Where RL Works
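Games with clear rules and simulators: RL has been enormously successful at games because the environment can be simulated perfectly and the reward signal (score, win/loss) is unambiguous. You can run millions of games overnight. RL-based systems have reached superhuman or top-professional play in chess, Go, Atari, StarCraft II, and Dota 2, and closely related self-play methods have done the same in poker. None of these games is solved in the game-theoretic sense.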
Robotics: RL is a widely used approach for learning motor control policies. Physics simulators make it possible to train in simulation and then transfer the policy to the real robot (sim2real). Boston Dynamics and many academic robotics labs use RL for locomotion and manipulation, often alongside classical model-based control.
Recommendation systems: framing recommendation as an RL problem captures the long-term effects of recommendations on user engagement that supervised learning misses. The agent's action is what to recommend, and the reward is a long-term engagement metric.
Algorithmic trading: executing large orders optimally (minimizing market impact) is a sequential decision problem well-suited to RL, and some institutional trading desks use RL-based execution algorithms.
Data center optimization: Google DeepMind applied learned control, an approach closely related to RL, to Google's data center cooling and reported up to a 40% reduction in cooling energy.
Where RL Fails (and What to Use Instead)
Hard-to-define reward functions — RL requires a scalar reward signal that tells the agent how well it is doing. If you cannot define this concisely and correctly, the agent will optimize for something you did not intend (reward hacking). "Make users happy" is not a reward function. The reward function design problem is as hard as the RL problem itself in many real-world applications. Consider supervised learning if you have labeled examples of good behavior.
Expensive environment simulation — RL requires millions of environment interactions to learn. If each interaction requires running a real-world experiment (running a drug trial, operating physical hardware, making a financial trade), RL is impractical. Model-based RL (learning a model of the environment and planning within it) can reduce sample requirements, but this adds complexity.
Sample inefficiency — even with simulation, RL requires far more examples than supervised learning to learn comparable tasks. Deep RL agents typically need 10-100 million environment steps to learn what a human learns in minutes of play. This is why RL is primarily used where simulation is cheap or data is abundant.
Static datasets — if you have a fixed dataset and cannot interact with an environment, standard RL does not apply (the agent cannot take actions and observe consequences). Offline RL algorithms (CQL, IQL) attempt to learn policies from static datasets but are more complex and typically underperform online RL when simulation is available.
RLHF: Reinforcement Learning From Human Feedback
RLHF is the technique that transforms a language model from "predicts text plausibly" to "gives helpful, honest, harmless responses." It is a key step in training InstructGPT and ChatGPT, and RLHF or closely related preference-based methods are used for Claude and most other modern conversational AI systems.
The three-step RLHF pipeline:
Step 1: Supervised Fine-Tuning (SFT) — fine-tune the base language model on a dataset of (prompt, ideal response) pairs created by human contractors. This teaches the model the general format and style of helpful responses.
Step 2: Reward Model Training — collect human preference data: for a given prompt, show two model responses and have a human rater indicate which is better. Train a separate "reward model" to predict which response humans will prefer. The reward model learns a scalar score for any (prompt, response) pair.
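Step 3: RL Optimization. Use PPO to optimize the SFT model to generate responses that maximize the reward model's score, subject to a KL-divergence constraint that keeps the model from straying too far from the SFT model. In practice the constraint is usually implemented as a penalty: the reward is reduced in proportion to the estimated KL divergence from the SFT model, which limits reward hacking and degenerate outputs.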
The KL constraint is critical: without it, the model quickly learns to generate text that exploits weaknesses in the reward model (outputting strings that get high reward scores but are not actually helpful). The constraint keeps the model close to the supervised baseline while improving along the reward model's signal.
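The two quantities at the heart of steps 2 and 3 can be sketched as plain functions. This is an illustration under stated assumptions, not the code of any particular RLHF library: the pairwise loss shown is the Bradley-Terry style loss commonly used for reward models, the KL term is a simple per-sample estimate from token log-probabilities, and the function names and the `beta` value are hypothetical.

```python
import math


def preference_loss(score_chosen, score_rejected):
    """Step 2: -log(sigmoid(chosen - rejected)), a Bradley-Terry style loss."""
    margin = score_chosen - score_rejected
    # Numerically stable form of log(1 + exp(-margin)).
    if margin > 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def kl_penalized_reward(reward_model_score, policy_logprobs, sft_logprobs, beta=0.1):
    """Step 3: reward model score minus beta times a per-sample KL estimate."""
    # Sum over response tokens of log pi(token) - log pi_sft(token).
    kl_estimate = sum(p - s for p, s in zip(policy_logprobs, sft_logprobs))
    return reward_model_score - beta * kl_estimate
```

The loss is small when the reward model scores the human-preferred response higher, and the penalized reward shrinks as the policy assigns its own outputs much higher probability than the SFT model would. A larger `beta` keeps the policy closer to the SFT model at the cost of slower improvement on the reward model's signal.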
RLHF is why fine-tuned LLMs are useful rather than merely coherent. A model without RLHF will complete text in plausible ways; a model with RLHF will attempt to answer questions helpfully, follow instructions, and decline harmful requests.
When to Choose RL vs Supervised Learning
Use supervised learning when:
- You have labeled examples of correct behavior
- The task is a single prediction (no sequential decision making)
- You do not have access to a simulator or environment
Use RL when:
- The task involves sequential decision making over time
- Actions have delayed consequences that a single-step prediction cannot capture
- You have access to a simulator or can collect environment interactions cheaply
- The feedback signal is naturally a reward (score, win/loss, revenue) rather than labeled examples
The practical reality: most real-world problems should start with supervised learning. RL carries high engineering overhead and is much harder to debug. Use RL when supervised learning cannot capture the temporal dependencies that matter for your problem.