Where RL Works

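Games with clear rules and simulators - RL has been enormously successful at games because the environment can be simulated exactly and the reward signal (score, win/loss) is unambiguous. You can run millions of games overnight. RL-based agents have reached superhuman or top-professional play in chess, Go, many Atari games, StarCraft II, and Dota 2. The strongest poker agents rely on closely related self-play methods (counterfactual regret minimization) rather than standard RL. None of these games is "solved" in the game-theoretic sense; the agents simply play them extremely well.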

Robotics - RL is a widely used approach for learning motor control policies. Physics simulators make it possible to train a policy in simulation and then transfer it to the real robot (sim2real). Many academic labs and companies such as Boston Dynamics use RL for locomotion and manipulation, often alongside classical model-based control.

Recommendation systems - framing recommendation as an RL problem captures the long-term effect of each recommendation on user engagement, which a supervised model trained to predict the next click misses. The agent's action is what to recommend; the reward is a long-term engagement metric.
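To make the "long-term effects" point concrete, here is a minimal sketch comparing two recommendation strategies by their first-step reward and by their discounted return. The engagement numbers, the strategy names, and the discount factor are made up for illustration; they are not measurements from any real system.

```python
def discounted_return(rewards, gamma=0.95):
    """Sum of gamma**t * r_t over a sequence of per-step rewards."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))


# Hypothetical per-session engagement traces (illustrative numbers only).
clickbait = [1.0, 0.2, 0.0, 0.0, 0.0]  # strong first click, then the user leaves
relevant = [0.6, 0.6, 0.6, 0.6, 0.6]   # weaker first click, steady engagement

# A one-step (myopic) comparison only looks at the immediate reward.
immediate_winner = "clickbait" if clickbait[0] > relevant[0] else "relevant"

# An RL objective compares the discounted sum over the whole trajectory.
long_term_winner = (
    "clickbait"
    if discounted_return(clickbait) > discounted_return(relevant)
    else "relevant"
)
```

With these assumed numbers the myopic comparison favors the clickbait item, while the discounted return (about 1.19 versus about 2.71) favors the relevant one.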

Algorithmic trading - executing large orders optimally (minimizing market impact) is a sequential decision problem well-suited to RL. RL-based execution algorithms are an active research area and are reported to be in use at some institutional trading desks.
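The sketch below shows why order execution is a sequential decision problem: the cost of each sale depends on how the order is split across time. `ExecutionEnv`, its quadratic impact cost, and both policies are hypothetical toys invented for this example, not a market model and not any desk's actual algorithm. An RL agent would learn the per-step choice; here two fixed policies are simply rolled out for comparison.

```python
class ExecutionEnv:
    """Toy order-execution environment (hypothetical, for illustration only)."""

    def __init__(self, total_shares=10_000, horizon=10, impact=1e-4):
        self.total_shares = total_shares
        self.horizon = horizon
        self.impact = impact
        self.reset()

    def reset(self):
        self.remaining = self.total_shares
        self.t = 0
        return (self.remaining, self.t)

    def step(self, shares):
        # Clamp the action to what is left; the final step must sell the rest.
        shares = max(0, min(shares, self.remaining))
        if self.t == self.horizon - 1:
            shares = self.remaining
        reward = -self.impact * shares ** 2  # assumed quadratic impact cost
        self.remaining -= shares
        self.t += 1
        done = self.t == self.horizon
        return (self.remaining, self.t), reward, done


def rollout(env, policy):
    """Run one episode and return the total (negative) cost."""
    state, total, done = env.reset(), 0.0, False
    while not done:
        state, reward, done = env.step(policy(state, env))
        total += reward
    return total


def sell_everything_now(state, env):
    remaining, _ = state
    return remaining


def sell_evenly(state, env):
    remaining, t = state
    return remaining // (env.horizon - t)


cost_dump = rollout(ExecutionEnv(), sell_everything_now)
cost_even = rollout(ExecutionEnv(), sell_evenly)
```

Under the assumed cost model, selling everything in the first step costs roughly ten times more than spreading the order evenly over the ten steps (about -10000 versus about -1000 in total reward).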

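Data center optimization - DeepMind applied machine learning to the control of Google's data center cooling and reported a reduction of up to 40% in the energy used for cooling. It is frequently cited as a production success for RL-style learned control.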

Where RL Fails (and What to Use Instead)

Hard-to-define reward functions - RL requires a scalar reward signal that tells the agent how well it is doing. If you cannot define this concisely and correctly, the agent will optimize for something you did not intend (reward hacking). "Make users happy" is not a reward function. The reward function design problem is as hard as the RL problem itself in many real-world applications. Consider supervised learning if you have labeled examples of good behavior.
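A minimal sketch of reward hacking, assuming a hypothetical support bot whose measured reward is "tickets closed" while the intended goal is "issues resolved". The strategy names and hourly numbers are invented for illustration.

```python
# Hypothetical hourly outcomes for a support bot (made-up numbers).
strategies = {
    "fix_the_issue":        {"tickets_closed": 3,  "issues_resolved": 3},
    "close_without_fixing": {"tickets_closed": 20, "issues_resolved": 0},
    "escalate_everything":  {"tickets_closed": 0,  "issues_resolved": 0},
}


def best_strategy(outcomes, reward_key):
    """Return the strategy a pure reward maximizer would pick for this reward."""
    return max(outcomes, key=lambda name: outcomes[name][reward_key])


proxy_choice = best_strategy(strategies, "tickets_closed")      # what was measured
intended_choice = best_strategy(strategies, "issues_resolved")  # what was wanted
```

Maximizing the proxy selects `close_without_fixing`, while the intended objective selects `fix_the_issue`. The agent is not malfunctioning in either case; it optimizes exactly the number it was given.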

Expensive environment interaction - RL often requires millions of environment interactions to learn. If each interaction is a real-world experiment (a drug trial, a run on physical hardware, a live financial trade), standard RL is impractical. Model-based RL (learning a model of the environment and planning within it) can reduce the number of real interactions needed, at the cost of added complexity and the risk of planning against the model's errors.
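The idea behind model-based RL can be sketched on a toy problem: spend a small budget of real interactions to record how the environment behaves, then plan against the recorded model without touching the real environment again. Everything here (`real_step`, the five-state corridor, the function names) is a hypothetical illustration. Because the toy environment is tiny and deterministic, one visit per state-action pair gives an exact model; real systems have to fit an approximate model from noisy data.

```python
def real_step(state, action, n_states=5):
    """Stand-in for the expensive real environment: a corridor with the goal on the right."""
    goal = n_states - 1
    next_state = min(max(state + action, 0), goal)
    reward = 1.0 if next_state == goal else 0.0
    return next_state, reward


def learn_model(step_fn, n_states=5, actions=(-1, 1)):
    """Try each (state, action) once in the real environment and record the outcome."""
    model, real_calls = {}, 0
    for state in range(n_states):
        for action in actions:
            model[(state, action)] = step_fn(state, action)
            real_calls += 1
    return model, real_calls


def plan(model, n_states=5, actions=(-1, 1), gamma=0.9, sweeps=50):
    """Value iteration that only queries the learned model, never the real environment."""
    def backup(state, action, values):
        next_state, reward = model[(state, action)]
        return reward + gamma * values[next_state]

    values = [0.0] * n_states
    for _ in range(sweeps):
        values = [max(backup(s, a, values) for a in actions) for s in range(n_states)]
    # Greedy policy with respect to the final value estimates.
    return [max(actions, key=lambda a: backup(s, a, values)) for s in range(n_states)]


model, real_calls = learn_model(real_step)
policy = plan(model)
```

Here the planner runs 50 sweeps over the model but the real environment is called only 10 times, and the resulting policy moves right (+1) in every state.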

Sample inefficiency - even with simulation, RL usually needs far more data than supervised learning to reach comparable competence. Deep RL agents often need tens of millions of environment steps to reach a level of play that a human reaches within minutes or hours. This is why RL is primarily used where simulation is cheap or data is abundant.

Static datasets - if you have a fixed dataset and cannot interact with an environment, standard RL does not apply (the agent cannot take actions and observe consequences). Offline RL algorithms such as CQL (Conservative Q-Learning) and IQL (Implicit Q-Learning) attempt to learn policies from static datasets, but they are more complex and typically underperform online RL when a simulator is available.

RLHF: Reinforcement Learning From Human Feedback

RLHF is the technique that moves a language model from "predicts text plausibly" toward "gives helpful, honest, harmless responses." It is a key step in training InstructGPT, ChatGPT, Claude, and most modern conversational AI systems.

The three-step RLHF pipeline:

Step 1: Supervised Fine-Tuning (SFT) - fine-tune the base language model on a dataset of (prompt, ideal response) pairs created by human contractors. This teaches the model the general format and style of helpful responses.

Step 2: Reward Model Training - collect human preference data: for a given prompt, show two model responses and have a human rater indicate which is better. Train a separate "reward model" to predict which response humans will prefer. The reward model learns a scalar score for any (prompt, response) pair.

Step 3: RL Optimization - use PPO (Proximal Policy Optimization) to fine-tune the SFT model so that its responses maximize the reward model's score, with a KL-divergence penalty (often described as a KL constraint) that discourages the model from straying too far from the SFT model. This limits reward hacking and degenerate outputs.

This KL term is critical: without it, the model quickly learns to generate text that exploits weaknesses in the reward model (outputting strings that get high reward scores but are not actually helpful). It keeps the model close to the supervised baseline while the model improves along the reward model's signal.
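Steps 2 and 3 can be sketched numerically. The pairwise loss below is the form commonly used for reward models, -log(sigmoid(score_chosen - score_rejected)), and the shaped reward subtracts a penalty proportional to the log-probability ratio between the current policy and the SFT model. The function names, the `beta` value, and the example numbers are assumptions made for illustration; production pipelines compute these quantities on batches of tensors and often apply the penalty per token.

```python
import math


def preference_loss(score_chosen, score_rejected):
    """Pairwise reward-model loss: -log(sigmoid(score_chosen - score_rejected))."""
    margin = score_chosen - score_rejected
    # log(1 + exp(-margin)) is the same quantity written without the sigmoid.
    return math.log1p(math.exp(-margin))


def kl_shaped_reward(rm_score, logp_policy, logp_sft, beta=0.1):
    """Reward-model score minus a penalty for drifting away from the SFT model.

    logp_policy and logp_sft are the log-probabilities that the current policy
    and the frozen SFT model assign to the same sampled response.
    """
    return rm_score - beta * (logp_policy - logp_sft)


# Step 2: the loss is small when the reward model ranks the preferred response higher.
loss_agrees = preference_loss(score_chosen=2.0, score_rejected=0.0)
loss_disagrees = preference_loss(score_chosen=0.0, score_rejected=2.0)

# Step 3: a response that scores well but is very unlikely under the SFT model
# (a possible exploit) is penalized more than one that stays close to it.
exploit = kl_shaped_reward(rm_score=5.0, logp_policy=-10.0, logp_sft=-40.0)
on_distribution = kl_shaped_reward(rm_score=4.0, logp_policy=-20.0, logp_sft=-21.0)
```

With these assumed numbers the exploit response has the higher raw reward-model score (5.0 versus 4.0) but the lower shaped reward (2.0 versus 3.9), which is the behavior the KL term is meant to produce. Setting `beta` to zero removes the penalty and restores the raw ranking.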

RLHF is a large part of why fine-tuned LLMs are useful rather than merely coherent. A base model without instruction tuning or RLHF will complete text in plausible ways; a model trained with RLHF will attempt to answer questions helpfully, follow instructions, and decline harmful requests.

When to Choose RL vs Supervised Learning

Use supervised learning when:

You have labeled examples of correct behavior
The task is a single prediction (no sequential decision making)
You do not have access to a simulator or environment

Use RL when:

The task involves sequential decision making over time
Actions have delayed consequences that a single-step prediction cannot capture
You have access to a simulator or can collect environment interactions cheaply
The feedback signal is naturally a reward (score, win/loss, revenue) rather than labeled examples

The practical reality: most real-world problems should use supervised learning first. RL has a high engineering overhead and is much harder to debug. Use RL when supervised learning cannot capture the temporal dependencies that matter for your problem.
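The checklist above can be encoded as a small helper. This is only one possible reading of the criteria; the function name, its parameters, and the returned strings are hypothetical, and the result is meant as a starting point rather than a verdict.

```python
def suggest_approach(has_labeled_examples, is_sequential,
                     has_delayed_consequences, can_interact_cheaply):
    """Rough encoding of the supervised-vs-RL checklist; all inputs are booleans."""
    needs_temporal_credit = is_sequential and has_delayed_consequences
    if needs_temporal_credit and can_interact_cheaply:
        return "reinforcement learning"
    if has_labeled_examples:
        return "supervised learning"
    if needs_temporal_credit:
        return "offline or model-based RL (no cheap interaction available)"
    return "supervised learning (collect labels first)"


# A single-prediction task with labels: supervised learning.
image_classifier = suggest_approach(True, False, False, False)

# A game with a fast simulator and delayed win/loss signal: RL.
board_game = suggest_approach(False, True, True, True)
```

The helper returns RL only when the problem is sequential, has delayed consequences, and allows cheap interaction, which matches the advice to default to supervised learning otherwise.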

Keep Reading

Neural Network Training Guide - the training foundations that RL builds on
How Large Language Models Work - the transformer architecture that RLHF is applied to
Machine Learning Complete Guide for Software Developers - the broader ML landscape


Reinforcement Learning for Software Developers: A Practical Guide


Mahmudul Haque Qudrati


