Introduction to Reinforcement Learning

March 06, 2026

#RL #Learning #computer-science

This note is almost entirely based on the book Reinforcement Learning: An Introduction by Sutton and Barto. I highly encourage reading the book itself for more detail and clearer explanations.

Reinforcement Learning (RL) refers to problems that involve learning what to do to maximize a numerical reward signal. The three most important features of RL problems are

- They are closed-loop: the learning system’s actions influence its later inputs.
- The agent gets no direct instructions about which actions to take; it must discover which actions yield the most reward by trying them out.
- The consequences of actions, including reward signals, play out over extended time periods.

There are important differences between the main types of learning:
- Supervised learning is learning from a training set of labeled examples provided by an external supervisor. The object of this kind of learning is for the system to extrapolate, or generalize, so that it acts correctly in situations not present in the training set.
- Unsupervised learning is typically about finding structures hidden in collections of unlabeled data.
- Reinforcement learning, on the other hand, tries to maximize a reward signal rather than find hidden structure. Having an explicit goal and dealing with the exploration-exploitation trade-off are among the most important aspects of RL.
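
The closed-loop and delayed-reward features can be made concrete with a minimal sketch. Everything here is an assumption made for illustration and does not come from the book: the `Corridor` environment, its `reset`/`step` interface, and the `run_episode` helper are hypothetical names. The agent starts at the left end of a short corridor and is rewarded only when it reaches the right end.

```python
import random


class Corridor:
    """Hypothetical toy environment: states 0..length-1, reward only at the right end."""

    def __init__(self, length=5):
        self.length = length
        self.state = 0

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        # action is -1 (left) or +1 (right); the new state is clamped to the corridor
        self.state = max(0, min(self.length - 1, self.state + action))
        done = self.state == self.length - 1
        reward = 1.0 if done else 0.0  # delayed reward: nothing until the goal
        return self.state, reward, done


def run_episode(env, choose_action, max_steps=100):
    """Closed loop: each action changes the state the agent observes next."""
    state = env.reset()
    total_reward = 0.0
    trajectory = [state]
    for _ in range(max_steps):
        action = choose_action(state)           # no supervisor says which action is right
        state, reward, done = env.step(action)  # the action determines the next input
        total_reward += reward
        trajectory.append(state)
        if done:
            break
    return total_reward, trajectory


rng = random.Random(0)
# Pure trial and error: pick a direction at random at every step.
random_return, random_path = run_episode(Corridor(), lambda s: rng.choice([-1, 1]))
# A fixed behavior that always moves right reaches the goal in four steps.
right_return, right_path = run_episode(Corridor(), lambda s: 1)
```

The always-right agent collects a total reward of 1.0 along the path 0, 1, 2, 3, 4. The random agent may wander for a long time before it sees any reward, which is the kind of situation where the exploration-exploitation trade-off matters.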

Elements of Reinforcement Learning

The four main sub-elements of a reinforcement learning system are a policy, a reward signal, a value function, and, optionally, a model of the environment.

- A policy defines the agent’s way of behaving at a given time. Roughly speaking, a policy is a mapping from perceived states of the environment to the actions to be taken in those states.
- A reward signal defines the goal in an RL problem. The agent’s only objective is to maximize the total reward it receives over the long run.
- A value function describes what is good in the long run. The value of a state is the expected total amount of reward an agent can accumulate in the future, starting from that state. Action choices are made based on value judgments. Rewards are given directly by the environment, whereas values must be estimated from the sequences of observations the agent makes.
- The final element of some RL systems is a model of the environment, something that mimics the environment’s behavior. Models are used for planning. RL methods that use models and planning are called model-based methods, as opposed to model-free methods, which are explicitly trial-and-error learners.
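
As a rough illustration of how the four elements fit together, here is a minimal sketch on a tiny deterministic problem. The four-state line, the function names `policy`, `model`, and `state_value`, and the discount factor `gamma` are all assumptions introduced for this example; discounting in particular is not mentioned in the text above. Because the toy problem is deterministic, the expected return reduces to a single rollout.

```python
# Hypothetical toy problem: states 0..3 on a line, state 3 is the terminal goal.
GOAL = 3


def policy(state):
    """Policy: maps a perceived state to an action (here, always move right)."""
    return +1


def model(state, action):
    """Model: mimics the environment by predicting the next state and reward."""
    next_state = max(0, min(GOAL, state + action))
    reward = 1.0 if next_state == GOAL else 0.0  # reward signal: immediate feedback
    return next_state, reward


def state_value(state, policy, model, gamma=0.9, max_steps=50):
    """Value: total discounted reward collected from `state` when following `policy`.

    The rollout uses the model rather than real experience, so this is a
    simple form of planning.
    """
    total, discount = 0.0, 1.0
    for _ in range(max_steps):
        if state == GOAL:
            break
        state, reward = model(state, policy(state))
        total += discount * reward
        discount *= gamma
    return total


values = {s: state_value(s, policy, model) for s in range(GOAL + 1)}
```

Moving from state 0 to state 1 yields an immediate reward of 0, yet state 0 still has a positive value (about 0.81 with `gamma=0.9`) because reward follows later. This is the difference between rewards, which are immediate, and values, which look ahead.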

You can continue learning about reinforcement learning with the classic multi-armed bandit example here.

Sources:

Reinforcement Learning: An Introduction by Richard S. Sutton and Andrew G. Barto