Fundamentals of Reinforcement Learning

Premkumar Vemula
4 min read · May 31, 2021

Getting Started

image credit: cdotrends.com

In this article, we will cover the basics of Reinforcement Learning, which will lay the foundation for understanding its individual algorithms, each of which I will cover in a separate blog post.

Reinforcement Learning is a computational approach to goal-directed learning from interaction with an environment, studied through idealized learning situations. We have an environment and an agent. The environment exposes its state, the agent senses that state and then takes an action. The environment processes the action and produces two things: a reward and a new state. The cycle then continues: the agent senses the new state, produces a new action, and so on. The other major influence on RL has been optimal control theory, a term used to describe the problem of designing a controller to minimize a measure of a dynamical system's behavior over time. In the 1950s, Richard Bellman devised an approach to this problem involving the dynamical system's state and a value function, based on an equation that today we call the Bellman equation. Bellman also introduced the discrete stochastic version of the problem, known as the Markov decision process.
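For reference in the later sections, the Bellman equation for the value of a state s under a policy π can be written in its standard textbook form as follows (this is the general equation, not anything specific to a particular algorithm):

```latex
% Bellman (expectation) equation for the state-value function of a policy \pi.
% p(s', r \mid s, a) denotes the environment's transition dynamics and \gamma the discount factor.
V^{\pi}(s) = \sum_{a} \pi(a \mid s) \sum_{s',\, r} p(s', r \mid s, a)\,\bigl[\, r + \gamma\, V^{\pi}(s') \,\bigr]
```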

Reinforcement Learning is different from the other two types of Machine Learning algorithms. In supervised learning, the goal is to learn to predict the Y label, given the associated X data, in such a way that the learning generalizes to unseen data beyond the training set. For the training data, the Y label is supplied by a domain expert. This is sometimes called learning with a teacher, since the right or correct answer is given. Unsupervised learning attempts to learn the structure hidden in a data set, toward a predefined type of representation such as clustering, anomaly detection, or independent representations. Unsupervised learning takes no actions and receives no feedback; it just operates on the X data set. In reinforcement learning, an agent must learn which action to select at each time step, receiving a reward as feedback that is usually sparse in nature. Instead of being the correct answer, this feedback is a scalar number representing the relative goodness of the sequence of actions recently taken, usually without a firm starting point for that sequence. The agent must learn, through trial and error, the sequence that gives the highest total reward. Also, in reinforcement learning the next-state and reward functions are usually stochastic: the same action in the same state may produce different rewards and different next states. We consider reinforcement learning to be a third type of machine learning. Even though it may use supervised learning or unsupervised learning as part of its method, it is a distinct type of learning with its own set of challenges and methods.

There are a few basic elements of Reinforcement Learning. The time step divides time into discrete steps, each of which corresponds to one cycle of the environment-agent interaction; we usually denote this time by t. The environment defines the world that the agent interacts with, and it follows a basic loop: it produces a state and a reward for the agent to sense and process, then accepts an action from the agent and cycles back to produce the next state. The agent learns to achieve goals by interacting with the environment; its loop is to sense the state and the reward, select an action to pass to the environment, and repeat. The state represents the situation in the environment on which the agent bases its actions. A minimal sketch of this loop follows.
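To make the loop concrete, here is a minimal Python sketch of the sense-act-reward cycle. The Environment and Agent classes (and their reset, step, and select_action methods) are hypothetical placeholders invented for illustration, not part of any particular library:

```python
# Minimal sketch of the agent-environment interaction loop (illustrative only).
import random

class Environment:
    def reset(self):
        """Return the initial state."""
        return 0

    def step(self, state, action):
        """Process an action and return (reward, next_state, done)."""
        reward = random.choice([0.0, 1.0])   # rewards may be stochastic
        next_state = state + 1               # this toy environment ignores the action
        done = next_state >= 10              # episode ends after 10 steps
        return reward, next_state, done

class Agent:
    def select_action(self, state):
        """Sense the state and choose an action (here: at random)."""
        return random.choice(["left", "right"])

env, agent = Environment(), Agent()
state, total_reward = env.reset(), 0.0

for t in range(10):                               # t indexes the discrete time steps
    action = agent.select_action(state)           # agent senses the state and acts
    reward, state, done = env.step(state, action) # environment responds with reward and new state
    total_reward += reward
    if done:
        break

print("total reward:", total_reward)
```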

There are two broad approaches to solving an RL problem. The first is value-function methods, in which we estimate the value of states or state-action pairs and base our policy on selecting the actions that lead to the highest-value states. The other set of methods is direct policy search, in which we model the policy itself: the input is typically a state (or something we approximate as a state), the output is the action we want to take, either discrete or continuous, and the model parameters are adjusted in the direction of the greatest policy improvement. A rough sketch contrasting the two appears below. Before getting into the algorithms, it is also important to define a few terms formally.
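As a rough sketch of the distinction (using made-up state and action counts, and value/parameter tables that would in practice be learned), a value-function method acts greedily with respect to estimated Q-values, while direct policy search samples actions from a parameterized policy, here a softmax over action preferences:

```python
import numpy as np

n_states, n_actions = 5, 3

# --- Value-function approach: act greedily w.r.t. estimated Q-values ---
Q = np.zeros((n_states, n_actions))          # tabular Q estimates (learned elsewhere)

def greedy_action(state):
    return int(np.argmax(Q[state]))          # pick the action with the largest estimated value

# --- Direct policy search: sample from a parameterized (softmax) policy ---
theta = np.zeros((n_states, n_actions))      # policy parameters (adjusted elsewhere)

def sampled_action(state):
    prefs = theta[state]
    probs = np.exp(prefs - prefs.max())
    probs /= probs.sum()                     # softmax over action preferences
    return int(np.random.choice(n_actions, p=probs))

print(greedy_action(0), sampled_action(0))
```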

Action (A): All the possible moves that the agent can take.

State (S): All the information that the agent has about the environment.

Reward (R): An immediate return sent back by the environment to evaluate the last action.

Policy (π): The mapping from states to the probability of selecting each possible action.
π(a|s) = Probability that Aₜ = a if Sₜ = s

Value (V^π(s)): The value function of a state s under a policy π; it is the expected return when starting in state s and following π thereafter (the sketch after these definitions shows how it can be computed for a small example).

Q-value or action-value (Q): The Q-value is similar to the value, except that it takes an extra parameter, the current action a. Q^π(s, a) refers to the long-term return when starting in state s, taking action a, and following policy π thereafter.
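Tying these definitions together, the following sketch evaluates V^π for a tiny, made-up MDP by repeatedly applying the Bellman equation shown earlier (iterative policy evaluation). The transition table and the uniformly random policy are illustrative placeholders only:

```python
# Sketch: evaluating V^pi for a tiny, made-up MDP by iterative policy evaluation,
# i.e. repeatedly applying the Bellman expectation backup.
import numpy as np

n_states, n_actions = 3, 2
gamma = 0.9                                   # discount factor

# P[(s, a)] -> list of (probability, next_state, reward) transitions
P = {
    (0, 0): [(1.0, 1, 0.0)],
    (0, 1): [(1.0, 2, 1.0)],
    (1, 0): [(0.5, 0, 0.0), (0.5, 2, 2.0)],
    (1, 1): [(1.0, 2, 0.0)],
    (2, 0): [(1.0, 2, 0.0)],                  # state 2 is absorbing
    (2, 1): [(1.0, 2, 0.0)],
}
pi = np.full((n_states, n_actions), 0.5)      # uniformly random policy pi(a|s)

V = np.zeros(n_states)
for _ in range(200):                          # sweep until approximately converged
    for s in range(n_states):
        V[s] = sum(
            pi[s, a] * prob * (r + gamma * V[s2])
            for a in range(n_actions)
            for prob, s2, r in P[(s, a)]
        )

print("V^pi ≈", V.round(3))
```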

Link to my next article on the same topic: Markov Decision Process.
