Reinforcement learning (Sutton & Barto, 1998) is a formal mathematical framework in which an agent interacts with its environment through a series of actions and, in response to each action, receives a reward value. The agent tries to choose the actions that yield the best rewards; in fact, its aim is to maximize the total reward it accumulates over time. The agent is never told which action to choose, but the reward signals allow it to discover the best one.
Reinforcement Learning: Supervised or Unsupervised?
As we know, the essence of supervised learning is that you teach your model: you give it labeled inputs and you control its outputs. Unsupervised learning, on the other hand, means that you let your model detect patterns or anomalies in the data, objects, or actions in your system, using algorithms that allow the model to label the data itself. In RL there is neither a teacher nor labeled data: the agent relies on itself, performing actions and then observing whether their effects are positive or negative.
How does RL work?
The principle of RL is inspired by children's learning. Faced with a complex situation where knowledge of the environment is limited and the right behavior is unknown, a child executes random actions to discover the world around him: he moves randomly, picks things up, and manipulates his surroundings (food, water, electrical outlets, people's reactions, ...). For example, if he does something and his parents are unhappy, he understands that he did a bad thing.
The agent interacts with its environment step by step. At each step t = 0, 1, 2, 3, ... the agent perceives that the environment is in state st and performs action at; the environment then makes a transition to state st+1 and emits a reward rt+1. The agent seeks to maximize this reward.
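This step-by-step interaction can be sketched as a simple loop. The two-state environment and its `step` function below are a hypothetical toy, assumed purely for illustration:

```python
import random

# Hypothetical toy environment with two states (0 and 1).
# step(state, action) returns (next_state, reward), i.e. s_{t+1} and r_{t+1}.
def step(state, action):
    next_state = (state + action) % 2       # toy transition rule
    reward = 1.0 if next_state == 1 else 0.0
    return next_state, reward

state = 0
total_reward = 0.0
for t in range(5):                          # steps t = 0, 1, 2, ...
    action = random.choice([0, 1])          # agent picks an action a_t
    state, reward = step(state, action)     # environment emits s_{t+1}, r_{t+1}
    total_reward += reward                  # agent accumulates reward
```

Here the agent acts randomly, like the exploring child above; a learning agent would instead use the accumulated rewards to improve its choice of actions.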
We can formulate this task with the function R(s, a) = rt ∈ ℝ: the agent receives this result at step t and tries to maximize it at the next step t+1.
Note that the RL goal must be specified from the beginning; for example, when playing draughts the player seeks to win the game, not to lose it. That is the goal of the RL task, and the problem is to choose the action policy that maximizes the total reward received by the agent. An action policy corresponds to a function π(s) = a, meaning that state s generates the action a executed by the agent. The agent searches for the policy that maximizes the sum of rewards; it is called π*.
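Concretely, a policy π can be represented as a mapping from states to actions. The two policies below, over a hypothetical three-state problem, are invented for illustration:

```python
# A policy maps each state to an action: π(s) = a.
# Two hypothetical policies for a toy 3-state problem.
policy_a = {0: "left", 1: "right", 2: "left"}
policy_b = {0: "right", 1: "right", 2: "right"}

def act(policy, state):
    """Return the action the policy prescribes in this state."""
    return policy[state]

# The optimal policy π* is whichever mapping maximizes the
# total reward over time; finding it is the RL problem.
chosen = act(policy_a, 1)  # → "right"
```

Which of the two policies is closer to π* depends entirely on the environment's rewards; the code only shows the shape of the object being searched for.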
Q-learning is one formulation of reinforcement learning; the capital letter Q stands for the quality of learning, as measured by the reinforcement function. In each state the agent seeks to maximize its total reward, so it initializes its reward estimates and starts performing actions and estimating rewards. Then, with each choice of action, the agent observes the reward and the new state (which depends on the previous state and the current action). The heart of the algorithm is an update of the value function, applied at each step as follows:

Q(s, a) ← Q(s, a) + α [ r + γ max_a' Q(s', a') − Q(s, a) ]
where s' is the new state, s is the previous state, a is the action chosen by the agent, r is the reward received, α is the learning rate, and γ is the discount factor.
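The update rule above translates almost directly into code. The sketch below uses a table of Q values keyed by (state, action) pairs; the specific α, γ, states, and actions are arbitrary example values:

```python
from collections import defaultdict

alpha = 0.5   # learning rate α
gamma = 0.9   # discount factor γ

Q = defaultdict(float)  # Q(s, a) table, all entries start at 0

def q_update(s, a, r, s_next, actions):
    # Q(s,a) ← Q(s,a) + α [ r + γ max_a' Q(s',a') − Q(s,a) ]
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])

actions = [0, 1]
q_update(s=0, a=1, r=1.0, s_next=1, actions=actions)
print(Q[(0, 1)])  # 0.5 * (1.0 + 0.9*0 - 0) = 0.5
```

One such update is applied after every step of the agent-environment interaction, so the Q values gradually converge toward the true long-term value of each (state, action) pair.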
How do α and γ work?
As we said, α is the learning rate, so it represents how strongly our agent learns from new information. If α = 0, the estimate for the new state is kept identical to the previous one and the agent does not take the new information into account; if α = 1, the agent fully replaces the old estimate with the new information.
On the other hand, γ is the discount factor: it controls the importance of future rewards. If γ is near 0, the agent is myopic; it considers only the current rewards and not the future ones. A γ at or beyond 1 may make the sum of Q values diverge.
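The effect of γ can be seen by computing the discounted return for a constant stream of rewards. The reward stream below is an assumed example (a reward of 1 at every step):

```python
# Discounted return: sum over t of γ^t * r_t.
def discounted_return(gamma, rewards):
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

rewards = [1.0] * 100       # a reward of 1 at each of 100 steps

myopic   = discounted_return(0.1, rewards)  # ≈ 1.11: only near-term rewards count
farsight = discounted_return(0.9, rewards)  # ≈ 10.0: future rewards matter
no_disc  = discounted_return(1.0, rewards)  # = 100: grows without bound as steps increase
```

With γ = 0.1 almost all weight sits on the first reward, while γ = 1 makes the sum grow linearly with the horizon, which is why γ < 1 is needed for the values to stay bounded over infinite horizons.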