Understanding the Fundamentals of Reinforcement Learning

Chapter 1: Overview of Reinforcement Learning

Reinforcement learning is concerned with modeling the transition from a state (s), via an action (a), to a new state (s'). The primary objective is for the algorithm to develop an optimal policy, enabling it to take the best possible action given the current state. Over time, actions that yield favorable results are reinforced, while those that lead to undesirable outcomes are discouraged.

Optimal Policy

An optimal policy is designed to maximize the average reward over time. It must consider both immediate outcomes and associated costs. Additionally, it adopts a non-myopic approach: future outcomes are taken into account, although near-term effects are typically weighted more heavily than those in the distant future.

Reinforcement Learning Setup

To effectively model the current state, we discretize each continuous value into n bins.

[Figure: Discretization of continuous values in reinforcement learning]

This discretization process is also applied to actions.

[Figure: Action discretization process in reinforcement learning]
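
As a concrete illustration, here is a minimal sketch of this binning step in Python. The value ranges and bin counts are assumptions chosen for the example, not values taken from the text.

import numpy as np

# Illustrative assumptions: a 1-D continuous state in [-1, 1] and a 1-D
# continuous action in [-2, 2], each split into a fixed number of bins.
n_state_bins = 10
n_action_bins = 5

# Interior bin edges; np.digitize returns the index of the bin a value falls into.
state_edges = np.linspace(-1.0, 1.0, n_state_bins + 1)[1:-1]
action_edges = np.linspace(-2.0, 2.0, n_action_bins + 1)[1:-1]

def discretize(value, edges):
    # Map a continuous value to a discrete bin index.
    return int(np.digitize(value, edges))

s = discretize(0.37, state_edges)    # discrete state index in 0..n_state_bins-1
a = discretize(-1.2, action_edges)   # discrete action index in 0..n_action_bins-1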

As a result, we create a state-action matrix and establish a function known as the Q Function, which represents the value of taking action (a) in state (s). The aim of reinforcement learning is to accurately learn this Q Function to gauge the reward associated with actions taken in various states.

[Figure: Visual representation of the Q Function in reinforcement learning]
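
Concretely, the Q Function for discretized states and actions can be stored as a two-dimensional table with one row per state bin and one column per action bin. A minimal sketch, reusing the assumed bin counts from above:

import numpy as np

n_state_bins, n_action_bins = 10, 5          # assumed bin counts, as above

# One row per discrete state, one column per discrete action.
# Q[s, a] is the estimated value of taking action a in state s.
Q = np.zeros((n_state_bins, n_action_bins))

s, a = 6, 1
print(Q[s, a])                               # current estimate for (s, a)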

Solving the Q Function

To solve for the Q Function, we start by initializing the Q table either with known state and action values or randomly. The next step involves selecting an action and observing the resulting reward. Based on this feedback, we update the Q table accordingly.

[Figure: Initializing the Q table for reinforcement learning]
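
Below is a minimal sketch of this loop. The placeholder environment and the epsilon-greedy action selection are illustrative assumptions; the text itself only specifies selecting an action, observing the reward, and updating the table.

import numpy as np

n_states, n_actions = 10, 5
Q = np.random.rand(n_states, n_actions)      # random initialization of the Q table
alpha = 0.1                                  # learning rate (discussed below)
epsilon = 0.1                                # exploration probability (assumed scheme)

def step(state, action):
    # Placeholder environment: returns (reward, next_state).
    # In practice these come from the system being controlled.
    return np.random.randn(), np.random.randint(n_states)

state = 0
for _ in range(1000):
    # Epsilon-greedy: mostly exploit the current table, occasionally explore.
    if np.random.rand() < epsilon:
        action = np.random.randint(n_actions)
    else:
        action = int(np.argmax(Q[state]))

    reward, next_state = step(state, action)

    # Move the table entry toward the observed reward (the update rule below).
    Q[state, action] += alpha * (reward - Q[state, action])
    state = next_state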

The Q Table is then modified based on the difference between the received reward and the previous Q-function value. If the reward (r) exceeds the old Q-function value, we increase the action's value for the new Q-function; otherwise, we decrease it.

[Figure: Updating the Q Table based on reward feedback]

The learning rate (α) is a crucial parameter that controls how strongly each new observation adjusts the Q-function value; it takes values between 0 and 1.

[Figure: Learning rate adjustment in Q learning]
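
Written out, the update described above corresponds to the standard rule

Q_new(s, a) = Q_old(s, a) + α · (r − Q_old(s, a))

With α close to 0 the table changes slowly and old estimates dominate; with α close to 1 each new reward largely overwrites the old estimate.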

Temporal Difference in Q Learning

The temporal difference method compares the immediate reward (r) with the old Q-function value derived from taking action (a) in state (s). A positive temporal difference indicates that the immediate reward is greater than previously estimated, suggesting that the value of taking action (a) in state (s) is underestimated. Consequently, we increase the value of that action in the Q table.
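
In code, the sign of the temporal difference directly determines the direction of the update (continuing the sketch above):

td = reward - Q[state, action]        # temporal difference
Q[state, action] += alpha * td        # positive td raises the entry, negative td lowers it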

Limitations of the Q Function

One significant limitation of the Q Function is its focus solely on predicting immediate rewards (r), neglecting potential future scenarios. To address this, it must be expanded to consider both immediate rewards and long-term outcomes, promoting a non-myopic policy.
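
The standard way to do this (not spelled out in the post) is the full Q-learning update, which adds the discounted value of the best action available in the next state s′, controlled by a discount factor γ between 0 and 1:

Q(s, a) = Q(s, a) + α · (r + γ · max_a′ Q(s′, a′) − Q(s, a))

With γ = 0 this reduces to the immediate-reward update above, while values of γ close to 1 make the policy increasingly far-sighted and non-myopic.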

Curious about learning more? Discover the extensive range of topics available on Medium and support writers like me for the price of a coffee!
