Notes on finite MDPs, Bellman equations and policy improvement à la Sutton/Barto

  • The book 'Reinforcement Learning: An Introduction' by Sutton and Barto is the standard textbook for introductory courses on reinforcement learning. Alongside concrete algorithms and extensive examples, the book contains several fundamental results on Markov decision processes (MDPs) and Bellman equations in Chapters 3 and 4. Unfortunately, some proofs are missing, some theorems lack precise formulation, and for some results the line of argument is quite garbled. In this note we provide all missing proofs, give precise formulations of the theorems, and untangle the lines of argument. Further, we avoid using random variables and their expected values: since we (like Sutton/Barto) restrict our attention to finite MDPs, all expected values can be written out explicitly (as sketched below), avoiding overloaded notation and murky conclusions. This article bridges the gap between introductory literature like Sutton/Barto and the research literature, which contains exact formulations and proofs of the relevant results but is less accessible to beginners due to its greater generality and complexity.
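A minimal sketch of what "making expected values explicit" means, using the standard Sutton/Barto notation (the note itself may use a different formulation): for a finite MDP with state set S, action sets A(s), reward set R, dynamics p(s', r | s, a), and discount factor gamma, the Bellman equation for the state-value function of a policy pi can be written entirely as finite sums,

    v_\pi(s) = \sum_{a \in \mathcal{A}(s)} \pi(a \mid s)
               \sum_{s' \in \mathcal{S}} \sum_{r \in \mathcal{R}}
               p(s', r \mid s, a) \, \bigl[ r + \gamma \, v_\pi(s') \bigr]
               \qquad \text{for all } s \in \mathcal{S}.

Because all the sets involved are finite, no expectation operator is needed; every expected value unrolls into such a finite sum, which is the simplification the abstract refers to.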

Metadata

Author: Jens Flemming
DOI: https://doi.org/10.34806/0r07-dt19
Document Type: Report
Language: English
Date of Publication (online): 2024/02/03
Year of First Publication: 2024
Publishing Institution: Westsächsische Hochschule Zwickau
Tags: Bellman equations; Markov decision processes; dynamic programming; policy improvement theorem; reinforcement learning
Page Count: 15
Faculty: Westsächsische Hochschule Zwickau / Physikalische Technik, Informatik
Dewey Decimal Classification: 0 Computer science, information & general works / 00 Computer science, knowledge & systems / 004 Data processing; computer science
DINI-Set: open_access
Release Date: 2024/02/13
Licence: Creative Commons Attribution-ShareAlike (German: Namensnennung-Weitergabe unter gleichen Bedingungen)