|
Algorithms for Reinforcement Learning, my new sleek book was published by Morgan & Claypool in July 2010.
Download the draft
(last update: Feb 14, 2012),
or download the original
from the publisher's webpage (if you have access).
|
|
Why this book?
There exist a good number of really great books on Reinforcement Learning. So why a new book? I have to confess: The book arose from selfish reasons: I wanted a short book, which nevertheless contained the major ideas underlying state-of-the-art RL algorithms, a discussion of their relative strengths and weaknesses, with hints on what is known (and not known, but would be good to know) about these algorithms. If I succeeded, time will tell. Or, you can, by sending me an e-mail!Abstract
Reinforcement learning is a learning paradigm concerned with learning to control a system so as to maximize a numerical performance measure that expresses a long-term objective. What distinguishes reinforcement learning from supervised learning is that only partial feedback is given to the learner about the learner's predictions. Further, the predictions may have long term effects through influencing the future state of the controlled system. Thus, time plays a special role. The goal in reinforcement learning is to develop efficient learning algorithms, as well as to understand the algorithms' merits and limitations. Reinforcement learning is of great interest because of the large number of practical applications that it can be used to address, ranging from problems in artificial intelligence to operations research or control engineering. In this book, we focus on those algorithms of reinforcement learning that build on the powerful theory of dynamic programming. We give a fairly comprehensive catalog of learning problems, describe the core ideas, note a large number of state of the art algorithms, followed by the discussion of their theoretical properties and limitations.Table of contents
| Preface | ix |
| Acknowledgments | xiii |
| 1 Markov Decision Processes | 1 |
| 1.1 Preliminaries | 1 |
| 1.2 Markov Decision Processes | 1 |
| 1.3 Value functions | 6 |
| 1.4 Dynamic programming algorithms for solving MDPs | 10 |
| 2 Value Prediction Problems | 11 |
| 2.1 Temporal difference learning in finite state spaces | 11 |
| 2.1.1 Tabular TD(0) | 11 |
| 2.1.2 Every-visit Monte-Carlo | 14 |
| 2.1.3 TD(lambda): Unifying Monte-Carlo and TD(0) | 16 |
| 2.2 Algorithms for large state spaces | 18 |
| 2.2.1 TD(lambda) with function approximation | 22 |
| 2.2.2 Gradient temporal difference learning | 25 |
| 2.2.3 Least-squares methods | 27 |
| 2.2.4 The choice of the function space | 33 |
| 3 Control | 37 |
| 3.1 A catalog of learning problems | 37 |
| 3.2 Closed-loop interactive learning | 38 |
| 3.2.1 Online learning in bandits | 38 |
| 3.2.2 Active learning in bandits | 40 |
| 3.2.3 Active learning in Markov Decision Processes | 41 |
| 3.2.4 Online learning in Markov Decision Processes | 42 |
| 3.3 Direct methods | 47 |
| 3.3.1 Q-learning in finite MDPs | 47 |
| 3.3.2 Q-learning with function approximation | 49 |
| 3.4 Actor-critic methods | 52 |
| 3.4.1 Implementing a critic | 54 |
| 3.4.2 Implementing an actor | 56 |
| 4 For Further Exploration | 63 |
| 4.1 Further reading | 63 |
| 4.2 Applications | 63 |
| 4.3 Software | 64 |
| A The Theory of Discounted Markovian Decision Processes | 65 |
| A.1 Contractions and Banach’s fixed-point theorem | 65 |
| A.2 Application to MDPs | 69 |
| Bibliography | 73 |
| Author's Biography | 89 |
List of algorithms
The following algorithms are explained in the book. For algorithms whose names are boldfaced a pseudocode is also given.
| Value iteration | p. 10 | Policy iteration | p. 10 | Tabular TD(0) | p. 12 |
| Every-visit Monte-Carlo | p. 15 | Tabular TD(lambda), accumulating traces | p. 17 | First-visit Monte-Carlo | p. 17 |
| Tabular TD(lambda), replacing traces | pp. 17-18 | TD(lambda) with function approximation | p. 22 | GTD2 | p. 26 |
| TDC | p. 27 | LSTD | p. 28 | RLSTD | pp. 28-29 |
| (R)LSTD(lambda) | pp. 29-30 | lambda-LSPE | pp. 30-31 | Boltzmann exploration | p. 38 |
| epsilon-greedy | p. 39 | UCB1 | pp. 39-40 | Action elimination | p. 41 |
| E^3 | p. 42 | UCRL2 | pp. 42-44 | R-max, MBIE, OI, delayed-Q, MorMax | p. 44 |
| Tabular Q-learning | p. 47 | Q-learning with function approximation | pp. 49-50 | State-aggregation based Q-learning | p. 50 |
| Soft-state aggregation based Q-learning | p. 50 | Fitted Q-iteration | pp. 51-52 | Generalized policy iteration | pp. 52-53 |
| SARSA | p. 54 | SARSA(lambda) | pp. 55-55 | LSTDQ(lambda) | pp. 55-56 |
| LSPI(lambda) | p. 57 | REINFORCE | p. 59 | Actor-critic with SARSA(1) | pp. 60-61 |
| Natural actor-critic | p. 57 |
Other unique features of the book
The book discusses the following function approximation methods: Linear function approximation, tensor product construction, kernel smoothing, averagers, state aggregation, tile coding, nearest-neighbor methods, nonparametric kernel smoothing, gaussian processes, and RKHS regression.In addition, it discusses the relative merits of "batch" (LS-type) and incremental (TD-type) algorithms, the influence of the choice of the function approximation method (can we overfit in reinforcement learning?), various theoretically well-founded online learning algorithms (ever wondered about what an efficient exploration method should do?), actor-critic algorithms, and more.
Some connections to other parts of the literature (outside of machine learning) are mentioned. This includes connection of LSTD (and related methods) to Z-estimation (from statistics), sample-average approximation methods (from operations research), or the connection of policy gradient algorithms to likelihood ratio methods (from simulation optimization). In general, the book has many pointers to the literature. I think that the books provides a very good starting point if someone wants to know the status of the theory related to some algorithm or idea. This being said, the book cites 207 works, many of which are quite recent.
The book has even some new results! These include:
- TD(lambda) with linear function approximation solves a model (previously, this was known for lambda=0 only)
- A new bound on the complexity of active learning in finite deterministic MDPs, which significantly improves a previous bound by Sebastian Thrun.

