Distinguished lecture: Dimitri Bertsekas, Arizona State University
Date and time: 17 November 2021
17:00-18:30 Stockholm CET (UTC +1)
11:00-12:30 New York EST (UTC -5)
09:00-10:30 Tempe MST (UTC -7)
Speaker: Dimitri Bertsekas, MIT and Arizona State University
Title: Lessons from AlphaZero for Optimal, Model Predictive, and Adaptive Control
Location: Zoom https://kth-se.zoom.us/j/69560887455
Meeting ID: 695 6088 7455
Watch the recorded presentation:
Abstract: Some of the most exciting success stories in reinforcement learning have been in the area of games. Primary examples are the recent AlphaZero program (which plays chess), and the similarly structured and earlier (1990s) TD-Gammon program (which plays backgammon). These programs were trained off-line extensively using sophisticated approximate policy iteration algorithms and neural networks. Yet the AlphaZero player that has been obtained off-line is not used directly during on-line play (it is too inaccurate due to approximation errors that are inherent in off-line neural network training). Instead a separate on-line player is used, which is based on multistep lookahead and a terminal position evaluator that was trained using experience with the off-line player. The on-line player performs a form of policy improvement, which unlike the off-line player, is not degraded by neural network approximations. As a result, it greatly improves the performance of the off-line player.
Similarly, TD-Gammon performs on-line a policy improvement step using one-step or two-step lookahead minimization, which is not degraded by neural network approximations. To this end it uses an off-line neural-network trained terminal position evaluator, and importantly it also extends its on-line lookahead by rollout (simulation with the one-step lookahead player that uses the position evaluator).
A major lesson from AlphaZero and TD-Gammon is that performance of an off-line trained controller can be greatly improved by on-line approximation in value space, with long lookahead (involving minimization or rollout with an off-line obtained policy, or both), and terminal cost approximation that is obtained off-line. The performance enhancement is often dramatic and is due to a simple fact, which is the focal point of this work:
On-line approximation in value space with one-step lookahead amounts to a step of Newton’s method for solving Bellman’s equation, while the starting point for the Newton step is based on the results of off-line training, and is enhanced by longer on-line lookahead minimization and on-line rollout.
Indeed the major determinant of the quality of the controller thus obtained is the Newton step that is performed on-line, while off-line training is a secondary contributor by comparison. This conceptual view of RL can be understood intuitively in terms of abstract models of infinite horizon DP and simple geometrical constructions. It manifests itself in model predictive control, and it helps to clarify the all-important stability issues within that context. Moreover on-line approximation in value space works well with changing problem parameters and on-line replanning, similar to indirect adaptive control. Here the Bellman equation is perturbed due to the parameter changes, but approximation in value space still operates as a Newton step.
In this work we aim to provide insights (often based on visualization), which explain the beneficial effects of on-line decision making on top of off-line training. One of our principal aims is to show through the unifying principles of abstract DP and the algorithmic ideas of Newton’s method that the AlphaZero/TD-Gammon ideas apply very broadly to deterministic and stochastic optimal control problems, involving both discrete and continuous search spaces. Moreover, these ideas can be effectively integrated with other important methodologies such as model predictive control, adaptive control, decentralized control, discrete and Bayesian optimization, neural network-based value and policy approximations, and heuristic algorithms for discrete optimization.
This talk is based on a research monograph to be published sometime in 2022. An early draft of the monograph can be found on-line at http://web.mit.edu/dimitrib/www/abstractdp_MIT.html
Bio: Dimitri Bertsekas’ undergraduate studies were in engineering at the National Technical University of Athens, Greece. He obtained his MS in electrical engineering at the George Washington University, Wash. DC in 1969, and his Ph.D. in system science in 1971 at the Massachusetts Institute of Technology (M.I.T.).
Dr. Bertsekas has held faculty positions with the Engineering-Economic Systems Dept., Stanford University (1971-1974) and the Electrical Engineering Dept. of the University of Illinois, Urbana (1974-1979). From 1979 to 2019 he was with the Electrical Engineering and Computer Science Department of M.I.T., where he served as McAfee Professor of Engineering. In 2019, he was appointed Fulton Professor of Computational Decision Making, and a full time faculty member at the School of Computing and Augmented Intelligence at Arizona State University (ASU), Tempe. His research spans several fields, including optimization, control, and large-scale computation, and is closely tied to his teaching and book authoring activities. He has written numerous research papers, and eighteen books and research monographs, several of which are used as textbooks in M.I.T. classes.
Professor Bertsekas was awarded the INFORMS 1997 Prize for Research Excellence in the Interface Between Operations Research and Computer Science for his book “Neuro-Dynamic Programming”, the 2001 ACC John R. Ragazzini Education Award, the 2009 INFORMS Expository Writing Award, the 2014 ACC Richard E. Bellman Control Heritage Award for “contributions to the foundations of deterministic and stochastic optimization-based methods in systems and control,” the 2014 Khachiyan Prize for Life-Time Accomplishments in Optimization, the SIAM/MOS 2015 George B. Dantzig Prize, and the 2022 IEEE Control Systems Award. Together with his coauthor John Tsitsiklis, he was awarded the 2018 INFORMS John von Neumann Theory Prize, for the contributions of the research monographs “Parallel and Distributed Computation” and “Neuro-Dynamic Programming”. In 2001, he was elected to the United States National Academy of Engineering for “pioneering contributions to fundamental research, practice and education of optimization/control theory, and especially its application to data communication networks.”
Dr. Bertsekas’ recent books are “Introduction to Probability: 2nd Edition” (2008), “Convex Optimization Theory” (2009), “Dynamic Programming and Optimal Control,” Vol. I, (2017), and Vol. II: (2012), “Convex Optimization Algorithms” (2015), “Abstract Dynamic Programming” (2018), “Reinforcement Learning and Optimal Control” (2019), and “Rollout, Policy Iteration, and Distributed Reinforcement Learning” (2020), all published by Athena Scientific.