Abstract
Greedy-GQ is an off-policy, two-timescale algorithm for optimal control in reinforcement learning [18]. This paper develops the first finite-sample analysis of the Greedy-GQ algorithm with linear function approximation under Markovian noise. Our finite-sample analysis provides theoretical justification for choosing the step-sizes of this two-timescale algorithm for faster convergence in practice, and suggests a trade-off between the convergence rate and the quality of the obtained policy. Our paper extends the finite-sample analyses of two-timescale reinforcement learning algorithms from policy evaluation to optimal control, which is of more practical interest. Specifically, in contrast to existing finite-sample analyses of two-timescale methods such as GTD, GTD2 and TDC, whose objective functions are convex, the objective function of the Greedy-GQ algorithm is non-convex. Moreover, the Greedy-GQ algorithm is not a linear two-timescale stochastic approximation algorithm. Our techniques provide a general framework for the finite-sample analysis of non-convex value-based reinforcement learning algorithms for optimal control.
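For readers unfamiliar with the algorithm, the following is a minimal sketch of a single Greedy-GQ update as it is commonly stated for linear function approximation. It is an illustration under stated assumptions, not the exact variant analyzed in the paper: it assumes a hard-greedy (argmax) target policy, and the function and parameter names (`greedy_gq_step`, `theta`, `w`, `alpha`, `beta`) are hypothetical. The two step-sizes reflect the two timescales: `theta` is updated with the smaller step-size `alpha`, while the auxiliary weights `w` use the larger step-size `beta`.

```python
def dot(u, v):
    """Inner product of two equal-length lists of floats."""
    return sum(a * b for a, b in zip(u, v))


def greedy_gq_step(theta, w, phi, reward, next_phis, alpha, beta, gamma):
    """One Greedy-GQ update (sketch, hard-greedy target policy assumed).

    theta     : main parameter vector (slow timescale, step-size alpha)
    w         : auxiliary weight vector (fast timescale, step-size beta)
    phi       : features of the (state, action) pair just taken
    next_phis : features of every action available in the next state
    """
    # Features of the greedy action in the next state under the current theta.
    phi_bar = max(next_phis, key=lambda f: dot(theta, f))
    # TD error with the greedy bootstrap target.
    delta = reward + gamma * dot(theta, phi_bar) - dot(theta, phi)
    w_phi = dot(w, phi)
    # Slow timescale: TD step plus the gradient-correction term.
    new_theta = [
        t + alpha * (delta * p - gamma * w_phi * pb)
        for t, p, pb in zip(theta, phi, phi_bar)
    ]
    # Fast timescale: auxiliary weights track the TD error along phi.
    new_w = [wi + beta * (delta - w_phi) * p for wi, p in zip(w, phi)]
    return new_theta, new_w
```

Because the greedy features `phi_bar` depend on `theta`, the update is not linear in the parameters, which is the source of the non-linearity and non-convexity mentioned in the abstract.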
| Original language | English |
|---|---|
| Pages (from-to) | 11-20 |
| Number of pages | 10 |
| Journal | Proceedings of Machine Learning Research |
| Volume | 124 |
| State | Published - 2020 |
| Event | 36th Conference on Uncertainty in Artificial Intelligence, UAI 2020 (Virtual, Online), Aug 3 2020 → Aug 6 2020 |
| Title | Finite-sample Analysis of Greedy-GQ with Linear Function Approximation under Markovian Noise |