Skip to main navigation Skip to search Skip to main content

Reinforcement Learning for Cost-Aware Markov Decision Processes

  • Wesley A. Suttle
  • , Kaiqing Zhang
  • , Zhuoran Yang
  • , David N. Kraemer
  • , Ji Liu

Research output: Chapter in Book/Report/Conference proceedingConference contributionpeer-review

6 Scopus citations

Abstract

Ratio maximization has applications in areas as diverse as finance, reward shaping for reinforcement learning (RL), and the development of safe artificial intelligence, yet there has been very little exploration of RL algorithms for ratio maximization. This paper addresses this deficiency by introducing two new, model-free RL algorithms for solving cost-aware Markov decision processes, where the goal is to maximize the ratio of long-run average reward to long-run average cost. The first algorithm is a two-timescale scheme based on relative value iteration (RVI) Q-learning and the second is an actor-critic scheme. The paper proves almost sure convergence of the former to the globally optimal solution in the tabular case and almost sure convergence of the latter under linear function approximation for the critic. Unlike previous methods, the two algorithms provably converge for general reward and cost functions under suitable conditions. The paper also provides empirical results demonstrating promising performance and lending strong support to the theoretical results.

Original languageEnglish
Title of host publicationProceedings of the 38th International Conference on Machine Learning, ICML 2021
PublisherML Research Press
Pages9989-9999
Number of pages11
ISBN (Electronic)9781713845065
StatePublished - 2021
Event38th International Conference on Machine Learning, ICML 2021 - Virtual, Online
Duration: Jul 18 2021Jul 24 2021

Publication series

NameProceedings of Machine Learning Research
Volume139

Conference

Conference38th International Conference on Machine Learning, ICML 2021
CityVirtual, Online
Period07/18/2107/24/21

Fingerprint

Dive into the research topics of 'Reinforcement Learning for Cost-Aware Markov Decision Processes'. Together they form a unique fingerprint.

Cite this