Title: Robust batch policy learning for indefinite-horizon Markov decision processes
Authors: Zhengling Qi - The George Washington University (United States) [presenting]
Abstract: The indefinite-horizon Markov decision process (MDP) is considered, where each policy is evaluated by a set of average rewards computed over different horizon lengths and different reference distributions. Given pre-collected data generated by some behavior policy, our goal is to learn a robust policy in a pre-specified policy class that approximately maximizes the smallest value in this set. Leveraging semi-parametric statistics, we develop an efficient policy learning method for estimating this robust optimal policy. A rate-optimal regret bound, up to a logarithmic factor, is established in terms of the number of trajectories and the number of decision points. Our regret guarantee subsumes the long-term average reward MDP setting as a special case and can be extended to the discounted infinite-horizon setting.
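The max-min criterion described above can be illustrated with a minimal sketch. This is not the authors' method: the array of value estimates is a placeholder for what, in the actual approach, would be semi-parametrically efficient off-policy value estimates computed from batch data; the policy count and horizon lengths are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy setup: a handful of candidate policies, each
# evaluated over several horizon lengths (a stand-in for the set of
# average rewards with different horizons / reference distributions).
n_policies = 4
horizons = [10, 50, 200]  # illustrative horizon lengths only

# Placeholder value estimates: rows = policies, cols = horizons.
# The real method would estimate these from pre-collected trajectories
# generated by a behavior policy.
value_estimates = rng.uniform(0.0, 1.0, size=(n_policies, len(horizons)))

# Robust (max-min) selection: for each policy take the smallest value
# in its evaluation set, then pick the policy maximizing that minimum.
worst_case = value_estimates.min(axis=1)
robust_policy = int(np.argmax(worst_case))
```

The selected `robust_policy` is the one whose worst-case value across the evaluation set is largest, which is the quantity the regret bound in the abstract controls.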