Policy Search Based Relational Reinforcement Learning using the Cross-Entropy Method
Main text, 2.482Mb
Supplementary material, 60.51Mb
Supplementary material, 38.10Mb
Supplementary material, 129.4Mb
Supplementary material, 318.4Kb
Sarjant, S. (2013). Policy Search Based Relational Reinforcement Learning using the Cross-Entropy Method (Thesis, Doctor of Philosophy (PhD)). University of Waikato, Hamilton, New Zealand. Retrieved from https://hdl.handle.net/10289/7671
Permanent Research Commons link: https://hdl.handle.net/10289/7671
Relational Reinforcement Learning (RRL) is a subfield of machine learning in which a learning agent seeks to maximise a numerical reward within an environment, represented as collections of objects and relations, by performing actions that interact with the environment. The relational representation allows more dynamic environment states than an attribute-based representation of reinforcement learning, but this flexibility also creates new problems such as a potentially infinite number of states. This thesis describes an RRL algorithm named Cerrla that creates policies directly from a set of learned relational “condition-action” rules using the Cross-Entropy Method (CEM) to control policy creation. The CEM assigns each rule a sampling probability and gradually modifies these probabilities such that the randomly sampled policies consist of ‘better’ rules, resulting in larger rewards received. Rule creation is guided by an inferred partial model of the environment that defines: the minimal conditions needed to take an action, the possible specialisation conditions per rule, and a set of simplification rules to remove redundant and illegal rule conditions, resulting in compact, efficient, and comprehensible policies. Cerrla is evaluated on four separate environments, where each environment has several different goals. Results show that compared to existing RRL algorithms, Cerrla is able to learn equal or better behaviour in less time on the standard RRL environment. On other larger, more complex environments, it can learn behaviour that is competitive to specialised approaches. The simplified rules and CEM’s bias towards compact policies result in comprehensive and effective relational policies created in a relatively short amount of time.
University of Waikato
All items in Research Commons are provided for private study and research purposes and are protected by copyright with all rights reserved unless otherwise indicated.
- Higher Degree Theses