#### Domain-Dependent Policy Learning using Neural Networks for Classical Planning (1/4)

I have finished my undergraduate Bachelor studies last summer and as a start to this blog I will outline the work I did for my dissertation titled “Domain-Dependent Policy Learning using Neural Networks in Classical Planning”. I will split this summary over four posts which will mostly be constructed of paragraphs of my thesis, summaries of such or parts of the kolloquium presentation I held at the group seminar of the Foundations of Artificial Intelligence (FAI) group at Saarland University.

TL;DR: I transferred and applied a neural network architecture called Action Schema Networks, designed for policy learning for probabilistic planning, to deterministic, classical planning and evaluated its performance.

# Introduction

Machine learning (ML) is a subfield of artificial intelligence (AI) which received tremendous media and research attention over the last years. While these terms are frequently used as if they were synonyms, such usage is misleading. ML specifically covers branches of AI involving some form of learning, mostly from large sets of data, while AI is more broad and general. It therefore aims to solve one of the main remaining challenges of computers, their limitation to obtain and rationally apply knowledge. This is arguably the main reason why humans are still superior to computers in many tasks despite their gradually increasing computational power.

The success of ML can already be seen in applications just as Alpha Go (Silver et al., 2016) and Alpha Go Zero (Silver et al., 2017), Google Deepmind’s agents, capable of beating human professionals at the Chinese board game Go. This achievement was assumed to still be decades from reality due to the game’s computational complexity constructed by its $\approx 10^{170}$ different board states (Müller, 2002; Tromp & Farnebäck, 2006). For comparisons, it is assumed that the observable universe has about $10^{80}$ atoms. In the core of these programs were neural networks, which are often referred to under the title of deep learning, combined with algorithms from or at least related to automated planning, another field of AI with the big goal of creating an intelligent agent capable of eﬃciently solving (almost) arbitrary problems. While this sounds like a vision far in the future, modern planning systems are already able to solve a wide variety of tasks, e.g. complex scheduling tasks.

However, it might be surprising that planning has seen little interaction with the ﬁeld of machine learning despite its rise in popularity. Only in recent past, these two ﬁelds were combined with mixed success. One of these combinational approaches was the work of Toyer et al. from the Australian National University who recently proposed a new neural network structure designed for application in probabilistic and classical automated planning, called Action Schema Networks (ASNets) (Toyer, 2017; Toyer, Trevizan, Thiébaux, & Xie, 2017). These are able to learn domain-speciﬁc knowledge in planning and apply it to unseen problems of the same domain. The promising structure was primarily introduced and evaluated with respect to probabilistic planning.

Therefore, the goal of my dissertation was to evaluate the possible performance of Action Schema Networks in classical planning. The main contribution of the thesis was the implementation of this novel neural network structure in the Fast-Downward planning system (Helmert, 2006) for application in deterministic, classical planning with necessary extensions to the framework and an extensive empirical evaluation was conducted to assess ASNets on multiple tasks of varying complexity. This evaluation considered diﬀerent conﬁgurations to in the end state whether Action Schema Networks are a suitable method for classical planning and if so under which conditions aiming towards the goal of learning complex relations occurring in planning tasks.

# Background

## Classical Automated Planning

Classical automated planning focuses on ﬁnite, deterministic, fully-observable problems solved by a single agent. The predominant formalisation for planning tasks is STRIPS (Fikes & Nilsson, 1971) representing such a task as $\Pi = (\mathcal{P}, \mathcal{A}, c, I, G)$:

• $\mathcal{P}$ is a set of propositions (or facts)

• $\mathcal{A}$ is a set of actions where each action $a \in \mathcal{A}$ is a triple $(pre_a , add_a , del_a)$ with $pre_a, add_a , del_a \subseteq \mathcal{P}$ including a’s preconditions, add list and delete list with $add_a \cap del_a = \emptyset$

• preconditions are facts, which have to be true for $a$ to be applicable

• add list contains all propositions becoming true after applying $a$

• delete list contains all propositions becoming false after applying $a$

• $c : A \rightarrow \mathbb{R}^+_0$ is the cost function assigning all actions to their cost

• $I \subseteq \mathcal{P}$ is the initial state containing all propositions, which are true at the start of the task

• $G \subseteq \mathcal{P}$ is the goal with all facts which have to become true to solve the task

### Brief Example

For example, one could describe a transportation task, as illustrated above, in which a truck has to deliver a package from some location to its destination by driving along streets and load or unload packages. In this concrete example, the truck has to drive from London (L) to Manchester (M), pick up the package, drive to Edinburgh (E) and unload the package there. The task would include the propositions $\mathcal{P} = \{at(o, x) \mid o \in \{t, p\},$ $x \in \{L, M, G, E\}\}$ and actions $\mathcal{A} = \{drive(x, y, z) \mid x \in \{truck\},$ $y, z \in \{L, M, G, E\},$ $y \text{ and } z \text{ are connected}\}$ $\cup$ $\{load(x, y, z), unload(x, y, z) \mid x \in \{truck\},$ $y \in \{package\}, z \in \{L, M, G, E\}\}$. The goal could be formalised as $\{at(package, E)\}$ and the initial state describes the starting position of the truck and package as $\{at(truck, L), at(package, M)\}$.

To solve any task $\Pi$ in automated planning, the planner has to observe the current state and choose actions, one at a time, in order to reach a goal state $s^*$ with $G \subseteq s^∗$. The sequence of actions, applied to get to such a state, is called plan for $\Pi$. A plan is considered optimal if it has the least cost out of all plans reaching a goal. E.g., the optimal plan for our transport task would be $\langle drive(truck, L, M), load(truck, package, M), drive(truck, M, E), unload(truck, package, E) \rangle$.

### Modelling - PDDL

One essential component of planning is to model the task at hand. This process is usually split into two components: the domain and problem. This separation has its origin in the main modelling language for planning PDDL (Planning Domain Deﬁnition Language) introduced by McDermott et al. (McDermott & others, 1998). A domain describes a whole family of various problems sharing the core idea. It contains predicates deﬁned on abstract objects as well as action schemas. Problem instances are always assigned a domain which predeﬁnes mentioned elements. In the problem ﬁle concrete objects are deﬁned instantiating the predicates and action schemas of the domain to propositions and actions respectively. Furthermore the initial and goal states are speciﬁed.

Most of these planning problems seem conceptually easy for rational-thinking humans, but this impression can be misleading. In fact, planning is computationally extremely difficult. Merely deciding whether a task is solvable is already PSPACE-complete (Bylander, 1994).

## Deep Learning

### Foundation

The idea of neural networks (NNs) has a long history reaching back to the 1940s (McCulloch & Pitts, 1986) inspired by the human brain whose immensely impressive capabilities are partly due to the dense connectivity of neurons. With the introduction of the perceptron, which was capable of learning, by F. Rosenblatt in 1958 (Rosenblatt, 1958) and backpropagation by Rumelhart et al. in 1986 (Rumelhart, Hinton, & Williams, 1988) the foundation for modern NNs was built.

The simplest, modern NN architecture is the fully-connected feedforward network or multi-layer perceptron (MLP) as illustrated below. As (almost) all neural network architectures, it consists of various layers containing nodes or units which are successively connected with each other. One usually speaks of input layer (yellow), hidden layers (blue) and an output layer (green). Each connection of two nodes is weighted with a parameter.

A node is usually computing a fairly simple mathematical operation in which it applies the weights to each input of the previous layer, adds a bias vector and finally applies a nonlinear function $f$, often called activation function. The output of the $l$-th layer $h^l$ can therefore be computed as follows:

The weights and bias vectors are collectively stored and form the “learned intelligence” of these networks gradually optimised with respect to some goal. The objective is usually represented by a loss function $L(\hat{y}, y)$ depending on the true output $y$ and the prediction $\hat{y}$ computed by the network for a given input $x$. This form of training is called supervised learning and depends on labelled training data including inputs as well as their expected output. The optimisation is usually achieved by gradient-descent in which all parameters, annotated as $\theta$, are updated in the direction of the steepest descent of the loss function $L$ for some step-size $\alpha$, called learning rate:

### Convolutional Neural Networks

Over the last few years, many different architectures of such networks evolved with specific applications. Recurrent neural networks (RNNs) are capable of considering past knowledge and decisions and therefore incorporate something similar to a memory. This property turned out especially valuable whenever processing language reaching state-of-the-art in tasks as machine translation (Bahdanau, Cho, & Bengio, 2014; Cho et al., 2014) and speech recognition (Graves, Mohamed, & Hinton, 2013).

Similarly, convolutional neural networks (CNNs) became the standard for many multi-dimensional, mostly visual, input tasks reaching unseen accuracy in e.g. image classification (Krizhevsky, Sutskever, & Hinton, 2012). The main characteristic of CNNs is the application of the mathematical convolution operation. This linear operation replaces the typical matrix multiplication known from MLPs where every unit in each layer has a weighted connection to every node in the successive layer. In convolution, smaller weight matrices, called filters or kernels, are applied by “sliding” the ﬁlters over the units in one layer with each applying its operation to a set of neighboured inputs. This form of processing brings multiple advantages.

Due to the usually smaller size of filters, CNNs have sparse connectivity, only combining neighboured units in one operation. This makes use of local properties to extract input features like edges in visual domains, which is especially meaningful in deep CNNs. While shallow layers detect e.g. edges or shapes of an input image, filters in deeper layers could work upon these features and detect increasingly abstract objects like cars and humans.

Source: https://www.mathworks.com/discovery/convolutional-neural-network.html (03.12.2018)

Additionally, filters are reused repeatedly during the “sliding”, applying their operation to (partially) different input units. This form of weight sharing allows to significantly reduce the amount of parameters needed. Hence, the memory requirements of the networks are lowered, making them more efficient than fully-connected NNs because less parameters have to be learned, so needed training time and data can be reduced by the approach.

In the next post, I will outline ASNets as a network structure before explaining my adjustments for application in classical planning as well as the results of the evaluation in the third and forth part.

### References

1. Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., van den Driessche, G., … Hassabis, D. (2016). Mastering the Game of Go with Deep Neural Networks and Tree Search. Nature, 529(7587), 484–489. https://doi.org/10.1038/nature16961
2. Silver, D., Hubert, T., Schrittwieser, J., Antonoglou, I., Lai, M., Guez, A., … Hassabis, D. (2017). Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm. ArXiv e-Prints.
3. Müller, M. (2002). Computer go. Artificial Intelligence, 134(1-2), 145–179.
4. Tromp, J., & Farnebäck, G. (2006). Combinatorics of go. In International Conference on Computers and Games (pp. 84–99). Springer.
5. Toyer, S. (2017). Generalised Policies for Probabilistic Planning with Deep Learning (Research and Development, Honours thesis). Research School of Computer Science, Australian National University.
6. Toyer, S., Trevizan, F. W., Thiébaux, S., & Xie, L. (2017). Action Schema Networks: Generalised Policies with Deep Learning. CoRR, abs/1709.04271. Retrieved from http://arxiv.org/abs/1709.04271
7. Helmert, M. (2006). The Fast Downward Planning System. J. Artif. Int. Res., 26(1), 191–246. Retrieved from http://dl.acm.org/citation.cfm?id=1622559.1622565
8. Fikes, R. E., & Nilsson, N. (1971). STRIPS: A New Approach to the Application of Theorem Proving to Problem Solving. Ai, 2, 189–208.
9. McDermott, D., & others. (1998). The PDDL Planning Domain Definition Language. The AIPS-98 Planning Competition Comitee.
10. Bylander, T. (1994). The Computational Complexity of Propositional STRIPS Planning. Ai, 69(1–2), 165–204.
11. McCulloch, W., & Pitts, W. (1986). A Logical Calculus of the Ideas Immanent in Nervous Activity. Brain Theory, 229–230.
12. Rosenblatt, F. (1958). The perceptron: A probabilistic model for information storage and organization in the brain. Psychological Review, 64(6), 386–408.
13. Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1988). Neurocomputing: Foundations of Research. In J. A. Anderson & E. Rosenfeld (Eds.) (pp. 696–699). Cambridge, MA, USA: MIT Press. Retrieved from http://dl.acm.org/citation.cfm?id=65669.104451
14. Bahdanau, D., Cho, K., & Bengio, Y. (2014). Neural machine translation by jointly learning to align and translate. ArXiv Preprint ArXiv:1409.0473.
15. Cho, K., Van Merriënboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H., & Bengio, Y. (2014). Learning phrase representations using RNN encoder-decoder for statistical machine translation. ArXiv Preprint ArXiv:1406.1078.
16. Graves, A., Mohamed, A.-rahman, & Hinton, G. (2013). Speech recognition with deep recurrent neural networks. In Acoustics, speech and signal processing (icassp), 2013 ieee international conference on (pp. 6645–6649). IEEE.
17. Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet Classification with Deep Convolutional Neural Networks. In F. Pereira, C. J. C. Burges, L. Bottou, & K. Q. Weinberger (Eds.), Advances in Neural Information Processing Systems 25 (pp. 1097–1105). Curran Associates, Inc. Retrieved from http://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf