all posts

Deep Reinforcement Learning for Continuous Control

Simon Ramstedt, 2016-04-30 · Bachelor thesis, TU Darmstadt, Department of Computer Science, Intelligent Autonomous Systems group. Referees: Prof. Dr. Gerhard Neumann, Prof. Dr. Jan Peters, MSc. Simone Parisi.

Abstract

Reinforcement learning is a mathematical framework for agents to interact intelligently with their environment. In this field, real-world control problems are particularly challenging because of the noise and the high-dimensionality of input data (e.g., visual input). In the last few years, deep neural networks have been successfully used to extract meaning from such data. Building on these advances, deep reinforcement learning achieved stunning results in the field of artificial intelligence, being able to solve complex problems like Atari games [1] and Go [2]. However, in order to apply the same methods to real-world control problems, deep reinforcement learning has to be able to deal with continuous action spaces. In this thesis, Deep Deterministic Policy Gradients, a deep reinforcement learning method for continuous control, has been implemented, evaluated and put into context to serve as a basis for further research in the field.

German abstract

Reinforcement-Learning ist ein mathematischer Rahmen, um intelligent mit ihrer Umgebung interagierende Agenten zu erzeugen. Regelungsprobleme in realen Umgebungen sind dabei wegen stark verrauschten, hochdimensionalen Eingabedaten (z.B. Video) besonders anspruchsvoll. In den letzten Jahren wurden dafür jedoch erfolgreich neuronale Netze benutzt. Deep-Reinforcement-Learning (Reinforcement-Learning mit neuronalen Netzen) hatte bereits große Erfolge in der künstlichen Intelligenz und war in der Lage komplexe Probleme wie Go [2] oder Atari-Spiele [1] zu lösen. Um diese Methoden aber in der echten Welt anwenden zu können, muss Deep-Reinforcement-Learning mit kontinuierlichen Handlungsräumen umgehen können. Deshalb wurde in dieser Thesis Deep Deterministic Policy Gradients, eine Deep-Reinforcement-Learning-Methode für kontinuierliche Regelungen implementiert, evaluiert und in Bezug zu anderen Methoden gesetzt.

Acknowledgments

I thank my supervisors Simone Parisi and Gerhard Neumann for their patience and helpful discussions, with special thanks to Simone for his support during time-critical periods.

1 Introduction

Reinforcement learning is a mathematical framework for agents to interact intelligently with their environment. Unlike supervised learning, where a system learns with the help of labeled data, reinforcement learning agents learn how to act by trial and error only receiving a reward signal from their environments. A field where reinforcement learning has been prominently successful is robotics [3]. However, real-world control problems are also particularly challenging because of the noise and high-dimensionality of input data (e.g., visual input). In recent years, in the field of supervised learning, deep neural networks have been successfully used to extract meaning from this kind of data. Building on these advances, deep reinforcement learning was used to solve complex problems like Atari games and Go. Mnih et al. [1] built a system with fixed hyperparameters able to learn to play 49 different Atari games only from raw pixel inputs. However, in order to apply the same methods to real-world control problems, deep reinforcement learning has to be able to deal with continuous action spaces. Discretizing continuous action spaces would scale poorly, since the number of discrete actions grows exponentially with the dimensionality of the action. Furthermore, having a parametrized policy can be advantageous because it can generalize in the action space. Therefore with this thesis we study a state-of-the-art deep reinforcement learning algorithm, Deep Deterministic Policy Gradients. We provide a theoretical comparison to other popular methods, an evaluation of its performance, identify its limitations and investigate future directions of research.

The remainder of the thesis is organized as follows. We start by introducing the field of interest, machine learning, focusing our attention on deep learning and reinforcement learning. We continue by describing in detail the two main algorithms, core of this study, namely Deep Q-Network (DQN) and Deep Deterministic Policy Gradients (DDPG). We then provide implementatory details of DDPG and our test environment, followed by a description of benchmark test cases. Finally, we discuss the results of our evaluation, identifying limitations of the current approach and proposing future avenues of research.

2 Foundations

Machine learning is an approach to design and optimize information processing systems by directly using data. As typically in real-world problems the training data is limited and does not cover all possible scenarios, the system has to learn how to behave also in the presence of data which it has not been trained on. In this regard, overfitting is one of the biggest challenges in machine learning and consists in having a system that strictly adapts its behavior to the training data, without being able to generalize to different unseen input data.

Machine learning is usually divided into supervised learning, reinforcement learning and unsupervised learning. Supervised learning systems learn input-output mappings from a dataset of desired input-output pairs, i.e. they are explicitly told how to behave. As we will see in the next section, most of deep learning and neural network research so far has been done in the supervised setting. Reinforcement learning systems, on the contrary, learn how to behave by receiving feedback from the environment encoding a specific goal. For example, a trash-collector robot would receive a reward for collecting trash or would be punished for hitting a wall. It is common for reinforcement learning to exploit supervised learning techniques. Unsupervised learning, on the other hand, is about finding patterns in data and will not be discussed further in this thesis. In the next sections we will describe in more detail one of the most prominent supervised learning techniques, namely deep learning and neural networks.

2.1 Deep Learning and Neural Networks

Deep learning is an area of machine learning concerned with deep neural networks. Neural networks are non-linear parametric functions loosely inspired by the human brain. They are composed of layers of units called neurons, as shown in Figure 2.1.

Neuron
Figure 2.1: A canonical artificial neuron. The sum of the inputs xix_i weighted by wiw_i and the bias bb is fed into the activation function ff, producing the neuron output y=f(iwixi+b)y = f(\sum_i{w_{i} x_i + b}). The weights wiw_i represent a pattern in the neuron's input space. The closer the input is to that pattern, the higher the output of the neuron will be.
ReLU
(a) Rectified Linear Unit (ReLU): f(x)=max(0,x)f(x) = \max(0,x).
Tanh
(b) Hyperbolic Tangent (tanh): f(x)=tanh(x)f(x) = \tanh(x).
Figure 2.2: Examples of activation functions ff. Usually, the function is monotonically increasing and saturates for low / negative inputs.

Multiple parallel neurons form a layer, each characterized by outputs called activations yj=f(iwi,jxi)y_j = f(\sum_i{w_{i,j} x_i}). Single layer (shallow) neural networks called perceptron have been first described by Rosenblatt in 1958 [4]. However in deep neural networks multiple layers are stacked on top of each other. Krizhevsky et al. [5] achieved state-of-the-art computer vision results with an 8-layer, 500,000 neurons and 60-million parameter neural network. Since the outputs of a layer are non-linear features of the input, the more layers are in the network, the more non-linear and abstract those features are compared to the original input. This hierarchy of features enables neural networks to approximate complex functions.

Neural networks are usually initialized randomly and then trained on a dataset with regard to a cost function. Typically, cost functions are distance measures between the desired outputs (targets tit_i) and the actual network outputs (yiy_i). A common and simple cost function is the mean squared error 1ni=1n(yiti)2\frac{1}{n} \sum_{i=1}^n (y_i - t_i)^2. Training a neural network consists in optimizing the network parameters θ=(w1,b1,,wn,bn)\theta = (w_1, b_1, \cdots, w_n, b_n) to minimize the cost on the training dataset. However, training a neural network can be challenging. First, as the parameter space can be non-convex, neural networks are usually optimized with techniques guaranteeing convergence to a local optimum (e.g., gradient descent). Second, the number of parameters grows with the complexity of the network. As deep, complex networks are typically required to learn difficult functions, the training can be highly demanding, both in terms of computational time and data. Nevertheless, in practice neural networks can be optimized quite reliably by gradient descent, as we will see in the next section.

Deep Neural Network
Figure 2.3: Multi-layer (deep) neural network. Neurons in higher (leftmost) layers represent more abstract features (top) of the network input.

2.1.1 Stochastic Gradient Descent

Using the gradient for optimization (i.e., first order optimization) is common to many machine learning algorithms with high number of parameters, since zeroth order methods (e.g., genetic algorithms) need several function evaluations and higher nn-order methods are too expensive per evaluation. Therefore, gradient descent is a key element of deep learning. It consists in following the direction pointed by the gradient of the cost function with respect to the parameters of the system. In the case of neural networks, the gradient of the cost function with respect to the network parameters θ\theta tells us how to change the parameters in order to reduce the cost. More specifically, given the gradient of the cost with respect to the parameters dCdθ\frac{dC}{d\theta}, we can update the parameters θnewθoldα dCdθ\theta_\text{new} \leftarrow \theta_\text{old} - \alpha \ \frac{dC}{d\theta}, where α\alpha is a learning rate. However, computing the cost and the gradient on large datasets is expensive. Stochastic gradient descent alleviates this issue by only computing the cost and the gradient for a small subset (minibatch) of the dataset. Like gradient descent, stochastic gradient descent is guaranteed to converge to a local minimum.

In high-dimensional optimization spaces like those of neural networks, optimization often gets stuck near saddle points or valleys where gradients are only high in directions orthogonal to the valley in which no progress can be made. This issue can be alleviated by having adaptive learning rates for each parameter. A recent version of stochastic gradient descent using these techniques and used in this thesis is ADAM [6]. Another interesting property of ADAM is that its stepsize is independent of the scale of the gradients.

To compute the gradient in a neural network, we can use automatic differentiation, a technique to compute derivatives in computational graphs in a modular way. It is based on the chain rule, which says that the derivative of a composition y=f(g(x))y = f(g(x)) of two functions g:xwg: x \to w and f:wyf: w \to y can be decomposed into dydx=dydwdwdx=g(w)f(x)\frac{dy}{dx} = \frac{dy}{dw} \frac{dw}{dx} = g'(w) \cdot f'(x). In a computational graph, we can compute the derivative of any edge yy with respect to any other edge xx by first computing the derivative of the last node g(w)g'(w) and then compute the derivative of all the previous nodes f(x)f'(x) by decomposing them in the same way. Derivatives are then passed backwards through the computational graph. An example is depicted in Figure 2.4.

Automatic Differentiation
Figure 2.4: Learning iteration in a computational graph. First, the network output and cost CC are computed (left). Subsequently, a backward pass is executed to compute derivatives (right).

2.1.2 Normalization

Neural network training is highly sensitive to the mean and variance of the data. They affect the cost, gradients, activations and the operation region of the activation functions. This issue makes it hard to select good learning rates, parameter initializations and to set other hyperparameters. Also, scale differences between dimensions within a dataset result in skewing the cost surfaces (Figure 2.5a), thus resulting in a wrong direction of the gradients. Normalizing the data to zero mean and unit variance alleviates these issues (Figure 2.5b). Furthermore, normalization is not only important for the network inputs and targets but also for the activations inside the network. During training, each layer has to constantly adapt to its changing input distribution caused by the optimization of the previous layers. For alleviating this issue, Ioffe and Szegedy [7] proposed batch normalization, a technique consisting in ensuring zero mean and unit variance for activations for each minibatch resulting in training time reduction by an order of magnitude.

Skewed cost surface
(a) Skewed cost surface due to unnormalized data. The gradient (red arrow) points in a direction almost orthogonal to the optimum.
Normalized cost surface
(b) Normalized cost surface. The red arrows (gradients at each timesteps) correctly point towards the optimum.
Figure 2.5: Examples of contour plots of cost surfaces of one neuron with two weights.

In the next section we extend the discussion of deep learning techniques to reinforcement learning.

2.2 Reinforcement Learning

Reinforcement learning is concerned with learning through interaction with an environment. At every timestep, a learner or decision-maker called agent executes an action and the environment in turn yields a new observation and a reward, as shown in Figure 2.6. The task of the agent is to maximize the sum of rewards received during the interaction with the environment.

Agent Environment action reward state
Figure 2.6: Agent-environment interaction in reinforcement learning. The agent can be anything interacting with an environment and able to improve its behavior (a human, an animal, a control system). Everything not learned is part of the environment (sensors, motors, rewards).

More formally, we can define a state sSRdss \in S \subset \mathbb{R}^{d_s} as all the information the agent has about the environment at a given timestep. Generally, this information might not include the full state of the environment (e.g., in a card game the state would include the agent's hand but not all opponents' hands even though they are part of the full state of the game).

An action aARdaa \in A \subset \mathbb{R}^{d_a} encodes how the agent can interact with the environment. The mapping from states to actions is called policy π:SA\pi: S \to A. A policy can either be stochastic (e.g., a probability distribution of actions over states) or deterministic.

The reward rRr \in \mathbb{R} is a feedback informing the agent about the immediate quality of its actions. Typically, the function generating the rewards is defined by an expert and can depend on the last state and action r:S×ARr: S \times A \to \mathbb{R}. The reward signal can encode the goal of the agent at different levels. For instance, in chess the agent can be rewarded at each capture or only at the end of the match.

The goal of the agent is to maximize the sum of the rewards received by the environment, namely the return R=tγtrtR = \sum_t{\gamma^t r_t}. The discount factor γ(0,1]\gamma \in (0,1] is to guarantee the convergence of the sum if the time horizon is not finite (infinite horizon). In the next section, we present a brief overview of classical approaches to solve reinforcement learning problems.

2.2.1 Learning Approaches

Reinforcement learning algorithms can be categorized as either value-based, policy-based or a combination of the two. Value-based methods consist in explicitly learning the value of all states and using it to select the action that leads to the highest-valued state. In this setting, the value function VV is defined as the expected return of the agent being in state sts_t and then following the policy π\pi, while the action-value function QQ is defined as the expected return of the agent being in state sts_t, executing action ata_t and then following the policy π\pi. The advantage function AA connects both.

Vπ(st)=Erit,si>tE; aitπ [i=tTγ(it)r(si,ai)],Qπ(st,at)=Erit,si>tE; ai>tπ [i=tTγ(it)r(si,ai)],Aπ(st,at)=Qπ(st,at)Vπ(st).\begin{aligned} V^{\pi}(s_t) &= \mathbb{E}_{r_{i \geq t}, s_{i>t} \sim E ; \ a_{i \geq t} \sim \pi} \ \Big[\sum_{i=t}^T \gamma^{(i-t)} r(s_i,a_i) \Big], \\ Q^{\pi}(s_t,a_t) &= \mathbb{E}_{r_{i \geq t}, s_{i>t} \sim E ; \ a_{i > t} \sim \pi} \ \Big[\sum_{i=t}^T \gamma^{(i-t)} r(s_i,a_i) \Big], \\ A^{\pi}(s_t,a_t) &= Q^{\pi}(s_t,a_t) - V^{\pi}(s_t). \end{aligned}

For deterministic policies, QQ can be formulated recursively by the so-called Bellman equation

Qπ(st,at)=Ert,st+1E [rt+γQπ(st+1,π(st+1))].Q^{\pi}(s_t,a_t) = \mathbb{E}_{r_t,s_{t+1} \sim E} \ [r_t + \gamma Q^{\pi}(s_{t+1},\pi(s_{t+1}))].

Knowing Qπ(st,at)Q^{\pi^*}(s_t,a_t) for each state-action pair, the best policy π\pi^* is to select the action with the highest action value at every timestep, i.e., π(st)=argmaxa Qπ(st,a)\pi^*(s_t) = \text{argmax}_a \ Q^{\pi^*}(s_t,a). In practice we neither know QπQ^{\pi^*} nor π\pi^* in the first place. However, even when starting from a random QQ, it is possible to iteratively update QQ by exploiting the Bellman equation and to converge to QπQ^{\pi^*} (and therefore to π\pi^*). This powerful approach is a key element in recent Deep-Q-Networks (DQN), which will be discussed further in Section 3.1.

In continuous action spaces, however, maximizing over QQ is infeasible as there are infinite actions to consider. Even discretizing the action space becomes intractable for relatively low dimensional action spaces (curse of dimensionality). To overcome this issue, it might be better to directly learn a (parametrized) policy instead of a value function. Algorithms following this approach are called policy-based. One of the most successful class of policy-based algorithms is policy gradient [8, 9, 10]. Policy gradient approaches directly optimize the policy parameters ξ\xi by following the direction of the gradient of the expected return with respect to the policy parameters (ξ R\nabla_\xi \ R), which can be directly estimated from samples.

A mixture of the value-based and policy-based approaches are actor-critic algorithms. These approaches make use of both a parametric policy (actor) and a value function estimator (critic), used to improve the policy. A very recent actor-critic approach for learning deterministic policies is Deterministic Policy Gradients (DPG) [11]. DPG uses a differentiable action-value function approximator to obtain the policy gradient by taking the derivative of its output with respect to the action input dQ/dadQ/da. The policy gradient is then computed as ξQ=aQξπ\nabla_\xi Q = \nabla_a Q \cdot \nabla_\xi \pi. Deep Deterministic Policy Gradient (DDPG), a DPG with neural network function approximators, will be discussed further in Section 3.2.

2.2.2 Exploration and Exploitation

One of the biggest issues in reinforcement learning is the tradeoff between exploration and exploitation. Acting greedily (exploitation) with respect to an approximated function (e.g., Q-function) and choosing the current best action might prevent the agent from discovering new better states and therefore prevent improvement of the policy. On the contrary, excessive exploration might slow down the learning or even results in harmful policies. A tradeoff is therefore necessary. Usually, noise is added to the actions during training. In the case of discrete actions, the ϵ\epsilon-greedy policy is a common solution: the agent acts randomly with probability ϵ\epsilon and greedily with probability 1ϵ1-\epsilon. In the case of continuous actions, Gaussian noise could instead be added. In this thesis, we will use these simple strategies. However, the exploration-exploitation tradeoff is still an open problem in reinforcement learning. Alternative exploration strategies might consist in artificially rewarding exploration or even leaving the decision about exploration completely to the agent. We will come back to this topic in Section 6.2.

3 Deep Reinforcement Learning

As discussed in the previous section, reinforcement learning heavily relies on either approximating the Q-function (in the case of value-based algorithms) or on the policy parameterization (in the case of policy-based algorithms). In both cases, the use of rich function approximators or policies allows reinforcement learning to scale to complex problems. Neural networks have been successfully used as function approximators in supervised learning and are differentiable, a useful quality for many reinforcement learning algorithms. In this section, we focus on two recent reinforcement learning algorithms. The first one, Deep Q-Network (DQN) [1, 13], is a value-based algorithm successfully able to play 49 Atari games from pixels better than human experts. The second one, Deep Deterministic Policy Gradients (DDPG) [12], is an actor-critic algorithm, an extension to DQN for continuous actions.

3.1 Deep Q-Network (DQN)

DQN is a value-based algorithm using a neural network Q(s,aθ)Q(s,a | \theta) to approximate the optimal action-value QQ^* of each action in a given state. The training targets for QQ are computed via the Bellman equation yj=rj+γQ(sj,π(sjθ) θ)y_j = r_j + \gamma Q(s_j, \pi(s_j | \theta') \ | \theta'). The network is trained to minimize the mean squared error with respect to the Q-function, i.e.,

C(θ θ)=1mj(rj+γ Q(sj+1,π(sj+1θ) θ)yj  Q(sj,π(sjθ) θ))2=1mj(rj+γ maxa Q(sj+1,a θ)yj  maxa Q(sj,a θ) )2.\begin{aligned} C(\theta \ | \theta') &= \frac{1}{m} \sum_j \bigg( \underbrace{r_j + \gamma \ Q\big(s_{j+1},\pi(s_{j+1} | \theta') \ \big| \theta'\big)}_{y_j} \ - \ Q\big(s_{j},\pi(s_{j} | \theta) \ \big| \theta \big) \bigg) ^ 2 \\ &= \frac{1}{m} \sum_j \bigg( \underbrace{r_j + \gamma \ \max_a \ Q\big(s_{j+1},a \ \big| \theta'\big)}_{y_j} \ - \ \max_a \ Q\big(s_{j},a \ \big| \theta'\big) \ \bigg) ^ 2. \end{aligned}

However, the dependence of the Q targets on QQ itself can lead to instabilities or even divergence during learning. Having a second set of network parameters θ=LP(θ)\theta' = \text{LP}(\theta), where LP is a low-pass filter (e.g., exponential moving average), stabilizes the learning.

Additional instabilities can arise from training directly on the incoming states and rewards, since, unlike supervised learning, the input data (state-action pairs) is highly correlated as they are part of a trajectory. Furthermore, the policy and therefore data distribution might change quickly as QQ evolves. DQN solves both issues by storing all (st,at,rt,st+1)(s_t,a_t,r_t,s_{t+1}) transitions in a replay memory dataset DD and then learning on minibatches consisting of random transitions from DD. This trick breaks the correlation of the input data and smoothes the changes in the input distribution. It also increases data efficiency by allowing the agent to perform multiple gradient descend steps on the same transition.

π s Q q(a₁) q(aₙ) argmax a
Figure 3.1: Architecture of DQN. The Q-Network outputs an estimate of the action-value function for each action. Subsequently, the policy π\pi chooses the action with the highest value.

Finally, to ensure sufficient exploration, a simple ϵ\epsilon-greedy policy, i.e., π(sθ)=argmaxaQ(s,aθ)\pi(s | \theta) = \arg\max_a Q(s,a | \theta). This off-policy approach is possible because the algorithm does not learn on full trajectories but only on isolated transitions. The complete algorithm is shown in Algorithm 1.

Algorithm 1: DQNInitialize replay memory DD Initialize QQ with random weights θ\theta Initialize target weights θθ\theta' \leftarrow \theta for t=1t = 1 to TT do Select action at=ϵ-greedy [argmaxa Q(st,aθ)]a_t = \epsilon\text{-greedy}~[\text{argmax}_a \ Q(s_t,a | \theta)] Execute ata_t and observe reward rtr_t and next state st+1s_{t+1} Store transition (st,at,rt,st+1)(s_t,a_t,r_t,s_{t+1}) in DD Sample random minibatch of mm transitions from DD Set targets yj=rj+γmaxaQ(a,sj+1θ)y_j = r_j + \gamma \max_a Q(a,s_{j+1} | \theta') Perform gradient descent on cost C=1mj(yjQ(sj,ajθ))2C = \frac{1}{m}\sum_j(y_j - Q(s_j,a_j | \theta))^2 with respect to θ\theta Update θLP(θ)\theta' \leftarrow \text{LP}(\theta) end for

3.2 Deep Deterministic Policy Gradient (DDPG)

As discussed in the introduction, a parametrized policy is advantageous for control because it allows for learning in continuous action spaces. Since DDPG is an actor-critic policy gradient algorithm, there is a policy network π(sξ)\pi(s | \xi) with parameters ξ\xi in addition to the action-value network Q(s,aθ)Q(s,a | \theta) with parameters θ\theta.

The training targets for Q are computed as in DQN, with the only difference that π\pi now depends on ξ\xi. Using the mean squared error, we derive the cost function for Q

C(θ  θ,ξ)=1mj(rj+γ Q(sj+1,π(sj+1ξ) θ)yj  Q(sj,π(sjξ) θ))2.C(\theta \ | \ \theta', \xi') = \frac{1}{m} \sum_j \bigg( \underbrace{r_j + \gamma \ Q\big(s_{j+1},\pi(s_{j+1} | \xi') \ \big| \theta'\big)}_{y_j} \ - \ Q\big(s_{j},\pi(s_{j} | \xi) \ \big| \theta \big) \bigg) ^ 2.

As the targets depend on the explicit policy network, we also need the target policy parameters ξ=LP(ξ)\xi' = LP(\xi). Here, LP will be an exponential moving average with update rule ξτξ+(1τ)ξ\xi' \leftarrow \tau \xi + (1-\tau) \xi' and θτθ+(1τ)θ\theta' \leftarrow \tau \theta + (1-\tau) \theta'.

The policy is trained via policy gradient

ξ Q(sj,π(sjξ) θ)=aQ(sj,π(sjξ) θ)ξ π(sjξ),\nabla_\xi \ Q\big(s_j, \pi(s_j | \xi) \ \big| \theta \big) = \nabla_a Q\big(s_j,\pi(s_j | \xi) \ \big| \theta \big) \cdot \nabla_\xi \ \pi(s_j |\xi),

that is, QQ is the cost function for π\pi

C(ξ  θ) = - Q(sj,π(sjξ) θ).C(\xi \ | \ \theta) \ = \ \text{-} \ Q\big(s_j, \pi(s_j | \xi) \ \big| \theta \big).

As actions are continuous, correlated Gaussian noise Mt\mathcal{M}_t is added to the actions to ensure exploration. More specifically, Mt+1ϑMt+N(0,σ)\mathcal{M}_{t+1} \leftarrow \vartheta \cdot \mathcal{M}_t + \mathcal{N}(0,\sigma), where N(0,σ)\mathcal{N}(0,\sigma) is a normally distributed random variable and ϑ\vartheta a hyperparameter controlling the frequency of the noise. Again, this off-policy approach is possible because the algorithm does not learn on trajectories but only on isolated transitions. Algorithm 2 shows the complete learning procedure.

s π Q q a
Figure 3.2: Architecture of DDPG. The policy π\pi is trained by back-propagating the qq-gradient with respect to the action aa.
Algorithm 2: DDPGInitialize replay memory DD Initialize π\pi with random weights ξ\xi and target weights ξξ\xi' \leftarrow \xi Initialize QQ with random weights θ\theta and target weights θθ\theta' \leftarrow \theta for t=1t = 1 to TT do Select action at=π(st)+Mta_t = \pi(s_t) + \mathcal{M}_t Execute ata_t and observe reward rtr_t and next state st+1s_{t+1} Store transition (st,at,rt,st+1)(s_t,a_t,r_t,s_{t+1}) in DD Sample random minibatch of mm transitions from DD Set Q targets yj=rj+γQ(sj+1,π(sj+1ξ) θ)y_j = r_j + \gamma Q(s_{j+1}, \pi(s_{j+1}|\xi') \ | \theta') Perform gradient descent on cost C=1mj(yjQ(sj,ajθ))2C = \frac{1}{m}\sum_j(y_j - Q(s_j,a_j | \theta))^2 with respect to θ\theta Perform gradient ascent on Q(sj,π(sjξ) θ)Q(s_j,\pi(s_j | \xi) \ | \theta) with respect to ξ\xi Update θLP(θ)\theta' \leftarrow \text{LP}(\theta) Update ξLP(ξ)\xi' \leftarrow \text{LP}(\xi) end for

4 Implementation

A big challenge in deep reinforcement learning implementations are efficient neural network routines. The most critical part are the inner products of inputs and weights and the respective derivatives in each layer. We first coded in MATLAB, a popular machine learning tool providing efficient linear algebra routines. Due to the lack of automatic differentiation, we implemented gradient propagation routines, as well as the RMSProp and ADAM optimizers. The code can be found on the CD of this thesis.

However, due to the high computational demands of neural networks, we also developed a TensorFlow implementation. TensorFlow [14] is an open source Python / C++ library providing fast routines (30\sim 30 times faster than MATLAB from our experience) for deep learning, accomplished through GPU support. It features automatic differentiation and therefore allows for much more flexible network and algorithm design. Additionally, TensorFlow provides a wide variety of built-in optimizers (e.g., ADAM) and operations (e.g., batch normalization). The results shown in this thesis have been produced by the TensorFlow implementation, which can be found on the CD of this thesis.

Another concern in setting up the testing framework regarded the ability to run several experiments on computing clusters with job schedulers while having a changing codebase. For this purpose, we wrote ezex, a small framework that provides basic operations (e.g., starting and aborting jobs) and a visualization of running and finished LSF (or SLURM) jobs via Jupyter Notebooks.

Ezex
Figure 4.1: With ezex it is possible to visualize experiments located in a central folder on a (shared) file system. It also provides routines for aborting jobs and deleting experiments.

4.1 DDPG

Our DDPG implementation is as close as possible to the original paper. Its evaluation is performed on low-dimensional true state inputs and not on visual inputs due to time and software limitations. Figure 4.2 shows the layouts of the neural networks used, while Table 4.1 the hyperparameters used. The layers are initialized with small random weights.

State s ReLU 400 Units ReLU 300 Units Tanh Action a
(a) Layout of the policy network.
State s Action a ReLU 400 Units ReLU 300 Units Linear q
(b) Layout of the action-value network.
Figure 4.2: DDPG networks layout. Each network has about 130,000 parameters.
HyperparameterValueNote
Minibatch size mm32
Policy learning rate10410^{-4}Step size for ADAM
Critic learning rate10310^{-3}Step size for ADAM
Policy weight decay00
Critic weight decay10210^{-2}L2 cost for each parameter
Target update rate τ\tau10310^{-3}
Replay memory size500,000500,000Older transitions are discarded
Exploration noise σ\sigma0.2
Exploration noise reversion rate ϑ\vartheta0.15The overall noise variance is 0.13
Reward discount γ\gamma0.99
Warm-up time50,00050,000Timesteps until training starts
Table 4.1: Hyperparameters for DDPG.

4.2 The Cart-Pole Problem

Although the final goal is to apply deep reinforcement learning in real-world settings (e.g., robotics), testing in these environments can be dangerous and expensive both in terms of money and human expert labor. Simulated environments are well suited for testing reinforcement learning systems extensively as the testing process is cheap and can be easily automated. For these reasons, both DQN and DDPG have been originally evaluated in simulated environments. In the original works DQN was tested on several Atari games using the simulator ALE [15], while DDPG has been evaluated in more than 20 physical environments using the MuJoCo [16] simulator. The authors showed that the algorithms and the same hyperparameters work on a wide variety of tasks without the need of hand-tuning them for each individual task. In our implementation, we also used MuJoCo but, because of time constraints, we evaluated DDPG only on the cart-pole (Figure 4.3), a standard reinforcement learning benchmark problem. The task is a standard non-linear control task. The goal is to indirectly swing-up and balance a pole placed on a cart by applying control force to the cart. Environment details (e.g., masses, friction coefficients, etc.) can be found in the MuJoCo model file cartpole.xml.

Cart-pole
Figure 4.3: The cart-pole environment. The state consists of four dimensions: the vertical cart position s0[1,1]s_0 \in [-1,1], the angle of the pole s1[π,π)s_1 \in [-\pi,\pi) and their derivatives s2=s0˙s_2 = \dot{s_0} and s3=s1˙s_3 = \dot{s_1}. In our case the agent is able to fully observe the state.

5 Evaluation

Even though DDPG turned out to be highly sensitive to the reward functions, they have not been included in the original paper. Among the several reward functions we tried, the most reliable one is

r=0.5cos(s1) 0.03a2 0.015s0 0.2s32.r = 0.5 \cdot \cos(s_1) \ - 0.03 \cdot a^2 \ - 0.015 \cdot |s_0| \ - 0.2 \cdot s_3^2 .

Here, cos(s1)\cos(s_1) has the highest influence, giving -1 when the pole is hanging down and 1 when it is standing up. The additional terms consist in penalties for the action aa, the cart position s0s_0 (for not being in the middle) and the rotation speed of the pole s3s_3. However, we noticed that even slight changes to the parameters of the reward functions impaired the learning dramatically.

Below, we report results on varying the reward function and the hyperparameters. In particular, we evaluate the effects of batch normalization, experience replay and the use of a target network. All experiments have been averaged over ten trials.

5.1 Reference Results Without Batch Normalization

Here, we report the best results we achieved. The reward function used is the handcrafted one described above and we did not use batch normalization. With these settings, the agent was consistently able to learn the optimal swing-up and balance policy. The returns and some sample trajectories over the course of the learning process are depicted in Figure 5.1 and Figure 5.2, respectively.

We stress that, even though we used constant episode lengths, shifting the rewards (i.e., adding a constant) had an impact on the learning stability. At first we added a bias to ensure the returns for the initial policy were zero to match the initial QQ values, but having a negative bias (introduced by cos(s1)\cos(s_1)) improved the learning consistency significantly.

cp0
(a) Trajectory after 0 timesteps of learning.
cp1
(b) Swing-up after 10,000\sim 10,000 timesteps.
cp3
(c) Balance after 100,000\sim 100,000 timesteps.
cp4
(d) Near optimal after 200,000\sim 200,000 timesteps.
Figure 5.1: Cart pole test trajectories, showing the action aa (blue), the cart position s0s_0 (green) and the normalized pole angle s1s_1 (red). An angle of -1 or 1 indicates that the pole is hanging down, while 0 that it is standing up.
ret-std
Figure 5.2: Expected return with the handcrafted reward function and without batch normalization. The black line denotes the mean, while the blue area the interquartile range. During the warm-up time the agent is acting according to the initial policy (i.e., randomly) and is only filling the replay memory, therefore the expected return does not improve.

5.2 Applying Batch Normalization

We then evaluated the same reward function with batch normalization. Surprisingly, we observed much worse results, as shown in Figure 5.3. Even though we tried the canonical as well as the non-standard batch normalization used in the DDPG paper, the agent was never able to balance after swinging up. This behavior might be due to the cart-pole environment, as the original paper also reported slightly inferior results for batch normalization on cart-pole. It could also be due to the fact that batch normalization introduces additional noise during learning. A very recent method claiming to alleviate this issue and that could be interesting for future research is weight normalization [17].

ret-bn
Figure 5.3: Expected return with batch normalization. The agent only learns to swing up and fails at balancing.

5.3 Disabling Target Network and Replay Memory

Consistent with the findings from the original paper, we observed reduced performance when disabling the target networks (Figure 5.4) or the replay memory (Figure 5.5).

ret-notarget
Figure 5.4: Expected return without target networks (i.e., setting the target update rate τ=1\tau = 1).
ret-online
Figure 5.5: Expected return without experience replay (i.e., the neural network is trained on the last collected 32 transitions).

5.4 Sparse Reward Functions

As already discussed, we found DDPG to be very sensitive to the reward function. Initially, we were able to learn only with smooth reward functions, while using a sparse reward function encoding only the actual goal (i.e., rewarding the agent only when the pole is standing upright r=(s1<0.01)r = (|s_1|< 0.01)), there was no improvement over the initial policy at all, as shown in Figure 5.6. At first, we thought that this behavior might be due to poor exploration of the state space and that the agent, never reaching the goal state, does not know where to get positive rewards. While this might in fact be an issue in more complicated environments, it was not the issue in the cart-pole. Instead, we found that increasing the magnitude of the reward and increasing the warm-up time helped the agent to learn good action value function and policy, as shown in Figures 5.7, 5.8 and 5.9. In general, extending the warm-up time stabilized learning tremendously. This behavior might be due to the fact that transitions from initially diverging policies have a lower chance to get sampled from the replay memory since the replay memory is already filled with many transitions from the warm-up time.

ret-hardlow
Figure 5.6: Expected return with r=(s1<0.01)0.03a2r = (|s_1|< 0.01)-0.03 \cdot a^2 and $t_\text{warmup} = 10,000$. With a sparse reward function and a short warm-up time, there is no learning at all.
ret-hard10k
Figure 5.7: Expected return with r=4(s1<0.01)0.03a2r = \mathbf{4} \cdot (|s_1|< 0.01)-0.03 \cdot a^2 and $t_\text{warmup} = 10,000$. Increasing the magnitude of the positive reward helped the agent, although the learned policy is far from the optimal one.
ret-hard50k
Figure 5.8: Expected return with r=4(s1<0.01)0.03a2r = \mathbf{4} \cdot (|s_1|< 0.01)-0.03 \cdot a^2 and $t_\text{warmup} = \mathbf{50,000}$. Increasing the warm-up time further increased the performance.
ret-hard100k
Figure 5.9: Expected return with r=4(s1<0.01)0.03a2r = \mathbf{4} \cdot (|s_1|< 0.01)-0.03 \cdot a^2 and $t_\text{warmup} = \mathbf{100,000}$. With both a high-magnitude positive reward function and a sufficiently large warm-up time, the agent is able to always learn a near optimal policy.

6 Conclusion and Future Work

With this thesis, we have provided an analysis of the state-of-the-art deep reinforcement learning algorithm DDPG. The results from the original papers have been reproduced and we successfully trained an agent for solving the cart-pole balance and swing-up tasks with continuous actions. We provided an analysis of the performance for sparse reward functions and evaluated the use of batch normalization, target network and replay memory.

We showed that DDPG was able to learn a non-trivial control policy, but there are several limitations that need to be solved before it will be possible to apply it to robotics. Below, we identify the major weaknesses and propose future avenues of research.

6.1 Data Efficiency

The first significant issue of deep reinforcement learning is data efficiency. In our test cases we experienced that in order to learn high-quality policies, the algorithm had to be fed with a high number of samples. Considering that real control problems are much harder than the simple task we evaluated, DDPG is not ready for real world problems.

There are, however, several papers improving DQN and some of the techniques proposed can be applied to DDPG with slight modifications. Prioritized experience replay [18] is for instance a technique for improving data efficiency by not sampling transitions uniformly from the replay memory but instead sampling according to the importance of a transition. Using the temporal difference error as a measure of importance, the authors achieved new state-of-the-art results on the DQN Atari benchmark games.

An alternative way to reduce the number of parameters (and therefore the number of required samples) is to move the first layers of the both the policy and the QQ networks to a common preprocessing network ϕ\phi, as shown in Figure 6.1. Since neurons in the first layers have to be useful to many other neurons in the following layers, they tend to learn task independent features anyway (e.g., for visual inputs these are typically edge filters as in Figure 2.3) and could therefore be combined in a single network. Such a network would then be trained by the gradients from both π\pi and QQ. Having ϕ\phi would also allow us to introduce additional function approximators for the value function V(s)V(s) or to learn model st+1=M(st,at)s_{t+1} = M(s_t,a_t) without much parameter overhead. Furthermore, learning a model and a value function could serve as a good regularizer for ϕ\phi since transition data (model) and value data are currently unused in the DDPG framework.

s φ π Q q a
Figure 6.1: DDPG with a preprocessing network ϕ\phi.

Moreover the model and the value function can be used to estimate another policy gradient, since the expected return can be also expressed via E[R]=r(st,π(st))+γ V(M(st,π(st)))\mathbb{E}[R] = r(s_t,\pi(s_t)) + \gamma \ V( M(s_t,\pi(s_t))). This approach is called value gradient. Because it assumes a deterministic model, it is only applicable in deterministic environments. However, there have been recent advances in expressing stochasticity in neural networks [19]. Building on this idea, Heess et al. [20] presented Stochastic Value Gradients, extending the value gradient approach to stochastic neural network models and policies. Even though these approaches have not been covered in this thesis, they seem a promising direction for further research.

6.2 Exploration

Another possible improvement to DDPG regards more efficient exploration. More specifically, stochastic policies that are able to encode multiple good actions instead of relying on ϵ\epsilon-greedy exploration can speed up learning exponentially as Osband et al. [21] showed for DQN. As already mentioned in the introduction, rewarding exploration can be a way to enable the agent to actively explore. By learning a transition model for DQN and using the model error as a proxy for exploration (i.e., rewarding according to the model error) Stadie et al. [22] were able to improve performance especially where the original DQN performed poorly.

6.3 Imitation

In robotics, a simpler way to adopt complex behavior or to bootstrap the learning is imitation, consisting in using target trajectories for supervised learning of policies. In the case of DDPG, imitation could be employed by either initializing the networks by supervised training or by injecting target trajectories into the replay memory. For example, recently Silver et al. [2] were tremendously successful in the game of Go by first pre-training a policy network on human expert moves and subsequently improving it via reinforcement learning.

With regard to robotics, another helpful approach to avoid dangerous policies would be to initialize the Q-network with negative values for specific trajectories (e.g., for the violation of the joint limits).

6.4 Curriculum Learning

Curriculum learning [23] is a deep learning approach to learn complex functions by training on easier and more fundamental examples first and only gradually introducing more complicated examples. This could be also applied to reinforcement learning and especially robotics. For a robot the most fundamental concepts to learn are its own body dynamics and the skill to move without hurting itself. For example in table tennis, a robot could first learn to explore itself carefully (e.g., giving it rewards for exploration or punishing violation of joint limits) and subsequently learn striking movements and to win a match.

References

  1. Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., et al. (2015). Human-level control through deep reinforcement learning. Nature, 518(7540), 529–533.
  2. Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., van den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al. (2016). Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587), 484–489.
  3. Ng, A. Y., Coates, A., Diel, M., Ganapathi, V., Schulte, J., Tse, B., Berger, E., and Liang, E. (2006). Autonomous inverted helicopter flight via reinforcement learning. In Experimental Robotics IX, 363–372. Springer.
  4. Rosenblatt, F. (1958). The perceptron: a probabilistic model for information storage and organization in the brain. Psychological Review, 65(6), 386.
  5. Krizhevsky, A., Sutskever, I., and Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems 25, 1097–1105.
  6. Kingma, D., and Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
  7. Ioffe, S., and Szegedy, C. (2015). Batch normalization: accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167.
  8. Williams, R. J. (1992). Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3-4), 229–256.
  9. Baxter, J., and Bartlett, P. L. (2000). Direct gradient-based reinforcement learning. In Proceedings of the IEEE International Symposium on Circuits and Systems (ISCAS), 3, 271–274.
  10. Amari, S.-I. (1998). Natural gradient works efficiently in learning. Neural Computation, 10(2), 251–276.
  11. Silver, D., Lever, G., Heess, N., Degris, T., Wierstra, D., and Riedmiller, M. (2014). Deterministic policy gradient algorithms. In ICML.
  12. Lillicrap, T. P., Hunt, J. J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D., and Wierstra, D. (2015). Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971.
  13. Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D., and Riedmiller, M. (2013). Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602.
  14. Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z., Citro, C., Corrado, G. S., Davis, A., Dean, J., Devin, M., et al. (2015). TensorFlow: large-scale machine learning on heterogeneous systems. Software available from tensorflow.org.
  15. Bellemare, M. G., Naddaf, Y., Veness, J., and Bowling, M. (2013). The arcade learning environment: an evaluation platform for general agents. Journal of Artificial Intelligence Research, 47, 253–279.
  16. Todorov, E., Erez, T., and Tassa, Y. (2012). MuJoCo: a physics engine for model-based control. In Proceedings of IROS 2012, 5026–5033. IEEE.
  17. Salimans, T., and Kingma, D. P. (2016). Weight normalization: a simple reparameterization to accelerate training of deep neural networks. arXiv preprint arXiv:1602.07868.
  18. Schaul, T., Quan, J., Antonoglou, I., and Silver, D. (2015). Prioritized experience replay. arXiv preprint arXiv:1511.05952.
  19. Kingma, D. P., and Welling, M. (2013). Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114.
  20. Heess, N., Wayne, G., Silver, D., Lillicrap, T., Erez, T., and Tassa, Y. (2015). Learning continuous control policies by stochastic value gradients. In Advances in Neural Information Processing Systems, 2926–2934.
  21. Osband, I., Blundell, C., Pritzel, A., and Van Roy, B. (2016). Deep exploration via bootstrapped DQN. arXiv preprint arXiv:1602.04621.
  22. Stadie, B. C., Levine, S., and Abbeel, P. (2015). Incentivizing exploration in reinforcement learning with deep predictive models. arXiv preprint arXiv:1507.00814.
  23. Bengio, Y., Louradour, J., Collobert, R., and Weston, J. (2009). Curriculum learning. In Proceedings of the 26th Annual International Conference on Machine Learning, 41–48. ACM.