Maximilian Igl

Machine Learning and Deep Reinforcement Learning

I'm a fourth-year student at Oxford, working with Shimon Whiteson in the Whirl-Group and mainly interested in deep reinforcement learning. Recently, I was lucky to work with Sam Devlin (at MSR Cambridge) and Nicolas Heess (at DeepMind) during two internships. My original background is in Physics and Economics, both at the University of Munich (LMU), as well as Technology Management.

Google Scholar | Twitter | LinkedIn | GitHub | Email

Research

My work focusses on improving the transferability and generalization capabilities of reinforcement learning agents by leveraging ideas from hierarchical reinforcement learning, variational autoencoders and information theory. I'm most excited about deepening our understanding of RL and neural networks and developing new methods based on those insights.

Below is a list of selected papers. For a full list, please see my Google Scholar profile.

The Impact of Non-stationarity on Generalization in Deep Reinforcement Learning

Abstract

Non-stationarity arises in Reinforcement Learning (RL) even in stationary environments. Most RL algorithms collect new data throughout training, using a non-stationary behavior policy. Furthermore, training targets in RL can change even with a fixed state distribution when the policy, critic, or bootstrap values are updated. We study these types of non-stationarity in supervised learning settings as well as in RL, finding that they can lead to worse generalization performance when using deep neural network function approximators. Consequently, to improve generalization of deep RL agents, we propose Iterated Relearning (ITER). ITER augments standard RL training by repeated knowledge transfer of the current policy into a freshly initialized network, which thereby experiences less non-stationarity during training. Experimentally, we show that ITER improves performance on the challenging generalization benchmarks ProcGen and Multiroom.

M. Igl, G. Farquhar, J. Luketina, W. Boehmer, S. Whiteson | arXiv | code

Non-stationarity arises in Reinforcement Learning (RL) even in stationary environments, for example because we collect data using a constantly changing policy. In this work, we investigate how this affects generalization in RL and propose a new method, called Iterated Relearning (ITER), to improve generalization.

Non-stationarity has minimal effect on final training performance...
... but a large effect on test performance, i.e. generalization.
Evaluation of ITER on unseen ProcGen test-levels.
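
The knowledge-transfer step at the core of ITER can be illustrated with a minimal sketch. It assumes a tabular softmax policy instead of a deep network, and it shows only the distillation of a fixed teacher into a freshly initialized student by gradient descent on KL(teacher || student); the RL training itself is omitted. All names (`distill`, `teacher_logits`, the learning rate and step count) are hypothetical and not taken from the paper's code.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl(p, q):
    """KL(p || q) for two discrete distributions with q > 0 everywhere."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def distill(teacher_logits, num_steps=500, lr=0.5):
    """Transfer a tabular teacher policy into a freshly initialized student.

    teacher_logits maps each state to a list of action logits.
    The student starts from all-zero logits (a uniform policy).
    """
    student = {s: [0.0] * len(l) for s, l in teacher_logits.items()}
    for _ in range(num_steps):
        for s, t_logits in teacher_logits.items():
            p_teacher = softmax(t_logits)
            p_student = softmax(student[s])
            # The gradient of KL(p_teacher || p_student) with respect to the
            # student logits is simply p_student - p_teacher.
            student[s] = [
                z - lr * (ps - pt)
                for z, ps, pt in zip(student[s], p_student, p_teacher)
            ]
    return student


# Hypothetical two-state, three-action teacher policy.
teacher = {"s0": [2.0, 0.0, -1.0], "s1": [-1.0, 1.0, 0.5]}
student = distill(teacher)
max_kl = max(kl(softmax(teacher[s]), softmax(student[s])) for s in teacher)
```

In this toy setting the student simply recovers the teacher's action probabilities. The point of ITER, as described above, is that a freshly initialized network trained this way has experienced less non-stationarity than the network it replaces.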

Generalization in Reinforcement Learning with Selective Noise Injection and Information Bottleneck

Abstract

The ability for policies to generalize to new environments is key to the broad application of RL agents. A promising approach to prevent an agent's policy from overfitting to a limited set of training environments is to apply regularization techniques originally developed for supervised learning. However, there are stark differences between supervised learning and RL. We discuss those differences and propose modifications to existing regularization techniques in order to better adapt them to RL. In particular, we focus on regularization techniques relying on the injection of noise into the learned function, a family that includes some of the most widely used approaches such as Dropout and Batch Normalization. To adapt them to RL, we propose Selective Noise Injection (SNI), which maintains the regularizing effect the injected noise has, while mitigating the adverse effects it has on the gradient quality. Furthermore, we demonstrate that the Information Bottleneck (IB) is a particularly well suited regularization technique for RL as it is effective in the low-data regime encountered early on in training RL agents. Combining the IB with SNI, we significantly outperform current state of the art results, including on the recently proposed generalization benchmark Coinrun.

Igl M, Ciosek K, Li Y, Tschiatschek S, Zhang C, Devlin S, Hofmann K | NeurIPS 2019 | arXiv | code

We explore stochastic regularization, in particular an information bottleneck (as implemented by the DVIB) in the agent architecture, to improve generalization to previously unseen levels in the Multiroom and Coinrun environments. To make it work well, we propose Selective Noise Injection, which trades off regularization against training stability. The GIFs below show example rollouts of our agent (left) vs. the previous state of the art (right) on unseen levels.

image
image
image
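
As a rough illustration of the information-bottleneck regularizer mentioned above, the sketch below computes the KL penalty used in a deep variational information bottleneck with a diagonal Gaussian encoder and a standard normal prior, and adds it to a task loss. The function names, the `beta` weight and the example numbers are assumptions made for illustration; Selective Noise Injection itself is not shown.

```python
import math


def gaussian_kl(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over latent dimensions."""
    return sum(
        0.5 * (math.exp(lv) + m * m - 1.0 - lv)
        for m, lv in zip(mu, log_var)
    )


def ib_loss(task_loss, mu, log_var, beta=0.01):
    """Task loss plus a beta-weighted bottleneck penalty (beta is a hypothetical default)."""
    return task_loss + beta * gaussian_kl(mu, log_var)


# An encoder output equal to the prior carries no information and costs nothing.
no_penalty = gaussian_kl([0.0, 0.0], [0.0, 0.0])
# Moving the mean away from zero is penalized.
penalized = ib_loss(task_loss=1.0, mu=[1.0], log_var=[0.0], beta=0.1)
```

A larger `beta` compresses the latent representation more strongly, at the cost of discarding information that the policy might need.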

Multitask Soft Option Learning

Abstract

We present Multitask Soft Option Learning (MSOL), a hierarchical multitask framework based on Planning as Inference. MSOL extends the concept of options, using separate variational posteriors for each task, regularized by a shared prior. This allows fine-tuning of options for new tasks without forgetting their learned policies, leading to faster training without reducing the expressiveness of the hierarchical policy. MSOL avoids several instabilities during training in a multitask setting and provides a natural way to learn both intra-option policies and their terminations. We demonstrate empirically that MSOL significantly outperforms both hierarchical and flat transfer-learning baselines in challenging multi-task environments.

Igl, M., Gambardella, A., He, J., Nardelli, N., Siddharth, N., Böhmer, W., & Whiteson, S. | arXiv

We combine ideas from Planning as Inference and hierarchical latent variable models to learn temporally extended skills (called options). Given a set of different tasks, our approach extracts the skills that are most useful across the range of tasks, i.e. the most re-usable ones. Furthermore, we propose the idea of soft options, which allow previously learned skills to be applied successfully even in settings in which they are no longer optimal.

Deep Variational Reinforcement Learning (DVRL) for POMDPs

Abstract

Many real-world sequential decision making problems are partially observable by nature, and the environment model is typically unknown. Consequently, there is great need for reinforcement learning methods that can tackle such problems given only a stream of incomplete and noisy observations. In this paper, we propose deep variational reinforcement learning (DVRL), which introduces an inductive bias that allows an agent to learn a generative model of the environment and perform inference in that model to effectively aggregate the available information. We develop an n-step approximation to the evidence lower bound (ELBO), allowing the model to be trained jointly with the policy. This ensures that the latent state representation is suitable for the control task. In experiments on Mountain Hike and flickering Atari we show that our method outperforms previous approaches relying on recurrent neural networks to encode the past.

Igl, M., Zintgraf, L., Le, T. A., Wood, F., & Whiteson, S. | ICML 2018 | arXiv | code

If the environment is only partially observable (which is typically the case in the real world), the agent has to reason about the possible underlying states of the world. To do so better, we propose Deep Variational Reinforcement Learning (DVRL) which learns a model of the world concurrently with the RL training and uses it to perform inference, i.e. to compute the agent's belief about its current situation.
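
To make the notion of a belief concrete, here is a minimal sketch of an exact Bayes-filter update for a small discrete state space with a known model. This is a deliberate simplification: DVRL learns the model and performs approximate inference in it, whereas the transition matrix and observation likelihoods below are hypothetical and given by hand.

```python
def belief_update(belief, transition, obs_likelihood):
    """One step of a discrete Bayes filter.

    belief[i]          probability of being in state i before the step
    transition[i][j]   probability of moving from state i to state j
    obs_likelihood[j]  probability of the received observation in state j
    """
    n = len(belief)
    # Predict: push the belief through the transition model.
    predicted = [
        sum(belief[i] * transition[i][j] for i in range(n)) for j in range(n)
    ]
    # Correct: reweight by how well each state explains the observation.
    unnormalized = [predicted[j] * obs_likelihood[j] for j in range(n)]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]


# Hypothetical two-state example: the state does not change, and the
# observation is nine times more likely in state 0 than in state 1.
prior = [0.5, 0.5]
stay = [[1.0, 0.0], [0.0, 1.0]]
posterior = belief_update(prior, stay, obs_likelihood=[0.9, 0.1])
```

A policy that conditions on such a belief rather than on the latest observation alone can act sensibly even when individual observations are ambiguous.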

Auto-Encoding Sequential Monte Carlo (AESMC)

Abstract

We build on auto-encoding sequential Monte Carlo (AESMC): a method for model and proposal learning based on maximizing the lower bound to the log marginal likelihood in a broad family of structured probabilistic models. Our approach relies on the efficiency of sequential Monte Carlo (SMC) for performing inference in structured probabilistic models and the flexibility of deep neural networks to model complex conditional probability distributions. We develop additional theoretical insights and introduce a new training procedure which improves both model and proposal learning. We demonstrate that our approach provides a fast, easy-to-implement and scalable means for simultaneous model learning and proposal adaptation in deep generative models.

Le, T. A., Igl, M., Rainforth, T., Jin, T., & Wood, F. | ICLR 2018 | arXiv

We propose to combine Variational Autoencoders (VAEs) with Sequential Monte Carlo (SMC). When VAEs are applied to time-series data, using only one Monte Carlo sample can result in extremely high variance, and even multiple samples with importance weighting might not be sufficient. Instead, we propose to use SMC, which reduces variance through resampling.
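
The SMC estimate of the log marginal likelihood, which serves as the training objective in this kind of approach, can be sketched on a toy model. The example assumes a one-dimensional linear-Gaussian state-space model with fixed parameters and a bootstrap proposal (sampling from the transition prior), so nothing is learned here; in AESMC both model and proposal would be neural networks trained on this quantity. A Kalman filter provides the exact value for comparison. All function names and parameter values are hypothetical.

```python
import math
import random


def log_normal_pdf(x, mean, var):
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)


def simulate(num_steps, a=0.9, q=1.0, r=1.0, seed=1):
    """Sample observations from x_t = a * x_{t-1} + N(0, q), y_t = x_t + N(0, r)."""
    rng = random.Random(seed)
    x = rng.gauss(0.0, 1.0)
    ys = []
    for t in range(num_steps):
        if t > 0:
            x = a * x + rng.gauss(0.0, math.sqrt(q))
        ys.append(x + rng.gauss(0.0, math.sqrt(r)))
    return ys


def smc_log_marginal(ys, num_particles, a=0.9, q=1.0, r=1.0, rng=None):
    """Bootstrap particle filter estimate of log p(y_1, ..., y_T)."""
    rng = rng or random.Random(0)
    particles = [rng.gauss(0.0, 1.0) for _ in range(num_particles)]
    log_z = 0.0
    for t, y in enumerate(ys):
        if t > 0:
            # Propose from the transition prior (bootstrap proposal).
            particles = [a * x + rng.gauss(0.0, math.sqrt(q)) for x in particles]
        log_w = [log_normal_pdf(y, x, r) for x in particles]
        m = max(log_w)
        w = [math.exp(lw - m) for lw in log_w]
        # Log of the average unnormalized weight at this step.
        log_z += m + math.log(sum(w) / num_particles)
        # Multinomial resampling: keep particles that explain the data well.
        particles = rng.choices(particles, weights=w, k=num_particles)
    return log_z


def kalman_log_marginal(ys, a=0.9, q=1.0, r=1.0):
    """Exact log marginal likelihood for the same model."""
    mean, var = 0.0, 1.0
    log_z = 0.0
    for t, y in enumerate(ys):
        if t > 0:
            mean, var = a * mean, a * a * var + q
        s = var + r
        log_z += log_normal_pdf(y, mean, s)
        gain = var / s
        mean, var = mean + gain * (y - mean), (1 - gain) * var
    return log_z


ys = simulate(20)
smc_estimate = smc_log_marginal(ys, num_particles=2000)
exact = kalman_log_marginal(ys)
```

With enough particles the SMC estimate should lie close to the exact value. Without the resampling step, the weights would have to be multiplied across the whole sequence, which is where plain importance weighting tends to degenerate on longer time series.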