CSC2541 Scalable and Flexible Models of Uncertainty (Fall 2017)

Course Information:

Email: rgrosse at cs dot toronto dot edu (put CSC2541 in the subject line)

Lecture: Friday, 2-4pm, in Bahen 1220

Office Hours:

Tuesday, 1-3pm, in Pratt 290F (that’s the D. L. Pratt Building, not the E. J. Pratt Library!)
You need to book a time slot through a URL which will be given out during class.
Your team should book a double time slot the week of your presentation. Please bring a draft of your presentation.

Overview

Over the last 5 years or so, neural networks have driven rapid progress in applications as diverse as vision, language understanding, speech understanding, robotics, chemistry, and game playing. But a major challenge remains: modeling uncertainty, i.e. knowing what one doesn’t know. Good models of uncertainty are crucial whenever an algorithm needs to manage exploration, i.e. decide how and when to acquire new information. Uncertainty also arises in the context of adversarial examples, a recently discovered phenomenon that originally seemed like a curiosity but now appears to be a serious security vulnerability for modern ML systems, one that has so far resisted all attempts to defeat it.

The first half of the course will cover a set of algorithmic tools for modeling uncertainty: Gaussian processes, Bayesian neural nets, and variational inference. We will focus on continuous models and the setting of function approximation, in order to avoid overlap with other iterations of this course (see below). That is, we will have little if any coverage of generative models or discrete latent variables. The second half of the course will cover applications of uncertainty modeling: neural net sparsification, active learning, black-box optimization, reinforcement learning, and adversarial robustness.

Prerequisites

This course is designed to bring students to the current frontier of knowledge on these methods, so that ideally, their course projects can make a novel contribution. Previous background in machine learning, such as CSC411 or ECE521, is strongly recommended. Linear algebra, basic multivariate calculus, basic probability, and programming skills are required.

Relationship with other courses

Since uncertainty is a central topic in machine learning, it’s unsurprising that there are lots of courses at UofT focusing on it. Several departments offer core courses focused on probabilistic machine learning, with varying emphases:

Those courses are all lecture-based, and aim to give broad coverage of probabilistic modeling techniques. They focus on principles which are pretty well understood.

By contrast, CSC2541 is a topics course which aims to bring you to the research frontier on certain topics. Many of the topics have not yet been distilled in an easily accessible form, and a lot of the key experimental findings are still open to multiple interpretations. While the core courses listed above cover a variety of techniques, here we focus on a smaller set (mostly Gaussian processes, Bayesian neural nets, and variational inference) and look at how they can be applied in situations that depend on accurate uncertainty modeling.

CSC2541 is a topics course which is offered repeatedly but with different topics. I’m only planning to offer this version of the course once. David Duvenaud recently taught a version of 2541 focused on generative models, and this coming winter, he’ll be teaching a topics course focused on learning discrete structure. I’ve tried to minimize the overlap with those courses, so if you’ve taken 2541 before, you should be able to take it again without being bored. Hence, this course doesn’t have much material on generative models or discrete latent variables. I think there are roughly 2 weeks of overlap with last year’s iteration of 2541: variational inference and model-based reinforcement learning.

Course structure

After the first two lectures, each week a different team of students will present on the readings for that week. I’ll provide guidance about the content of these presentations.

In-class discussion will center around:

Understanding the strengths and weaknesses of these methods.
Understanding the relationships between these methods, and with previous approaches.
Extensions or applications of these methods.
Experiments that might better illuminate their properties.

The hope is that these discussions will lead to actual research papers, or resources that will help others understand these approaches.

Grades will be based on:

Class presentations - 20%
Project proposal - 20% - due Oct. 12
Final project presentation, report, and code - 60%
- presentations Nov. 24 and Dec. 1
- Project report due Dec. 13 (changed from Dec. 10)

Project

You are asked to carry out an original research project related to the course content and write a (roughly 8-page) report. You’re encouraged to work in teams of 3-5. See here for more specific guidelines.

Calendar:

Date	Topic	Readings
9/15	Overview [Slides]
Overview
- Ghahramani, 2015. Probabilistic machine learning and artificial intelligence.
Bayesian regression
- MacKay, 1992. Bayesian interpolation.
- Rasmussen and Ghahramani, 2001. Occam's razor.
- Review: Bayesian parameter estimation, Bayesian linear regression
Calibration
- Guo et al., 2017. On calibration of modern neural networks.
9/22	Gaussian Processes [Slides]
Foundations
- GPML, Chapter 2
Structured kernels
- skim GPML sections 4.1-4.2
- GPML, Chapter 5
- David Duvenaud's kernel cookbook
We did not get to the following two, so they will be covered in a later lecture:
- Duvenaud et al., 2013. Structure discovery in nonparametric regression through compositional kernel search
- Wilson and Adams, 2013. Gaussian process kernels for pattern discovery and extrapolation
9/29	Bayesian Neural Nets [Slides]
Background
- backpropagation
- Metropolis-Hastings
Foundations
- MacKay, 1992. A practical Bayesian framework for backpropagation networks.
- Neal, 1995. Bayesian learning for neural networks. Chapter 2.
Hamiltonian Monte Carlo
- Neal, 2012. MCMC using Hamiltonian dynamics. (focus on sections 1-3)
Stochastic gradient Langevin dynamics
- Welling and Teh, 2011. Bayesian learning via stochastic gradient Langevin dynamics.
- Balan et al., 2015. Bayesian dark knowledge
10/6	Variational Inference for BNNs [Slides]
Background
- variational Bayes
- Kingma and Welling, 2014. Auto-encoding variational Bayes.
Variational inference for BNNs
- Hinton and van Camp, 1993. Keeping the neural networks simple by minimizing the description length of the weights.
- Graves, 2011. Practical variational inference for neural networks
- Kingma, Salimans, and Welling, 2015. Variational dropout and the local reparameterization trick.
- (optional) Gal and Ghahramani, 2016. Dropout as Bayesian approximation: representing model uncertainty in deep learning
Sparsification
- Louizos et al., 2017. Bayesian compression for deep learning.
10/12	Project proposals due!	Send by e-mail to csc2541-submit at cs dot toronto dot edu. Include "CSC2541 Project Proposal" in subject line.
10/13	Variational Inference for GPs [Slides: 1, 2]
Note: this is among the most mathematically demanding sessions, and the rest of the course doesn't build much on it, so don't get bogged down in the details.
Natural gradient and stochastic variational inference
- Natural gradient tutorial: see Piazza under "Resources"
- Hoffman, Blei, Wang, and Paisley, 2013. Stochastic variational inference
- Note: this material is included because it's used by Hensman et al., and also because natural gradient and SVI are just good things to know about.
Sparse GPs
- Titsias, 2009. Variational learning of inducing variables in sparse Gaussian processes
- Bauer, van der Wilk, and Ghahramani, 2016. Understanding probabilistic sparse Gaussian process approximations.
- Hensman, Fusi, and Lawrence, 2013. Gaussian processes for big data.
Variational inference and generalization
- Seeger, 2002. PAC-Bayesian generalization error bounds for Gaussian process classification (Section 3 can be skimmed)
10/20	Exploration I: Active Learning and Bandits [Slides]
Active Learning
- MacKay, 1992. Information-based objective functions for active data selection.
- Graves et al., 2017. Automated curriculum learning for neural networks.
Bandits
- Auer et al., 2002. Finite-time analysis of the multiarmed bandit problem.
- (optional) Kocsis and Szepesvari, 2006. Bandit based Monte-Carlo planning.
- Russo et al., 2017. A tutorial on Thompson sampling.
10/27	Exploration II: Bayesian Optimization [Slides]
Bayesian optimization
- Snoek, Larochelle, and Adams, 2012. Practical Bayesian optimization of machine learning algorithms.
- Srinivas et al., 2010. Gaussian process optimization in the bandit setting: no regret and experimental design.
- (optional) Snoek et al., 2015. Scalable Bayesian optimization using deep neural networks.
Exploiting structure
- Swersky, Snoek, and Adams, 2013. Multi-task Bayesian optimization
- Swersky, Snoek, and Adams, 2014. Freeze-thaw Bayesian optimization
- (optional) Gardner et al., 2017. Discovering and exploiting additive structure for Bayesian optimization.
11/3	Exploration III: Reinforcement Learning [Slides]
Model-free
- Osband et al., 2016. Deep exploration via bootstrapped DQN.
- Houthooft et al., 2016. VIME: Variational information maximizing exploration.
- (optional) Fortunato et al., 2017. Noisy networks for exploration.
Model-based
- Deisenroth and Rasmussen, 2011. PILCO: a model-based and data-efficient approach to policy search
- Depeweg et al., 2016. Learning and policy search in stochastic dynamical systems with Bayesian neural networks
11/10	Adversarial Robustness [Slides]
- Goodfellow, Shlens, and Szegedy, 2014. Explaining and harnessing adversarial examples.
- (recommended) Ian Goodfellow's NIPS tutorial
- Papernot et al., 2016. Practical black-box attacks against machine learning systems using adversarial examples
- Kos, Fischer, and Song, 2017. Adversarial examples for generative models
- Papernot et al., 2015. The limitations of deep learning in adversarial settings
- Bradshaw, Matthews, and Ghahramani, 2017. Adversarial Examples, Uncertainty, and Transfer Testing Robustness in Gaussian Process Hybrid Deep Networks
11/17	Optimization [Slides]	Martens and Grosse, 2015. Optimizing neural networks with Kronecker-factored approximate curvature.
11/24 and 12/1	Project Presentations
12/13 (changed from 12/10)	Project reports due	Send by e-mail to csc2541-submit at cs dot toronto dot edu. Include "CSC2541 Final Report" in subject line.

Software:

Here is some software you may find helpful for your projects:

deep learning frameworks
- TensorFlow (probably has the most relevant software for this course)
- PyTorch
- Theano
- Autograd (lightweight autodiff framework; easier to experiment with than the other frameworks, but CPU-based)
probabilistic programming languages
- Stan (the most widely used PPL; uses HMC, but somewhat of a black box)
- Edward (a PPL aimed at researchers, based on HMC and stochastic variational inference)
Gaussian processes
- GPy
- GPflow
Bayesian optimization
- Spearmint
- HPOLib
reinforcement learning
- OpenAI Gym
- OpenAI Baselines
adversarial robustness
- CleverHans
- Foolbox

University of Toronto, Fall 2017