Hi everyone!

Our speaker this Wednesday (9/10) at Applied Stats will be Ryan Adams, an Assistant Professor of Computer Science at SEAS. His research focuses on machine learning and computational statistics, but he is broadly interested in questions related to artificial intelligence, computational neuroscience, machine vision, and Bayesian nonparametrics.

Ryan will be giving a talk entitled "Accelerating Exact MCMC with Subsets of Data." The abstract for the talk is included below. As usual, we will meet in CGIS K354 at noon, and lunch will be served.

I look forward to seeing you all there! Also, check out the new website (here) to see the schedule for the first few weeks. Thank you!

-- Dana Higgins



Abstract: 

One of the challenges of building statistical models for large data sets is balancing the correctness of inference procedures against computational realities. In the context of Bayesian procedures, the pain of such computations has been particularly acute, as it has appeared that algorithms such as Markov chain Monte Carlo must touch all of the data at each iteration in order to arrive at a correct answer. Several recent proposals have been made to use subsets (or "minibatches") of data to perform MCMC in ways analogous to stochastic gradient descent. Unfortunately, these proposals have only provided approximations, although in some cases it has been possible to bound the error of the resulting stationary distribution.
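
For intuition on why standard MCMC touches every data point, here is a minimal, purely illustrative random-walk Metropolis-Hastings step in Python; the names `log_lik` and `log_prior` are hypothetical placeholders, not code from the talk:

    import numpy as np

    def mh_step(theta, data, log_lik, log_prior, step, rng):
        # Propose a random-walk move.
        proposal = theta + step * rng.standard_normal(theta.shape)
        # The acceptance ratio sums the log-likelihood over *all* of the
        # data -- this full pass per iteration is the bottleneck that
        # minibatch MCMC proposals try to avoid.
        log_alpha = (log_prior(proposal) - log_prior(theta)
                     + sum(log_lik(proposal, x) for x in data)
                     - sum(log_lik(theta, x) for x in data))
        return proposal if np.log(rng.uniform()) < log_alpha else theta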

In this talk I will discuss two new, complementary algorithms for using subsets of data to perform faster MCMC. In both cases, these procedures yield stationary distributions that are exactly the desired target posterior distribution. The first of these, "Firefly Monte Carlo", is an auxiliary variable method that uses randomized subsets of data to achieve valid transition operators, with connections to recent developments in pseudo-marginal MCMC. The second approach I will discuss, parallel predictive prefetching, uses subsets of data to parallelize Markov chain Monte Carlo across multiple cores, while still leaving the target distribution intact. Both methods have yielded significant gains in wall-clock performance when sampling from posteriors over millions of data points.
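
For the curious, here is a rough sketch of the auxiliary-variable idea behind Firefly Monte Carlo, based only on the abstract above; `lik` and `bound` are hypothetical placeholders (assuming 0 < bound < lik pointwise), and the real algorithm uses structured lower bounds so that "dim" points can be handled in aggregate rather than evaluated one by one:

    import numpy as np

    def resample_brightness(theta, data, lik, bound, rng):
        # Each auxiliary "brightness" variable z_n is Bernoulli with
        # probability (L_n - B_n) / L_n, where B_n(theta) is a cheap
        # lower bound on the per-point likelihood L_n(theta).
        # Marginalizing out z recovers the exact posterior, so the
        # chain remains valid.  (This sketch refreshes every z_n for
        # clarity; in practice only a random subset is updated.)
        L = np.array([lik(theta, x) for x in data])
        B = np.array([bound(theta, x) for x in data])
        return rng.uniform(size=len(data)) < (L - B) / L

    def log_joint(theta, z, data, lik, bound, log_prior):
        # Only "bright" points (z_n = 1) require the exact likelihood;
        # "dim" points contribute just the bound, which for suitable
        # bound families can be accumulated cheaply in closed form.
        out = log_prior(theta)
        for z_n, x in zip(z, data):
            out += (np.log(lik(theta, x) - bound(theta, x)) if z_n
                    else np.log(bound(theta, x)))
        return out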