Hi everyone!
Our speaker this Wednesday (9/10) at Applied Stats will be Ryan Adams, an
Assistant Professor of Computer Science at SEAS. His research focuses on
machine learning and computational statistics, but he is broadly interested
in questions related to artificial intelligence, computational
neuroscience, machine vision, and Bayesian nonparametrics.
Ryan will be giving a talk entitled *Accelerating Exact MCMC with Subsets
of Data*. The abstract for the talk is included below. As usual, we will
meet in CGIS K354 at 12 noon, and lunch will be served.
I look forward to seeing you all there! Also, check out the new website (
here <http://projects.iq.harvard.edu/applied.stats.workshop-gov3009/>) to
see the schedule for the first few weeks. Thank you!
-- Dana Higgins
Abstract:
One of the challenges of building statistical models for large data sets is
balancing the correctness of inference procedures against computational
realities. For Bayesian procedures the pain of such computations has been
particularly acute, as it has appeared that algorithms such as Markov chain
Monte Carlo must touch all of the data at each iteration to arrive at a
correct answer. Several recent
proposals have been made to use subsets (or "minibatches") of data to
perform MCMC in ways analogous to stochastic gradient descent.
Unfortunately, these proposals have only provided approximations, although
in some cases it has been possible to bound the error of the resulting
stationary distribution.
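The minibatch idea described above can be illustrated with a minimal sketch of one such approximate scheme, stochastic gradient Langevin dynamics (SGLD), on a toy Gaussian model. The model, data, step size, and batch size below are all illustrative assumptions, not details from the talk; the point is only that each update touches a small random subset of the data, at the cost of an approximate stationary distribution.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy model: x_i ~ N(theta, 1) with a N(0, 10^2) prior on theta.
theta_true = 2.0
N = 10_000
x = rng.normal(theta_true, 1.0, size=N)

def sgld(theta0, n_iter=3000, batch=100, eps=1e-5):
    """Stochastic gradient Langevin dynamics on the toy posterior."""
    theta = theta0
    samples = []
    for _ in range(n_iter):
        # Each iteration sees only `batch` of the N points.
        xb = x[rng.integers(0, N, size=batch)]
        # Minibatch estimate of the log-posterior gradient, rescaled by N/batch.
        grad = -theta / 100.0 + (N / batch) * np.sum(xb - theta)
        # Langevin update: half-step along the noisy gradient plus injected
        # Gaussian noise of variance eps.  The minibatch noise perturbs the
        # stationary distribution, so the chain is only approximately correct.
        theta = theta + 0.5 * eps * grad + rng.normal(0.0, np.sqrt(eps))
        samples.append(theta)
    return np.array(samples)

samples = sgld(0.0)
```

After burn-in the chain hovers near the posterior mean, but its spread reflects both the injected noise and the minibatch gradient noise, which is exactly the kind of approximation error the abstract alludes to.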
In this talk I will discuss two new, complementary algorithms for using
subsets of data to perform faster MCMC. In both cases, these procedures
yield stationary distributions that are exactly the desired target
posterior distribution. The first of these, "Firefly Monte Carlo", is an
auxiliary variable method that uses randomized subsets of data to achieve
valid transition operators, with connections to recent developments in
pseudo-marginal MCMC. The second approach I will discuss, parallel
predictive prefetching, uses subsets of data to parallelize Markov chain
Monte Carlo across multiple cores, while still leaving the target
distribution intact. Both methods have yielded significant gains in
wall-clock performance when sampling from posterior distributions with
millions of data points.
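The speculative idea behind prefetching can be sketched in a few lines: pre-draw the proposal randomness so that every state reachable within the next few accept/reject decisions is known in advance, evaluate all of those speculative posterior calls in parallel, and then walk the resulting binary tree with ordinary Metropolis decisions. The standard-normal target, proposal scale, and thread pool below are illustrative assumptions; the talk's actual prediction scheme and its use of data subsets are not reproduced here.

```python
import math
import random
from concurrent.futures import ThreadPoolExecutor

def log_post(theta):
    # Stand-in for an expensive full-data log posterior (standard normal here).
    return -0.5 * theta * theta

def prefetch_mh(theta0, depth=3, steps=30, scale=1.0, seed=0):
    """Metropolis sampling with speculative evaluation of future proposals."""
    rng = random.Random(seed)
    samples = [theta0]
    theta, lp = theta0, log_post(theta0)
    while len(samples) <= steps:
        # Lay out a depth-`depth` binary tree of possible futures in heap
        # order: node 1 is the current state; the left child of node i is
        # "proposal accepted", the right child is "proposal rejected".
        size = 2 ** (depth + 1)
        state, prop = [None] * size, [None] * size
        state[1] = theta
        for i in range(1, 2 ** depth):
            prop[i] = state[i] + rng.gauss(0.0, scale)  # pre-drawn proposal
            state[2 * i] = prop[i]        # accepted: move to the proposal
            state[2 * i + 1] = state[i]   # rejected: stay, propose afresh
        # Evaluate every speculative proposal in parallel (the costly part).
        with ThreadPoolExecutor() as ex:
            lps = list(ex.map(log_post, (prop[i] for i in range(1, 2 ** depth))))
        # Walk the tree with ordinary Metropolis accept/reject decisions,
        # consuming the precomputed log posteriors along the chosen path.
        i, cur_lp = 1, lp
        while i < 2 ** depth and len(samples) <= steps:
            prop_lp = lps[i - 1]
            if math.log(rng.random()) < prop_lp - cur_lp:
                i, cur_lp = 2 * i, prop_lp
            else:
                i = 2 * i + 1
            samples.append(state[i])
        theta, lp = state[i], cur_lp
    return samples[:steps + 1]
```

Because the accept/reject decisions are made exactly as in serial Metropolis, the chain targets the same distribution; the parallelism only hides the latency of the posterior evaluations, up to `depth` steps per parallel round.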