Dear all,
We hope you can join us this Wednesday, November 7, 2012, for the Applied
Statistics Workshop from 12-1.30 pm. Jon Bischof, a Ph.D. candidate from
the Department of Statistics at Harvard University, will give a
presentation entitled "Summarizing Topical Content in Document Collections
with Word Frequency and Exclusivity". A light lunch will be served at 12 pm
and the talk will begin at 12.15.
*Abstract:*
An ongoing challenge in the analysis of document collections is how to
summarize content in terms of a set of inferred themes that can be
interpreted substantively in terms of topics. However, the current practice
of summarizing themes in terms of most frequent words limits
interpretability by ignoring the differential use of words across topics.
We argue that words that are both frequent and exclusive to a theme are
more effective at characterizing topical content. We consider a setting
where professional editors have annotated documents to a collection of
topic categories, organized into a tree, in which leaf-nodes correspond to
the most specific topics. Each document is annotated to multiple
categories, at different levels of the tree. We introduce Hierarchical
Poisson Convolution (HPC) as a model to analyze annotated documents in this
setting. The model leverages the structure among categories defined by
professional editors to infer a clear semantic description for each topic
in terms of words that are both frequent and exclusive. We develop a
parallelized Hamiltonian Monte Carlo sampler that allows the inference to
scale to millions of documents.
An up-to-date schedule for the workshop is available at
http://www.iq.harvard.edu/events/node/1208.
Best,
Konstantin
--
Konstantin Kashin
Ph.D. Candidate in Government
Harvard University
Mobile: 978-844-0538
E-mail: kkashin(a)fas.harvard.edu
Site: http://www.konstantinkashin.com/