Hi everyone!
This week at the Applied Statistics Workshop we will be welcoming *Isaac
Kohane*, Professor of Pediatrics at Harvard Medical School. He will be
presenting work entitled *Interesting early results in data science that
require new statistical methods*. Please find the abstract below and on
the Applied Stats website here
<https://projects.iq.harvard.edu/applied.stats.workshop-gov3009>.
As usual, we will meet at noon in CGIS Knafel Room 354 and lunch will be
provided. See you all there!
-- Dana Higgins
*Title:* *Interesting early results in data science that require new
statistical methods*
*Abstract:* In the brave new world of biomedical data science, new
sources of data emerge seemingly every year, from Twitter to genomes to
weather to drug habits and/or doctor preferences. I will outline several
interesting and apparently impactful findings that have emerged as a result
of analyses of these data, both individually and jointly. I will then
follow the discussion of these early successes with an outline of
significant unanswered methodological challenges requiring a systematic and
sound response if further progress is to be achieved. In particular, I will
focus on those challenges which I believe are of the most interest to the
biostatistical community.
Hi everyone!
This week at the Applied Statistics Workshop we will be welcoming *Emily
Breza*, Assistant Professor of Economics at Harvard University. She will be
presenting work entitled *Using Aggregated Relational Data to Feasibly
Identify Network Structure without Network Data*. Please find the abstract
below and on the Applied Stats website here
<https://projects.iq.harvard.edu/applied.stats.workshop-gov3009>.
As usual, we will meet at noon in CGIS Knafel Room 354 and lunch will be
provided. See you all there!
-- Dana Higgins
*Title:* *Using Aggregated Relational Data to Feasibly Identify Network
Structure without Network Data*
*Abstract:* Social network data is often prohibitively expensive to
collect, limiting empirical network research. Typical economic network
mapping requires (1) enumerating a census, (2) eliciting the names of all
network links for each individual, (3) matching the list of
social connections to the census, and (4) repeating (1)-(3) across many
networks. In settings requiring field surveys, steps (2)-(3) can be very
expensive. In other network populations such as financial intermediaries or
high-risk groups, proprietary data and privacy concerns may render (2)-(3)
impossible. Both restrict the accessibility of high-quality networks
research to investigators with considerable resources.
We propose an inexpensive and feasible strategy for network elicitation
using Aggregated Relational Data (ARD) – responses to questions of the form
“How many of your social connections have trait k?” Our method uses ARD to
recover the parameters of a general network formation model, which in turn
permits the estimation of any node- or graph-level statistic. The
method works well in simulations and in matching a range of network
characteristics in real-world graphs from 75 Indian villages. Moreover, we
replicate the results of two field experiments that involved collecting
network data. We show that the researchers would have drawn similar
conclusions using ARD alone. Finally, using calculations from J-PAL
fieldwork, we show that in rural India, for example, ARD surveys are 80%
cheaper than full network surveys.
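To make the ARD question format concrete, here is a minimal sketch of what
the survey responses would look like if the full network were known. It
only illustrates the data format; it is not the estimation method from the
talk. The toy graph, the trait names, and the function name `ard_responses`
are all assumptions made up for this example.

```python
def ard_responses(adj, traits):
    """Answer "How many of your social connections have trait k?" for every node.

    adj    : dict mapping each node to the set of its neighbors (hypothetical toy graph)
    traits : dict mapping each trait name to the set of nodes that have it
    """
    return {
        node: {
            trait: sum(1 for nbr in neighbors if nbr in members)
            for trait, members in traits.items()
        }
        for node, neighbors in adj.items()
    }


# Toy example: node 0 is connected to nodes 1 and 2.
toy_adj = {0: {1, 2}, 1: {0}, 2: {0}}
toy_traits = {"owns_phone": {1, 2}, "is_farmer": {0}}

responses = ard_responses(toy_adj, toy_traits)
# Node 0 reports two connections who own a phone and none who farm.
assert responses[0] == {"owns_phone": 2, "is_farmer": 0}
```

In an ARD survey, only counts like `responses` are collected; the adjacency
structure `toy_adj` is never observed and has to be inferred from a model.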
Hi everyone!
This week at the Applied Statistics Workshop we will be welcoming *Lucas
Janson*, Assistant Professor of Statistics at Harvard University. He will
be presenting work entitled *Using Knockoffs to Find Important Variables
with Statistical Guarantees*. Please find the abstract below and on the
Applied Stats website here
<https://projects.iq.harvard.edu/applied.stats.workshop-gov3009>.
As usual, we will meet at noon in CGIS Knafel Room 354 and lunch will be
provided. See you all there!
-- Dana Higgins
*Title:* *Using Knockoffs to Find Important Variables with Statistical
Guarantees*
*Abstract:* Many contemporary large-scale applications, from genomics to
advertising, involve linking a response of interest to a large set of
potential explanatory variables in a nonlinear fashion, such as when the
response is binary. Although this modeling problem has been extensively
studied, it remains unclear how to effectively select important variables
while controlling the fraction of false discoveries, even in
high-dimensional logistic regression, not to mention general
high-dimensional nonlinear models. To address such a practical problem, we
propose a new framework of model-X knockoffs, which reads from a different
perspective the knockoff procedure (Barber and Candès, 2015) originally
designed for controlling the false discovery rate in linear models. Model-X
knockoffs can deal with arbitrary (and unknown) conditional models and any
dimensions, including when the number of explanatory variables p exceeds
the sample size n. Our approach requires the design matrix be random
(independent and identically distributed rows) with a known distribution
for the explanatory variables, although we show preliminary evidence that
our procedure is robust to unknown/estimated distributions. As we require
no knowledge/assumptions about the conditional distribution of the
response, we effectively shift the burden of knowledge from the response to
the explanatory variables, in contrast to the canonical model-based
approach which assumes a parametric model for the response but very little
about the explanatory variables. To our knowledge, no other procedure
solves the controlled variable selection problem in such generality, but in
the restricted settings where competitors exist, we demonstrate the
superior power of knockoffs through simulations. Finally, we apply our
procedure to data from a case-control study of Crohn’s disease in the
United Kingdom, making twice as many discoveries as the original analysis
of the same data.
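For readers new to knockoffs, the selection step can be sketched as
follows, assuming the data-dependent threshold of the knockoff+ filter
described in Barber and Candès (2015). The sketch takes the feature
statistics W_j as given: building the knockoff variables and computing W
is the hard part and is not shown. The function names and the example
values of W are assumptions for illustration only.

```python
def knockoff_plus_threshold(W, q):
    """Smallest t > 0 with (1 + #{j: W_j <= -t}) / max(1, #{j: W_j >= t}) <= q.

    W : list of knockoff feature statistics (large positive = evidence for the variable)
    q : target false discovery rate
    Returns float("inf") if no threshold satisfies the bound.
    """
    candidates = sorted({abs(w) for w in W if w != 0})
    for t in candidates:
        negatives = sum(1 for w in W if w <= -t)
        positives = sum(1 for w in W if w >= t)
        if (1 + negatives) / max(1, positives) <= q:
            return t
    return float("inf")


def knockoff_select(W, q):
    """Indices of the variables whose statistic meets the knockoff+ threshold."""
    t = knockoff_plus_threshold(W, q)
    return [j for j, w in enumerate(W) if w >= t]


# Made-up statistics for ten variables, with a target FDR of 20%.
W_example = [5.0, 4.0, 3.0, 2.5, -1.0, 2.0, 0.5, -0.3, 1.5, 3.5]
selected = knockoff_select(W_example, q=0.2)
assert selected == [0, 1, 2, 3, 5, 8, 9]
```

The count of large negative statistics serves as an estimate of the number
of false positives among the large positive ones, which is why the ratio
in the threshold rule acts as an estimated false discovery proportion.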
Hi everyone!
This week at the Applied Statistics Workshop we will be welcoming *Jose
Zubizarreta*, Assistant Professor of Health Care Policy at Harvard Medical
School. He will be presenting work entitled *Building Representative
Matched Samples in Large-Scale Observational Studies with Multivalued
Treatments*. Please find the abstract below and on the Applied Stats
website here
<https://projects.iq.harvard.edu/applied.stats.workshop-gov3009>.
As usual, we will meet at noon in CGIS Knafel Room 354 and lunch will be
provided. See you all there!
-- Dana Higgins
*Title:* *Building Representative Matched Samples in Large-Scale
Observational Studies with Multivalued Treatments*
*Abstract:* In observational studies of causal effects, matching methods
are widely used to approximate the ideal study that would be conducted
under controlled experimentation. In this talk, I will discuss new matching
methods that use tools from modern optimization to overcome four
limitations of standard matching approaches. In particular, these new
matching methods (i) directly obtain flexible forms of covariate balance,
as specified before matching by the investigator; (ii) produce
self-weighting matched samples that are representative of target
populations by design; (iii) handle multiple treatment doses without
resorting to a generalization of the propensity score; and (iv) handle
large data sets quickly. I will illustrate the performance of these
methods in a case study of the impact of an earthquake on post-traumatic
stress and standardized test scores.