Dear Applied Statistics Workshop Community,
Our next meeting of the semester will be on April 26 (12:00 ET). Dean Knox
and Guilherme Duarte will present "Optimal Allocation of Data-Collection
Resources."
<Where>
CGIS K354
Bagged lunches are available for pick-up at 11:45 (CGIS K354).
Zoom:
https://harvard.zoom.us/j/99181972207?pwd=Ykd3ZzVZRnZCSDZqNVpCSURCNnVvQT09
<Abstract>
Complications in applied work often prevent researchers from obtaining
unique point estimates of target quantities using cheaply available data—at
best, ranges of possibilities, or sharp bounds, can be reported. To make
progress, researchers frequently collect more information by (1)
re-cleaning existing datasets, (2) gathering secondary datasets, or (3)
pursuing entirely new designs. Common examples include manually correcting
missingness, recontacting attrited units, validating proxies with
ground-truth data, finding new instrumental variables, and conducting
follow-up experiments. These auxiliary tasks are costly, forcing tradeoffs
with (4) larger samples from the original approach. Researchers'
data-collection strategies, or choices over these tasks, are often based on
convenience or intuition. In this work, we show how to provably identify
the most cost-efficient data-collection strategy for a given research
problem.
We quantify the quality of existing data using the width of the confidence
regions on the sharp bounds, which captures two sources of uncertainty:
statistical uncertainty due to finite samples of the variables measured,
and fundamental uncertainty because some variables are not measured at all.
We then show how to compute the expected information gain, defined as the
expected amount by which each data-collection task will narrow these bounds
by addressing one or both sources of uncertainty. Finally, we select the
task with the greatest information efficiency, or gain per unit cost.
Leveraging recent advances in automatic bounding (Duarte et al., 2022), we
prove efficiency is computable for essentially any discrete causal system,
estimand, and auxiliary data task.
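As a rough illustration of the selection rule described above (pick the task
with the greatest expected information gain per unit cost), here is a minimal
Python sketch. It is not the authors' implementation: the Task class, the
task names, and all gain and cost numbers are hypothetical placeholders. In
the actual method, the expected gains would come from the automatic bounding
procedure rather than being typed in by hand.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    """A candidate data-collection task (hypothetical structure)."""
    name: str
    expected_gain: float  # expected narrowing of the bounds (made-up value)
    cost: float           # total cost of the task (made-up units)


def efficiency(task: Task) -> float:
    """Information efficiency: expected gain per unit cost."""
    if task.cost <= 0:
        raise ValueError("cost must be positive")
    return task.expected_gain / task.cost


def most_efficient(tasks: list[Task]) -> Task:
    """Return the task with the greatest gain per unit cost."""
    return max(tasks, key=efficiency)


# Illustrative numbers only; none of these come from the talk.
tasks = [
    Task("recontact attrited units", expected_gain=0.12, cost=400.0),
    Task("validate proxy with ground truth", expected_gain=0.20, cost=1000.0),
    Task("enlarge original sample", expected_gain=0.05, cost=250.0),
]

best = most_efficient(tasks)
print(best.name, efficiency(best))
```

Note that in this toy example the task with the largest raw gain (validating
the proxy) is not the one selected, because its cost is proportionally higher.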
Based on this theoretical framework, we develop a method for optimal
adaptive allocation of data-collection resources. Users first input a
causal graph, estimand, and past data. They then enumerate distributions
from which future samples can be drawn, fixed and per-sample costs, and any
prior beliefs. Our method automatically derives and sequentially updates
the optimal data-collection strategy.
<2022-2023 Schedule>
GOV 3009 Website:
https://projects.iq.harvard.edu/applied.stats.workshop-gov3009
Calendar:
https://calendar.google.com/calendar/embed?src=c_3v93pav9fjkkldrbu9snbhned8…
Best,
Shusei