Dear Applied Statistics Workshop Community,

Our next meeting of the semester will be on April 26 (12:00 EST). Dean Knox and Guilherme Duarte will present "Optimal Allocation of Data-Collection Resources."

<Where>
CGIS K354
Bagged lunches are available for pick-up at 11:45 (CGIS K354).
Zoom: https://harvard.zoom.us/j/99181972207?pwd=Ykd3ZzVZRnZCSDZqNVpCSURCNnVvQT09

<Abstract>
Complications in applied work often prevent researchers from obtaining unique point estimates of target quantities using cheaply available data—at best, ranges of possibilities, or sharp bounds, can be reported. To make progress, researchers frequently collect more information by (1) re-cleaning existing datasets, (2) gathering secondary datasets, or (3) pursuing entirely new designs. Common examples include manually correcting missingness, recontacting attrited units, validating proxies with ground-truth data, finding new instrumental variables, and conducting follow-up experiments. These auxiliary tasks are costly, forcing tradeoffs with (4) larger samples from the original approach. Researchers' data-collection strategies, or choices over these tasks, are often based on convenience or intuition. In this work, we show how to provably identify the most cost-efficient data-collection strategy for a given research problem.

We quantify the quality of existing data using the width of the confidence regions on the sharp bounds, which captures two sources of uncertainty: statistical uncertainty due to finite samples of the variables measured, and fundamental uncertainty because some variables are not measured at all. We then show how to compute the expected information gain, defined as the expected amount by which each data-collection task will narrow these bounds by addressing one or both sources of uncertainty. Finally, we select the task with the greatest information efficiency, or gain per unit cost. Leveraging recent advances in automatic bounding (Duarte et al., 2022), we prove efficiency is computable for essentially any discrete causal system, estimand, and auxiliary data task.

Based on this theoretical framework, we develop a method for optimal adaptive allocation of data-collection resources. Users first input a causal graph, estimand, and past data. They then enumerate distributions from which future samples can be drawn, fixed and per-sample costs, and any prior beliefs. Our method automatically derives and sequentially updates the optimal data-collection strategy.

<2022-2023 Schedule>
GOV 3009 Website: https://projects.iq.harvard.edu/applied.stats.workshop-gov3009
Calendar: https://calendar.google.com/calendar/embed?src=c_3v93pav9fjkkldrbu9snbhned8%40group.calendar.google.com&ctz=America%2FNew_York

Best,
Shusei