Hi all,
This week at the Applied Statistics workshop we will be welcoming Kosuke Imai, a Professor in the Department of Politics and Center for Statistics and Machine Learning at Princeton University. He will be presenting work entitled "Using a Probabilistic Model to Assist Merging of Large-scale Administrative Records." Please find the abstract below and on the website.
We will meet in CGIS Knafel Room 354 at noon and lunch will be provided.
Best,
Pam
Title: Using a Probabilistic Model to Assist Merging of Large-scale Administrative Records
Abstract:
Since most social science research relies upon multiple data
sources, merging data sets is an essential part of the workflow for many
researchers. In many situations, however, a unique identifier that
unambiguously links data sets is unavailable and data sets may
contain missing and inaccurate information. As a result,
researchers can no longer combine data sets "by hand" without
sacrificing the quality of the resulting merged data set. This
problem is especially severe when merging large-scale administrative
records such as voter files. The existing algorithms to automate the
merging process do not scale, result in many fewer matches, and
require arbitrary decisions by researchers. To overcome this
challenge, we develop a fast algorithm to implement the canonical
probabilistic model of record linkage for merging large data sets.
Researchers can combine this model with a small amount of human
coding to produce a high-quality merged data set. The proposed
methodology can handle millions of observations and account for
missing data and auxiliary information. We conduct simulation
studies to show that our algorithm performs well in a variety of
practically relevant settings. Finally, we use our methodology to
merge the campaign contribution data (5 million records), the
Cooperative Congressional Election Study data (50 thousand records),
and the nationwide voter file (160 million records).
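For readers new to the topic: the "canonical probabilistic model of record linkage" is usually taken to mean the Fellegi-Sunter mixture model. That reading is an assumption here, since the abstract does not spell the model out. The sketch below is not the speaker's algorithm. It only shows the core posterior calculation for a single record pair, assuming fields agree or disagree independently given match status. The function name and all probabilities are made up for illustration.

```python
from math import prod

def match_probability(agree, m, u, prior):
    """Posterior probability that a record pair is a true match.

    agree: list of booleans, one per compared field (True = values agree)
    m[k]:  P(field k agrees | pair is a match)       -- hypothetical value
    u[k]:  P(field k agrees | pair is not a match)   -- hypothetical value
    prior: P(pair is a match) before comparing any field
    """
    # Likelihood of the observed agreement pattern under each hypothesis,
    # assuming conditional independence across fields.
    like_match = prod(mk if a else 1 - mk for a, mk in zip(agree, m))
    like_non = prod(uk if a else 1 - uk for a, uk in zip(agree, u))
    # Bayes' rule.
    num = prior * like_match
    return num / (num + (1 - prior) * like_non)

# Hypothetical fields: last name, birth year, zip code.
m = [0.95, 0.90, 0.85]
u = [0.01, 0.05, 0.10]
full = match_probability([True, True, True], m, u, prior=0.001)
partial = match_probability([True, False, False], m, u, prior=0.001)
print(full, partial)
```

In practice the m and u probabilities are estimated from the data, typically with an EM algorithm, rather than fixed by hand as they are here.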
Hi all,
This week at the Applied Statistics workshop we will be welcoming Elizabeth Stuart, a Professor and Associate Dean for Education at Johns Hopkins School of Public Health. She will be presenting work entitled "Estimating population effects: Assessing and enhancing the generalizability of randomized trials to target populations." Please find the abstract below and on the website.
We will meet in CGIS Knafel Room 354 at noon and lunch will be provided.
Best,
Pam
Title: Estimating population effects: Assessing and enhancing the generalizability of randomized trials to target populations
Abstract: With increasing attention being paid to the relevance of studies for real-world practice (such as in education, international development, and comparative effectiveness research), there is also growing interest in external validity and in assessing whether the results seen in randomized trials would hold in target populations. While randomized trials yield unbiased estimates of the effects of interventions in the sample of individuals (or physician practices or schools) in the trial, they do not necessarily tell us what the effects would be in some other, potentially somewhat different, population. While there has been increasing discussion of this limitation of traditional trials, relatively little statistical work has been done to develop methods to assess or enhance the external validity of randomized trial results. This talk will first provide empirical data on the potential size of external validity bias in education research. It will then discuss design and analysis methods for assessing and increasing external validity, as well as general issues that need to be considered when thinking about external validity. The primary analysis approach discussed will be a reweighting approach that equates the sample and target population on a set of observed characteristics. Underlying assumptions, performance in simulations, and limitations will be discussed. Implications for how future studies should be designed (and what data needs to be collected) in order to enhance the ability to assess generalizability will also be discussed.
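The reweighting idea in the abstract can be made concrete with a toy post-stratification sketch. It is a simplified stand-in, not the estimator from the talk: it assumes a single discrete covariate, known population shares, and treatment assignment balanced within each stratum. All names and numbers are hypothetical.

```python
import numpy as np

def reweighted_effect(stratum, treated, outcome, pop_share):
    """Difference in weighted means after reweighting the trial sample
    so that its stratum composition matches the target population.

    pop_share: dict mapping stratum label -> share in the target population
               (assumed known; hypothetical input).
    """
    stratum = np.asarray(stratum)
    treated = np.asarray(treated, dtype=bool)
    outcome = np.asarray(outcome, dtype=float)

    # Share of each stratum in the trial sample.
    labels, counts = np.unique(stratum, return_counts=True)
    sample_share = dict(zip(labels.tolist(), (counts / len(stratum)).tolist()))

    # Weight = population share / sample share for the unit's stratum.
    w = np.array([pop_share[s] / sample_share[s] for s in stratum.tolist()])

    mean_treated = np.average(outcome[treated], weights=w[treated])
    mean_control = np.average(outcome[~treated], weights=w[~treated])
    return mean_treated - mean_control

# Toy trial: stratum "A" is over-represented relative to the population.
stratum = ["A", "A", "A", "A", "B", "B"]
treated = [1, 1, 0, 0, 1, 0]
outcome = [3, 3, 1, 1, 5, 0]
effect = reweighted_effect(stratum, treated, outcome, {"A": 0.5, "B": 0.5})
print(effect)
```

With several or continuous covariates, the weights would more plausibly come from a model of trial participation than from cell counts. That extension is not shown here.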
Hi all,
This week at the Applied Statistics workshop we will be welcoming Paramveer Dhillon, a Postdoctoral Fellow at the MIT Sloan School of Management and the Initiative on Digital Economy at MIT. He will be presenting work entitled "Linear Methods for Big Data." Please find the abstract below and on the website.
We will meet in CGIS Knafel Room 354 at noon and lunch will be provided.
Best,
Pam
Title: Linear Methods for Big Data
Abstract: Statistical machine learning has seen great advances in the last decade owing to the availability of large-scale annotated datasets and significant improvements in computing hardware. Amidst this measurement revolution, it has become increasingly important to come up with statistical methods that are not only statistically efficient but also computationally efficient, i.e., that run fast. Drawing on these developments and recent advances in random matrix theory, I will present my work on building fast and theoretically sound methods for linear regression (OLS) and canonical correlation analysis (CCA). I will also describe how these methods can be used to generate linear features that give state-of-the-art performance on several natural language processing tasks.
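As a rough illustration of how random matrices can shrink a regression problem, here is a generic "sketch and solve" least-squares example. It is not the speaker's method: a dense Gaussian sketch is swapped in for simplicity, and the function name and sizes are assumptions. A dense Gaussian sketch does not by itself save time, since forming the product costs as much as the original solve. Practical speedups rely on structured or sparse random transforms.

```python
import numpy as np

def sketched_ols(X, y, sketch_rows, seed=0):
    """Approximate OLS by solving a randomly compressed problem.

    A Gaussian matrix S (sketch_rows x n) compresses the n observations;
    least squares is then solved on (S X, S y) instead of (X, y).
    """
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    S = rng.standard_normal((sketch_rows, n)) / np.sqrt(sketch_rows)
    beta, *_ = np.linalg.lstsq(S @ X, S @ y, rcond=None)
    return beta

# Hypothetical data: 5,000 observations, 3 predictors, small noise.
rng = np.random.default_rng(42)
X = rng.standard_normal((5000, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.standard_normal(5000)

beta_exact, *_ = np.linalg.lstsq(X, y, rcond=None)
beta_sketch = sketched_ols(X, y, sketch_rows=500)
print(beta_exact, beta_sketch)
```

The sketched estimate is noisier than the exact one. More sketch rows reduce that extra error but also reduce the savings from compressing the problem.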