B1380
Title: A Bayesian approach to streaming multi-file record linkage
Authors: Ian Taylor - Colorado State University (United States) [presenting]
Andee Kaplan - Colorado State University (United States)
Brenda Betancourt - NORC at the University of Chicago (United States)
Abstract: Record linkage is the task of combining records from multiple files that refer to overlapping sets of entities when the records contain no unique identifying field. In streaming record linkage, files arrive sequentially over time, and estimates of links are desired after the arrival of each file. This problem arises in settings such as longitudinal surveys. The challenge in streaming record linkage is efficiently updating parameter estimates as new files arrive. We approach the problem from a Bayesian perspective, with estimates in the form of posterior samples of parameters, and present a method for updating link estimates after the arrival of a new file that is faster than rerunning MCMC from scratch. We generalize a two-file Bayesian Fellegi-Sunter model to multiple files and apply sequential Markov chain Monte Carlo for streaming sample updates. We examine the effect of the prior distribution and the strength of the prior information on the resulting estimates. We apply this method to simulated data and to data from the Social Diagnosis Survey of Polish households.