Comparing record linkage software programs and algorithms using real-world data

Authors: Alan F. Karr aff001; Matthew T. Taylor aff002; Suzanne L. West aff001; Soko Setoguchi aff002; Tzuyung D. Kou aff004; Tobias Gerhard aff002; Daniel B. Horton aff002
Author affiliations: RTI International, Research Triangle Park, NC, United States of America aff001; Center for Pharmacoepidemiology and Treatment Science, Institute for Health, Health Care Policy and Aging Research, Rutgers University, New Brunswick, NJ, United States of America aff002; Sidney Kimmel Medical College, Thomas Jefferson University, Philadelphia, PA, United States of America aff003; Bristol-Myers Squibb, Hopewell, NJ, United States of America aff004
Published in: PLoS ONE 14(9)
Category: Research Article
doi: 10.1371/journal.pone.0221459


Linkage of medical databases, including insurer claims and electronic health records (EHRs), is increasingly common. However, few studies have investigated the behavior and output of linkage software. To determine how linkage quality is affected by different algorithms, blocking variables, methods for string matching and weight determination, and decision rules, we compared the performance of 4 nonproprietary linkage software packages linking patient identifiers from noninteroperable inpatient and outpatient EHRs. We linked datasets using first and last name, gender, and date of birth (DOB). We evaluated DOB and year of birth (YOB) as blocking variables and used exact and inexact matching methods. We compared the weights assigned to record pairs and evaluated how matching weights corresponded to a gold standard, medical record number. Deduplicated datasets contained 69,523 inpatient and 176,154 outpatient records, respectively. Linkage runs blocking on DOB produced weights ranging in number from 8 for exact matching to 64,273 for inexact matching. Linkage runs blocking on YOB produced 8 to 916,806 weights. Exact matching matched record pairs with identical test characteristics (sensitivity 90.48%, specificity 99.78%) for the highest ranked group, but algorithms differentially prioritized certain variables. Inexact matching behaved more variably, leading to dramatic differences in sensitivity (range 0.04–93.36%) and positive predictive value (PPV) (range 86.67–97.35%), even for the most highly ranked record pairs. Blocking on DOB led to higher PPV of highly ranked record pairs. An ensemble approach based on averaging scaled matching weights led to modestly improved accuracy. In summary, we found few differences in the rankings of record pairs with the highest matching weights across 4 linkage packages. Performance was more consistent for exact string matching than for inexact string matching. 
Most methods and software packages performed similarly when comparing matching accuracy with the gold standard. In some settings, an ensemble matching approach may outperform individual linkage algorithms.
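The ensemble idea mentioned above (averaging scaled matching weights across packages, then comparing the result with a gold standard) can be illustrated with a minimal sketch. The details here are assumptions rather than the authors' procedure: min-max scaling to [0, 1] is one possible choice of scaling, a record pair missing from a package's output is treated as having scaled weight 0, and the function names, toy weights, and threshold are hypothetical.

```python
from statistics import mean


def min_max_scale(weights):
    """Rescale one package's matching weights to [0, 1] (assumed scaling)."""
    lo, hi = min(weights.values()), max(weights.values())
    span = hi - lo
    return {pair: (w - lo) / span if span else 0.0 for pair, w in weights.items()}


def ensemble_weights(per_package):
    """Average scaled weights across packages.

    Assumption: a pair a package did not score gets scaled weight 0.
    """
    scaled = [min_max_scale(w) for w in per_package]
    pairs = set().union(*scaled)
    return {p: mean(s.get(p, 0.0) for s in scaled) for p in pairs}


def evaluate(weights, true_matches, threshold):
    """Sensitivity and PPV when pairs with weight >= threshold are declared links."""
    predicted = {p for p, w in weights.items() if w >= threshold}
    tp = len(predicted & true_matches)
    sensitivity = tp / len(true_matches) if true_matches else 0.0
    ppv = tp / len(predicted) if predicted else 0.0
    return sensitivity, ppv


# Toy data (hypothetical): keys are (inpatient_id, outpatient_id) pairs.
p1, p2, p3 = ("i1", "o1"), ("i2", "o2"), ("i3", "o9")
gold = {p1, p2}  # true matches, e.g. pairs sharing a medical record number

package_a = {p1: 20.0, p2: 8.0, p3: 12.0}  # ranks the false pair p3 above p2
package_b = {p1: 8.0, p2: 6.0, p3: 2.0}    # ranks both true pairs above p3

ensemble = ensemble_weights([package_a, package_b])
alone = evaluate(min_max_scale(package_a), gold, threshold=0.3)
combined = evaluate(ensemble, gold, threshold=0.3)
print("package A alone:", alone)
print("ensemble:", combined)
```

In this toy case the second package's ranking offsets the first package's misranked pair, which is the kind of modest gain the abstract describes. Real weight distributions may be far less accommodating.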

Keywords:

Algorithms – Computer software – Engines – Inpatients – Outpatients – Software tools – Ensemble methods



Article published in: 2019, Issue 9
