Probabilistic Data Linking in Public Health Research

by Gulzar H. Shah, Ph.D.

Introduction

In recent years, a compelling need for comprehensive information sources on access to and quality of health care has emerged, driven by escalating health care costs and growing concern over the quality of care. Health care researchers can rely on generalized record linking technology to bridge some of this information gap. For this reason, linking or matching two or more databases containing records believed to relate to a common group of individuals, families, businesses, addresses, events, objects, or other such entities is gaining popularity in health data research and management. Linked databases are a valuable tool, yielding more comprehensive information than any single database alone. Data linkage creates composite databases, enabling researchers to examine new relationships and test new hypotheses in health policy and medical performance research and evaluation without having to collect new data sets. Data linking, or matching, means taking separate data files containing information on individual cases, identifying which records in the files pertain to the same individuals on the basis of one or more variables or fields (such as name, date of birth, or social security number, also called identifiers), and merging those records.

Deterministic vs. Probabilistic Data Linkage

If the databases to be linked share a common unique identifier (e.g., social security number) that is free of errors and omissions, they can be linked easily. Such a linkage strategy, which matches exactly on a unique identifier or a fixed criterion, is termed "deterministic" linking. Most public health databases cannot be linked using the deterministic approach. Even when common unique variables are present, they tend to suffer from missing data and inconsistent variable contents and formats across databases.
For instance, the same individuals may be identified differently in two data files because of spelling errors in names, clerical errors in data entry, inconsistent abbreviations (e.g., of a city name), or inconsistent use of pseudonyms and diminutives of given names. When common identifiers are not unique, linking databases is extremely difficult, if not impossible, without the use of a sophisticated linking technique known as "probabilistic" linking. Probabilistic linkage is a data linkage method that allows for a certain expected variation in some criteria and is based on a calculated uncertainty in the match. Automated probabilistic linkage of records is accomplished by specialized software, such as AutoMatch or GIRLS (Jaro, 1995; Nanan and White, 1997). Probabilistic linkage involves assigning a weight to each pair of records based on comparisons against predefined criteria and rules. For each comparison, a positive weight is assigned if the two values of the comparison variable agree or partially agree, and a negative weight is assigned if they differ substantially. These comparison weights are summed for each record pair, and a pair (or set) of records is classified as a "match" if the total weight is above a predefined threshold. Record pairs with weights between the match threshold and a predefined "cut-off" value are subjected to clerical review and are manually classified as matches or non-matches by inspecting the values of the variables being compared. If the comparison weight falls below the "cut-off," the pair is rejected as a match and placed in a pool of residuals that are possible links in the next pass. Several passes using different combinations of blocking and matching variables are performed in order to extract as many links as possible. A detailed description of the methodology behind probabilistic linkage is beyond the scope of this brief article.
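As a rough illustration of the weighting scheme just described, the following Python sketch sums per-field agreement and disagreement weights and applies the two thresholds. The field names, m- and u-probabilities, and threshold values are hypothetical, chosen only for demonstration; in practice they would be estimated from the data.

```python
import math

# Hypothetical m- and u-probabilities for each comparison field:
# m = P(field agrees | records are a true match),
# u = P(field agrees | records are not a match).
FIELDS = {
    "last_name":  {"m": 0.95, "u": 0.01},
    "birth_date": {"m": 0.90, "u": 0.005},
    "zip_code":   {"m": 0.85, "u": 0.05},
}

def field_weight(field, agree):
    """Positive weight log2(m/u) on agreement; negative weight
    log2((1-m)/(1-u)) on disagreement."""
    m, u = FIELDS[field]["m"], FIELDS[field]["u"]
    if agree:
        return math.log2(m / u)
    return math.log2((1 - m) / (1 - u))

def composite_weight(rec_a, rec_b):
    """Sum the comparison weights over all fields for one record pair."""
    return sum(field_weight(f, rec_a.get(f) == rec_b.get(f)) for f in FIELDS)

def classify(weight, match_threshold=10.0, cutoff=0.0):
    """Match above the threshold, clerical review between cut-off and
    threshold, non-match (residual for the next pass) below the cut-off."""
    if weight >= match_threshold:
        return "match"
    if weight >= cutoff:
        return "clerical review"
    return "non-match"
```

A real implementation would also handle partial agreement (e.g., approximate string comparison on names) and restrict comparisons to blocks of records sharing a blocking variable, rather than comparing all pairs.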
Jaro (1995) has provided a detailed account of the methodology and steps involved in data linkage.

History

Data linkage is not a new concept, though it has gained momentum in recent years. The work of Newcombe and Kennedy (1959) is considered pioneering in computerized probabilistic data linking. Newcombe and Kennedy (1962) offered a conceptual foundation for weights (the degree of certainty of a match) based on the probabilities of chance agreement between the pairs of records being compared. Expanding on Newcombe and Kennedy's work, Fellegi and Sunter (1969) formalized the mathematical concepts and laid out the widely accepted theory of record linkage. They outlined a detailed protocol for optimal decision making based on cutoff thresholds (weights) computed from acceptable u and m probabilities. A wealth of information on the development and use of linked data systems emerged through proceedings of the "Symposium on Quantitative Methods for Utilization of Multi-Source Data in Public Health," published in the March 15 - April 15, 1995 issue of Statistics in Medicine.
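For readers unfamiliar with the u and m probabilities mentioned above, the standard Fellegi-Sunter weights for a comparison field $i$ can be sketched as follows (base-2 logarithms are a common convention, not the only one):

```latex
w_i^{\text{agree}} = \log_2 \frac{m_i}{u_i},
\qquad
w_i^{\text{disagree}} = \log_2 \frac{1 - m_i}{1 - u_i},
```

where $m_i$ is the probability that field $i$ agrees given that the record pair is a true match, and $u_i$ is the probability that it agrees by chance given a non-match. Because $m_i > u_i$ for an informative field, agreement contributes a positive weight and disagreement a negative one, and the pair's total weight is the sum of the $w_i$ over all compared fields.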