Patient Matching Project


Mentor: Shaun Grannis Contributor: James Egg Intern: Sarp Centel


Introduction

Record linkage is the task of identifying scattered pieces of information that refer to the same entity. Patient matching is a specific application: identifying records that belong to the same patient across different data sources. These sources range from patient data collected at different hospitals to external information from governmental institutions, such as the Death Master File.

One of the interesting and challenging aspects of this project is dealing with erroneous data, for instance when a name is misspelled or a birth date is entered incorrectly. Such errors are common in practice, and we can account for them by using flexible distance metrics and statistical models.

Why, then, is record linkage important, and what are its benefits?

We are living in an exciting period of globalization, in which computers and the internet make worldwide collaboration both easy and necessary. Patient linkage and data aggregation techniques will allow medical institutions to store their own data and, at the same time, work together with others to offer better treatment to patients.

For instance, patients often forget their test results at home, and old tests eventually get lost. Imagine that all your medical records are stored in digital format, and that when you go to Hospital A, a doctor there can examine a tomogram taken 4 years ago at Hospital B, even though the clerk there misspelled your name.

I hope that record linkage functionality will be a step toward closer collaboration between OpenMRS implementers.

Schedule

Week 9 (23-27 July)

Step 1-3

Week 10 (30 July-4 August)

Step 4

Week 11 (6-10 August)

Step 5

Week 12 (13-17 August)

Step 5 and documenting, testing, debugging, polishing previous work

Week 13 (20-24 August)

Students upload code to code.google.com/hosting; mentors begin final evaluations; students begin final program evaluations

Roadmap for second semester

Step 4: Test the analytic and scoring components using the existing test data files. This involves at least the following tasks:

1) Since no tool exists today to create the config file, manually create an XML configuration file for each of the test data sets. There should be at least 2 XML config files for each test data set: one with linkage runs specified and one without. The config file without linkage runs specified should by default analyze all fields. The config file with linkage runs specified should analyze only those fields specified in the linkage runs.

2) Calculate token frequencies for each test data set using both config files described above.

3) Validate the actual frequency counts by implementing a separate validation tool (eg, a Perl script to count frequencies) and applying diff to the frequency lists generated by Java and Perl. For small test data sets, the frequency counts can be manually reviewed for accuracy.

4) Validate test set record-pair scores using scaled and non-scaled weights; compare scaled and non-scaled record-pair scores. Review original and modified weights for each scaled field and manually validate a random sample of scaled weights.

5) Validate the following scaling options in similar fashion: a) ignore/use nulls; b) Top-N/Bottom-N.
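The comparison in task 3 could also be done in Java once both frequency lists have been read into maps. The following is only a minimal sketch; the class and method names are hypothetical and are not part of the linkage module.

```java
import java.util.Map;
import java.util.Objects;
import java.util.SortedMap;
import java.util.TreeMap;
import java.util.TreeSet;

/** Hypothetical validation helper; not part of the actual linkage module. */
final class FrequencyListDiff {

    /**
     * Returns every token whose count differs between the two lists,
     * mapped to a short "javaCount vs scriptCount" description.
     * A token missing from one list shows up with a count of null.
     */
    static SortedMap<String, String> differences(Map<String, Integer> fromJava,
                                                 Map<String, Integer> fromScript) {
        // Union of the tokens seen by either tool, in sorted order.
        TreeSet<String> tokens = new TreeSet<>(fromJava.keySet());
        tokens.addAll(fromScript.keySet());

        SortedMap<String, String> differences = new TreeMap<>();
        for (String token : tokens) {
            Integer javaCount = fromJava.get(token);
            Integer scriptCount = fromScript.get(token);
            if (!Objects.equals(javaCount, scriptCount)) {
                differences.put(token, javaCount + " vs " + scriptCount);
            }
        }
        return differences;
    }
}
```

An empty result means the two tools agree on every token.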

Step 5: Re-factor the RecMatch record linkage GUI, incorporating the new OpenMRS linkage module objects. While the linkage objects will clearly function within OpenMRS, they will also likely function independently in a thick client, such as a Java Swing application. We should re-use the old RecMatch GUI code where practical and feasible. The first goal of refactoring will be to generate an XML configuration file.

Currently working on

I'm currently working on Step 4.

What has been done so far

In the second part of SoC, I have completed the following parts:

Step 1: Develop an analytic function to count token frequencies for different fields in a data source and store this statistical data in data structures for use by record-pair scoring methods. The token frequency counting functionality will be incorporated into an Analytic object. The record linkage framework can have multiple analytic objects (eg, objects to count token frequency, calculate entropy, etc.). These analytic objects not only generate and store statistical data; some aspects of their functionality (such as data structure information, eg, JDBC connection information) may also be accessed through a Modifier interface in the Analytic object.

Step 2: Implement frequency-based weight scaling to be incorporated with the scorePair functionality. This involves implementing the actual scoring equation in the Analyzer object through a score Modifier interface that scorePair can access. The Analyzer object and Modifier interface will provide access to the data structures containing the token frequencies generated in Step 1.

Step 3: Revise the XML configuration file to accommodate weight scaling. As discussed previously, the new configuration file will have 3 general sections. The first section contains metadata describing the actual data source and specifies the properties of each potential linking field. The second section contains metadata describing the analytic data; for example, it would reflect the relational tables and their connection properties for data generated by the frequency analysis and by other future analyses. The third section describes each individual "linkage run". Each linkage run is specified by a) the blocking fields; b) the fields used to establish the linkage score; c) the type of string comparator to use; and d) the analytic modifier to use. Currently weight scaling is the only analytic modifier, and for weight scaling to be applied, the fields must match exactly.

In the first part of SoC, I have completed Phase 1, 2 and 3 of our project plan. Here is what they are about:

Phase 1

Implement an analytic object that performs the following tasks:

a. Read linkage configuration file and determine which fields/columns need to be scaled.

b. Connect to data source containing individual tokens. Assume that the data source is either a flat file such as a CSV file or a relational database.

c. Add new information to the configuration file that indicates the location of the token frequencies.

d. Count frequencies of individual tokens.

e. Store token frequency results in a persistent structure (eg, a relational database table). In order to access the token frequency data at runtime, the frequency tables need to be identified in the configuration file. Thus, we will need to develop a programmatic scheme to identify each token frequency table associated with a given data source.
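To illustrate task d, here is a minimal sketch of counting token frequencies for one column of a delimited flat file. It is not the module's actual Analytic object: the class name, the method signature, and the `ignoreNulls` flag (modelled on the ignore/use nulls option) are assumptions made for this example, and the rows are assumed to be already read into memory.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

/** Hypothetical helper; not part of the actual linkage module. */
final class TokenFrequencyCounter {

    /**
     * Counts how often each token occurs in one column of delimited rows.
     * Empty tokens are skipped when ignoreNulls is true and counted
     * under the empty string otherwise.
     */
    static Map<String, Integer> count(List<String> rows, int column,
                                      String delimiter, boolean ignoreNulls) {
        Map<String, Integer> frequencies = new HashMap<>();
        // Quote the delimiter so characters such as '|' are taken literally.
        String regex = Pattern.quote(delimiter);
        for (String row : rows) {
            // A limit of -1 keeps trailing empty fields.
            String[] fields = row.split(regex, -1);
            String token = column < fields.length ? fields[column].trim() : "";
            if (token.isEmpty() && ignoreNulls) {
                continue;
            }
            frequencies.merge(token, 1, Integer::sum);
        }
        return frequencies;
    }
}
```

For example, `TokenFrequencyCounter.count(rows, 0, "|", true)` counts the tokens in the first column of pipe-delimited rows and leaves out empty values. For a relational data source, the same counts could instead come from a GROUP BY query.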


Here is a draft of how to store the analytic phase results in a relational database table:

Image:Patientmatchingdb.png

Right click here and choose "Save Target As" to load it into DBDesigner 4


Phase 2

Implement functionality that instantiates a data structure providing fast lookup of individual token frequencies. This data structure will likely be a hash table, where the key is the token value (eg, last name of “SMITH”) and the value is the token frequency (eg, 2,102). This data will be loaded from the persistent data structure created in Phase 1, task e.

Because the primary performance constraint for weight-based frequency scaling will be the lookup, we will need to be able to configure the number of elements loaded into the hash table. For example, it is likely that some fields will have hundreds of thousands of unique tokens (eg, name fields), while others will have on the order of 10 or 20 (middle initial, month of birth).

Also, weight scaling can be used to either increase or decrease individual field weights. If an individual token's frequency is below the average frequency, its weight will be increased; if it is above the average frequency, its weight will be decreased. Consequently, there needs to be some ability to configure the total number of tokens loaded into the lookup structure for each field.

a. Implement functionality to load top ‘N’ most/least frequent tokens from the persistent data structure, where top, bottom, and ‘N’ are specified in the configuration file.

b. Other (future) options may include top or bottom N%, or frequencies above or below N.
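Task a can be sketched as follows, assuming the full frequency table for a field has already been read from the persistent store into a map. The class name, method signature, and tie-breaking rule are hypothetical choices for this example; in practice the selection might be pushed into the database query instead.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper; not part of the actual linkage module. */
final class FrequencyLookupLoader {

    /**
     * Builds a hash table holding only the n most frequent tokens
     * (mostFrequent == true) or the n least frequent tokens (false).
     */
    static Map<String, Integer> load(Map<String, Integer> allFrequencies,
                                     int n, boolean mostFrequent) {
        List<Map.Entry<String, Integer>> entries =
                new ArrayList<>(allFrequencies.entrySet());

        // Order by count; ties are broken by token so the result is deterministic.
        Comparator<Map.Entry<String, Integer>> order = (a, b) -> {
            int byCount = Integer.compare(a.getValue(), b.getValue());
            if (mostFrequent) {
                byCount = -byCount;
            }
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        };
        entries.sort(order);

        // Keep at most n entries; a negative n loads nothing.
        int limit = Math.min(Math.max(n, 0), entries.size());
        Map<String, Integer> lookup = new HashMap<>();
        for (Map.Entry<String, Integer> entry : entries.subList(0, limit)) {
            lookup.put(entry.getKey(), entry.getValue());
        }
        return lookup;
    }
}
```

Tokens that are not loaded are simply absent from the lookup, so the scoring code has to decide how to treat a missing token, for example by leaving its weight unscaled.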

Phase 3

a) Modify the ScorePair method to incorporate frequency scaling. This process should be performed incrementally, in two phases.

The first phase will hard-code the frequency scaling equation into the existing ScorePair method. Once the entire linkage process (from analytics to operational phase) has been tested and successfully implements frequency scaling as a prototype, we will proceed to the second phase.

b) Re-factor the ScorePair class to accommodate a framework that accepts future modifications to the linkage scores established by the Fellegi-Sunter model. These modifications include frequency scaling, and will also include modifying the agreement weight based on the degree of string similarity as established by various string comparators.
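One possible shape for such a framework is sketched below. All names here are hypothetical, and the scaling formula (agreement weight multiplied by average frequency over token frequency) is only a placeholder chosen because it moves weights in the direction described in Phase 2; the project's actual scaling equation is not given on this page. The sketch also assumes that modifiers apply only when the two field values match exactly.

```java
import java.util.List;
import java.util.Map;

/** Hypothetical extension point: adjusts the agreement weight of one field. */
interface ScoreModifier {
    double modify(String token, double agreementWeight);
}

/** Placeholder frequency scaling: weight * (average frequency / token frequency). */
final class FrequencyScalingModifier implements ScoreModifier {
    private final Map<String, Integer> lookup;
    private final double averageFrequency;

    FrequencyScalingModifier(Map<String, Integer> lookup, double averageFrequency) {
        this.lookup = lookup;
        this.averageFrequency = averageFrequency;
    }

    @Override
    public double modify(String token, double agreementWeight) {
        Integer frequency = lookup.get(token);
        // Tokens that were not loaded into the lookup keep their unscaled weight.
        if (frequency == null || frequency <= 0) {
            return agreementWeight;
        }
        return agreementWeight * (averageFrequency / frequency);
    }
}

/** Hypothetical per-field scorer that applies modifiers only on exact agreement. */
final class FieldScorer {
    static double score(String a, String b, double agreementWeight,
                        double disagreementWeight, List<ScoreModifier> modifiers) {
        if (!a.equals(b)) {
            return disagreementWeight;
        }
        double weight = agreementWeight;
        for (ScoreModifier modifier : modifiers) {
            weight = modifier.modify(a, weight);
        }
        return weight;
    }
}
```

Keeping each adjustment behind one small interface means a later modifier, such as one based on string similarity, could be added to the list without changing the scorer itself.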

Patient Matching Blog

<xfeeds titlecolour="#B0C4DE" contentcolour="#eeeeee" feedlimit="5" totallimit="10"> http://soc.sarpcentel.com/feed/ </xfeeds>

Resources

Browse source code