Patient Matching Design


This is a brainstorm design document for patient matching, written for the Newborn Screening Initiative and for incorporation into OpenMRS.

Chief Functionalities

Two primary functionalities to build:

ONE: A findMatch function will return probable matches for a given record. The function will be passed at least three types of parameters:

  1. A vector containing the identifiers and demographics for an entity/individual (components include name, DOB, etc.)
  2. Analytic metadata reflecting statistical characteristics of the data source to be queried, derived beforehand from statistical analysis of that data source. This information will be passed to scorePair below.
  3. Data source to be queried.

TWO: A scorePair function will evaluate two records that may represent the same person/entity. scorePair will return likelihood scores and statistical data (including estimated true positive and false positive match rates) for the pair. scorePair will receive at a minimum the following parameters:

  1. Two vectors representing the records to be evaluated as a potential match (components include name, DOB, etc.).
  2. Analytic metadata reflecting statistical characteristics of the data sources to be queried, derived beforehand from statistical analysis of those sources.

scorePair is agnostic to the data source containing the identifiers and demographics. For example, we may be matching multiple patients from two flat files, or may be querying a database for a single entity, etc. Consequently, managing single or multiple records to be evaluated will be abstracted outside of the scorePair function.
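
The division of labor described above can be sketched as a pair of Java interfaces. This is only an illustration of the contract, not a settled design: every type name here (RecordVector, AnalyticMetadata, DataSource, PairResult, PairScorer, MatchFinder) is hypothetical, and the metadata and data source types are left as empty placeholders.

```java
import java.util.List;
import java.util.Map;

/** Hypothetical: one record's identifiers and demographics, keyed by field name. */
final class RecordVector {
    final Map<String, String> fields;

    RecordVector(Map<String, String> fields) {
        this.fields = fields;
    }
}

/** Hypothetical placeholder for the analytic metadata (m/u rates, comparators, flags). */
final class AnalyticMetadata { }

/** Hypothetical handle on whatever is being queried (database, flat file, ...). */
interface DataSource { }

/** Hypothetical scorePair output: the pair score plus estimated match rates. */
final class PairResult {
    final double score;
    final double truePositiveRate;
    final double falsePositiveRate;

    PairResult(double score, double truePositiveRate, double falsePositiveRate) {
        this.score = score;
        this.truePositiveRate = truePositiveRate;
        this.falsePositiveRate = falsePositiveRate;
    }
}

/** scorePair sees only two records and the metadata, never the data source. */
interface PairScorer {
    PairResult scorePair(RecordVector a, RecordVector b, AnalyticMetadata metadata);
}

/** findMatch owns data access and hands each candidate pair to a PairScorer. */
interface MatchFinder {
    List<PairResult> findMatch(RecordVector query, AnalyticMetadata metadata, DataSource source);
}
```

Because PairScorer never receives a DataSource, the same scorer could in principle be reused for flat-file matching and for single-record database lookups.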

INPC newborn screening: A pilot project that will utilize this functionality

  1. Identifiers and demographics from incoming Riley HL7 newborn screening data will be stored in a scratch database.
  2. Identifiers and demographics from incoming INPC HL7 will be extracted.
  3. INPC HL7 identifiers will be pre-processed to remove “junk” (test messages, JOHN DOE, INFANT, BABYBOY, DOB 1-1-1900, etc.). Pre-processing rules will be ad hoc for now; eventually we would like a standardized representation (e.g. Arden) and standardized methods for generating them.
  4. Validated identifiers will be passed to findMatch as a vector, along with the analytic metadata and a reference to the scratch database. The analytic metadata will be created by the record linkage analytic tool.
  5. For each blocking variable combination (defined in the analytic metadata), the scratch database (containing Riley newborn screening results) will be queried for records matching the blocking variables.
  6. Each record returned from the scratch database will be paired with the INPC record and evaluated using scorePair.
  7. scorePair will return statistical data (pair score, metadata) for each record pair.
  8. Post-processing:
    1. Record pairs will be post-processed to remove false positives (e.g. twins, other familial linkages). Post-processing rules will be ad hoc for now; eventually we would like a standardized representation (e.g. Arden) and standardized methods for generating them.
    2. (For discussion) Those record pairs having a score lower than the “true positive” threshold for the given blocking combination can be excluded.
  9. findMatch will return the post-processed records to the calling function, or NULL if no match is found.
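
Steps 5 through 9 above might look roughly like the following sketch. It makes several simplifying assumptions that are not in the design: an in-memory list stands in for the scratch database, blocking means exact agreement on every field in the combination, a single threshold is used instead of one per blocking combination, and the names FindMatchSketch, ScoredPair, and PairScorer are hypothetical. Rule-based post-processing (step 8.1) is omitted.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

final class FindMatchSketch {

    /** Hypothetical scored pair: the candidate record and its pair score. */
    static final class ScoredPair {
        final Map<String, String> candidate;
        final double score;

        ScoredPair(Map<String, String> candidate, double score) {
            this.candidate = candidate;
            this.score = score;
        }
    }

    /** Hypothetical stand-in for scorePair, reduced to a single score. */
    interface PairScorer {
        double score(Map<String, String> a, Map<String, String> b);
    }

    /**
     * For each blocking combination, pulls the candidates that agree with the
     * query on every blocking field, scores each pair, and keeps pairs at or
     * above the threshold. Returns null when nothing qualifies (step 9).
     * A candidate reached through several combinations appears once per combination.
     */
    static List<ScoredPair> findMatch(Map<String, String> query,
                                      List<List<String>> blockingCombos,
                                      List<Map<String, String>> scratch,
                                      PairScorer scorer,
                                      double threshold) {
        List<ScoredPair> results = new ArrayList<>();
        for (List<String> combo : blockingCombos) {
            for (Map<String, String> candidate : scratch) {
                if (!agreesOnAll(query, candidate, combo)) {
                    continue; // not in this block
                }
                double score = scorer.score(query, candidate);
                if (score >= threshold) {
                    results.add(new ScoredPair(candidate, score));
                }
            }
        }
        return results.isEmpty() ? null : results;
    }

    /** True when both records carry the same non-null value for every listed field. */
    static boolean agreesOnAll(Map<String, String> query,
                               Map<String, String> candidate,
                               List<String> fields) {
        for (String field : fields) {
            String value = query.get(field);
            if (value == null || !value.equals(candidate.get(field))) {
                return false;
            }
        }
        return true;
    }
}
```

In a real implementation the inner loop over the scratch list would presumably be replaced by a database query per blocking combination.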

Specific design criteria for pilot project steps above:

  1. Identifiers and demographics (all upper case) to be stored in the Riley (scratch) database include:
    1. medical record number
    2. last name
    3. first name
    4. birth year
    5. birth month
    6. birth day
    7. gender (M/F)
    8. 3-digit area code (digits only)
    9. 7-digit telephone (digits only)
    10. city
    11. state
    12. 5-digit ZIP (not ZIP+4)
    13. physician last name
    14. physician first name
    15. Next of Kin last name
    16. Next of Kin first name
    17. Next of Kin middle name
  2. Identifiers and demographics to be extracted from INPC HL7 are the same as above.
  3. Preprocessing thoughts:
    1. Below are examples of heuristics to entirely exclude incoming HL7 messages:
      • LN and FN null
      • LN = 'AAA' and FN contains 'DUPL'
      • LN = 'CHECK-NAME'
      • LN = 'DONOTUSE'
      • LN = 'NONAME'
      • LN = 'REUSE'
      • LN begins with "X-"
      • LN has > 1 digit or FN has > 1 digit
      • LN = 'BUSINESS' and FN =
      • LN = 'UNIDENTIFIED' and FN =
      • LN starts with 'UNK' and FN =
      • LN = 'DOE' and FN starts with 'J'
      • LN starts with 'J' and FN = 'DOE'
      • LN = 'BUS' and FN = 'OFF'
      • LN = 'BOO' and FN = 'BOO'
    2. The following are example rules to clean individual fields:
      • Nullify FN = {‘INFANT’, ‘BABY’, ‘BOY’, ‘GIRL’, ‘BABYBOY’, ‘BABYGIRL’}
      • Nullify DOB = ‘1/1/1900’
      • Nullify telephone with ‘99999’ or ‘00000’
      • Nullify ZIP = ‘00000’ or ‘99999’
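
A few of the heuristics above are sketched in Java below. This is a partial illustration only: the class and method names (Preprocessor, shouldExclude, cleanFirstName, cleanDob, cleanZip) are hypothetical, inputs are assumed to be upper-cased already, the DOB is assumed to arrive as the literal string '1/1/1900', and the rules whose FN condition is not filled in above are left out, as is the telephone rule.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

final class Preprocessor {

    private static final Set<String> JUNK_LAST_NAMES =
            new HashSet<>(Arrays.asList("CHECK-NAME", "DONOTUSE", "NONAME", "REUSE"));

    private static final Set<String> PLACEHOLDER_FIRST_NAMES =
            new HashSet<>(Arrays.asList("INFANT", "BABY", "BOY", "GIRL", "BABYBOY", "BABYGIRL"));

    /** True if the whole message should be dropped (subset of the listed heuristics). */
    static boolean shouldExclude(String lastName, String firstName) {
        if (lastName == null && firstName == null) {
            return true;
        }
        String ln = lastName == null ? "" : lastName;
        String fn = firstName == null ? "" : firstName;
        if (JUNK_LAST_NAMES.contains(ln)) return true;
        if (ln.startsWith("X-")) return true;
        if (ln.equals("AAA") && fn.contains("DUPL")) return true;
        if (digitCount(ln) > 1 || digitCount(fn) > 1) return true;
        if (ln.equals("DOE") && fn.startsWith("J")) return true;
        if (ln.startsWith("J") && fn.equals("DOE")) return true;
        if (ln.equals("BUS") && fn.equals("OFF")) return true;
        if (ln.equals("BOO") && fn.equals("BOO")) return true;
        return false;
    }

    /** Nullifies placeholder first names such as INFANT or BABYBOY. */
    static String cleanFirstName(String firstName) {
        return firstName != null && PLACEHOLDER_FIRST_NAMES.contains(firstName) ? null : firstName;
    }

    /** Nullifies the sentinel date of birth. */
    static String cleanDob(String dob) {
        return "1/1/1900".equals(dob) ? null : dob;
    }

    /** Nullifies sentinel ZIP codes. */
    static String cleanZip(String zip) {
        return "00000".equals(zip) || "99999".equals(zip) ? null : zip;
    }

    private static int digitCount(String s) {
        int count = 0;
        for (int i = 0; i < s.length(); i++) {
            if (Character.isDigit(s.charAt(i))) {
                count++;
            }
        }
        return count;
    }
}
```

Keeping message-level exclusion separate from field-level cleaning mirrors the two rule lists above and would make it easier to move either set into a standardized rule representation later.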
  4. The analytic metadata contains the following:
    • Agreement rates for true-matches (m statistic) and agreement rates for non-matches (u statistic) for each field, by blocking combination.
    • Data types for each field as either numeric or alpha-numeric (string).
    • Comparator to be used for each field (e.g., exact match, Jaro-Winkler, Levenshtein, or longest common substring).
    • Total number of records in data source
    • Total number of unique non-null values for each field
    • Total number of null values for each field
    • Flag indicating whether to scale agreement weight based on term frequency (default: no)
    • Flag indicating whether to use null frequency when scaling agreement weight based on term frequency (default: no)
    • Flag indicating how to treat nulls when one or both fields are null (e.g. apply disagreement weight, apply agreement weight, or apply zero weight) (default: apply zero weight)
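
To show how the m and u statistics and the null-handling flag could feed a field-level score, here is a minimal sketch. It assumes the conventional Fellegi-Sunter log-likelihood weights, log2(m/u) for agreement and log2((1-m)/(1-u)) for disagreement, which this document does not spell out. It also assumes an exact-match comparator and ignores term-frequency scaling. FieldWeights and NullPolicy are hypothetical names.

```java
final class FieldWeights {

    /** Mirrors the three null-handling options listed above. */
    enum NullPolicy { APPLY_DISAGREEMENT, APPLY_AGREEMENT, APPLY_ZERO }

    final double m; // agreement rate among true matches
    final double u; // agreement rate among non-matches
    final NullPolicy nullPolicy;

    FieldWeights(double m, double u, NullPolicy nullPolicy) {
        this.m = m;
        this.u = u;
        this.nullPolicy = nullPolicy;
    }

    /** Weight added when the two field values agree. */
    double agreementWeight() {
        return log2(m / u);
    }

    /** Weight added (normally negative) when the two field values disagree. */
    double disagreementWeight() {
        return log2((1.0 - m) / (1.0 - u));
    }

    /** Weight for one field of a record pair, using exact-match comparison. */
    double weigh(String a, String b) {
        if (a == null || b == null) {
            switch (nullPolicy) {
                case APPLY_DISAGREEMENT:
                    return disagreementWeight();
                case APPLY_AGREEMENT:
                    return agreementWeight();
                default:
                    return 0.0; // APPLY_ZERO, the stated default
            }
        }
        return a.equals(b) ? agreementWeight() : disagreementWeight();
    }

    private static double log2(double x) {
        return Math.log(x) / Math.log(2.0);
    }
}
```

Under these assumptions a pair score would be the sum of weigh(...) over all compared fields, with one FieldWeights instance per field and blocking combination.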
  5. Review the analytic file structure to make sure that all necessary data is included in the metafile for scorePair to run its process.
  6. Need to specify how findMatch will connect to scratch database (e.g. JDBC).
  7. Write a query to pull all records matching the blocking criteria.
  8. Need to specify the output format of scorePair (e.g. <record pair>, <pair score>, <TP rate>, <FP rate>)
  9. Below are examples of exclusion heuristics to post-process potential false positive matches:
    • FNs disagree and SSNs disagree
    • Genders both not null and Genders disagree
    • FNs disagree and Genders disagree
    • FNs disagree and at least one SSN null
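
The four exclusion heuristics above could be expressed as a single predicate, sketched below. The sketch assumes that two values "disagree" only when both are non-null and different. PostProcessor and its method names are hypothetical.

```java
final class PostProcessor {

    /** Assumption: values disagree only when both are present and different. */
    static boolean disagree(String a, String b) {
        return a != null && b != null && !a.equals(b);
    }

    /** True if the pair trips any of the four listed exclusion heuristics. */
    static boolean isLikelyFalsePositive(String fnA, String fnB,
                                         String ssnA, String ssnB,
                                         String genderA, String genderB) {
        boolean fnDisagree = disagree(fnA, fnB);
        boolean ssnDisagree = disagree(ssnA, ssnB);
        boolean genderDisagree = disagree(genderA, genderB);
        boolean anySsnNull = ssnA == null || ssnB == null;

        return (fnDisagree && ssnDisagree)      // FNs disagree and SSNs disagree
                || genderDisagree               // genders both not null and disagree
                || (fnDisagree && genderDisagree) // kept to mirror the list one-to-one
                || (fnDisagree && anySsnNull);  // FNs disagree and at least one SSN null
    }
}
```

With this definition of disagreement, the third rule is already implied by the second; it is kept only so the code lines up with the list.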
  10. Gender imputation:
    • For cases where one or both genders are missing, impute gender based on common first name lists (available from census.gov).
    • If both imputed genders agree, then the genders are treated as agreeing.
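
A minimal sketch of the imputation step follows. It assumes the first-name lists have already been loaded into a map from upper-case first name to 'M' or 'F', and that a recorded gender always takes precedence over an imputed one. GenderImputer and its methods are hypothetical names.

```java
import java.util.Map;

final class GenderImputer {

    // Assumed to be built from the first-name lists; key is an upper-case first name.
    private final Map<String, String> genderByFirstName;

    GenderImputer(Map<String, String> genderByFirstName) {
        this.genderByFirstName = genderByFirstName;
    }

    /** Returns the recorded gender if present, otherwise a lookup by first name (may be null). */
    String effectiveGender(String recordedGender, String firstName) {
        if (recordedGender != null) {
            return recordedGender;
        }
        return firstName == null ? null : genderByFirstName.get(firstName);
    }

    /** True only when both effective genders are known and equal. */
    boolean gendersAgree(String genderA, String firstNameA, String genderB, String firstNameB) {
        String a = effectiveGender(genderA, firstNameA);
        String b = effectiveGender(genderB, firstNameB);
        return a != null && a.equals(b);
    }
}
```

When a first name is not in the map the gender stays unknown, so the pair is neither counted as agreeing nor as disagreeing on gender.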

API

import java.util.Hashtable;
import java.util.List;

public class Person {

  String id;
  Hashtable<String, String> demographics;

}
public class Match {

  String id;
  int confidence;

}
public interface MatchingService {

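  // Assumption: a token's data type is passed as a constant such as MatchingService.STRING.
  int STRING = 0; // placeholder value

  void addToken(String tokenName, int tokenType) throws DuplicateMatchingTokenException;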

  void add(Person p);
  void update(Person p);
  void delete(Person p);

  List<Match> findSensitiveMatches(Person p, int threshold);

  List<Match> findSensitiveMatches(List<Person> list, int threshold);

  List<Match> findSpecificMatches(Person p, int threshold);

  List<Match> findSpecificMatches(List<Person> list, int threshold);
}