3.4 Processing
3.4.5 Record linkage

Record linkage is the process by which records or units from different data sources are joined together into a single file using non-unique identifiers, such as names, dates of birth, addresses and other characteristics. It is also known as data matching, data linkage, entity resolution and many other terms, depending on the field in which it is used. The idea of record linkage dates back to the 1950s. Since then, the technique has been applied in a wide range of areas, such as data warehousing and business intelligence, historical research, and medical practice and research.

Matching has a long history of use in statistical surveys and administrative data development. At Statistics Canada, record linkage is used to create sampling frames, remove duplicates from files, provide extra information to assist data processing, or combine files so that relationships between two or more data elements from separate files can be studied. For example:

  • A business register consisting of names, addresses, and other identifying information such as complete financial information might be constructed from tax and employment databases.
  • A survey of retail establishments or agricultural establishments might combine results from an area frame and a list frame. To produce a combined estimator, units from the area frame would need to be identified in the list frame.
  • The coverage of the Census of Population can be measured by linking the Census records to other sources of administrative data and by estimating the percentage of individuals found in one source but not in the other.
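
The coverage example in the last bullet can be sketched in a few lines. This is only an illustration: it assumes the two sources have already been linked so that each individual carries a common identifier, and the identifiers and the function name `undercoverage_rate` are hypothetical.

```python
def undercoverage_rate(census_ids, admin_ids):
    """Share of individuals in the administrative source that are not found in the census."""
    census = set(census_ids)
    admin = set(admin_ids)
    if not admin:
        raise ValueError("administrative source is empty")
    missing_from_census = admin - census
    return len(missing_from_census) / len(admin)


# Hypothetical linked identifiers, invented for this example.
census_ids = ["p1", "p2", "p3", "p4"]
admin_ids = ["p2", "p3", "p4", "p5", "p6"]

rate = undercoverage_rate(census_ids, admin_ids)
print(f"Share of administrative records not found in the census: {rate:.0%}")
```

The result is only as good as the linkage behind it: a missed link would be counted here as a person missing from the census.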

Types of Linkage

There are two types of record linkage: exact matching and statistical matching. Exact matching can be divided into two subtypes, deterministic record linkage and probabilistic record linkage, as illustrated in Figure 3.4.5.1 below.

Figure 3.4.5.1 Types of record linkage

Description for Figure 3.4.5.1

The figure is a hierarchical diagram showing the relation between the different types of linkage.

Statistical Matching

The purpose of statistical matching is to create a file reflecting the underlying population distribution. Records that are combined do not necessarily correspond to the same entity, such as a person or a business. The files being matched can contain different units, but they must refer to the same population. It is assumed that the relationship between the variables in the population will be similar to the relationship in the files. This method is mainly used in market research and is seldom used by official statistical agencies.
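
One way to picture statistical matching is nearest-neighbour donor matching, sketched below. The text does not name a specific method, so this technique is an assumption, and the variable names, data values and the function `statistical_match` are hypothetical. Each recipient record borrows a variable from the most similar donor record, even though the two records describe different units.

```python
def statistical_match(recipients, donors, common, donated):
    """Attach `donated` from the closest donor (squared distance on `common`) to each recipient."""
    fused = []
    for rec in recipients:
        # The closest donor is a similar unit, not the same entity.
        donor = min(donors, key=lambda d: sum((rec[v] - d[v]) ** 2 for v in common))
        fused.append({**rec, donated: donor[donated]})
    return fused


# Hypothetical files drawn from the same population but containing different units.
recipients = [
    {"age": 34, "income": 52000},
    {"age": 61, "income": 38000},
]
donors = [
    {"age": 30, "income": 50000, "spending": 2100},
    {"age": 65, "income": 36000, "spending": 1500},
]

fused = statistical_match(recipients, donors, common=["age", "income"], donated="spending")
for row in fused:
    print(row)
```

The common variables are not rescaled in this sketch, so income dominates the distance. A real application would need to decide how to weight or standardize them.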

Exact Matching

The goal of exact matching is to link information about a particular record in one file to information in a second file in order to create a single file with correct information for each record. The linkage is performed at the record level, for example by linking mortality records to the population census.

Deterministic Record Linkage

This is the simplest form of record linkage. It produces links based on common identifiers or variables among the available data sources. It is often the case that no single variable is free of errors, is present in the majority of the data and has enough discriminating power. Only a combination of variables will be able to discriminate between two records. This technique is often used by official statistical agencies. Statistics Canada uses this method to build its business, address and population registers, which are subsequently used in multiple survey operations.
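
Linking on a combination of variables can be sketched as follows. This is a minimal illustration, not a description of any production system: the field names, the sample records and the helper functions (`normalize`, `match_key`, `deterministic_link`) are all hypothetical.

```python
def normalize(value):
    """Lower-case and collapse whitespace so formatting differences do not block a match."""
    return " ".join(str(value).lower().split())


def match_key(record, fields):
    """Build the combined key; no single field is assumed to be discriminating on its own."""
    return tuple(normalize(record[f]) for f in fields)


def deterministic_link(file_a, file_b, fields):
    """Return (a, b) pairs that agree exactly on every field in `fields`."""
    index = {}
    for rec in file_b:
        index.setdefault(match_key(rec, fields), []).append(rec)
    links = []
    for rec in file_a:
        for candidate in index.get(match_key(rec, fields), []):
            links.append((rec, candidate))
    return links


# Hypothetical records, invented for this example.
file_a = [
    {"id": "A1", "surname": "Tremblay", "birth_date": "1980-02-14", "postal_code": "K1A 0T6"},
    {"id": "A2", "surname": "Smith", "birth_date": "1975-07-01", "postal_code": "M5V 2T6"},
]
file_b = [
    {"id": "B1", "surname": "tremblay ", "birth_date": "1980-02-14", "postal_code": "K1A 0T6"},
    {"id": "B2", "surname": "Smyth", "birth_date": "1975-07-01", "postal_code": "M5V 2T6"},
]

links = deterministic_link(file_a, file_b, ["surname", "birth_date", "postal_code"])
linked_ids = [(a["id"], b["id"]) for a, b in links]
print(linked_ids)
```

In this example A1 and B1 are linked after normalization, while A2 and B2 are not, because "Smith" and "Smyth" disagree. A single spelling error is enough to defeat an exact comparison.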

Probabilistic Record Linkage

This is another type of exact matching. As with deterministic linkage, there is no unique identifier available for matching. Unlike deterministic linkage, probabilistic linkage can compensate for information that is incomplete and/or subject to error. Records that are not in complete agreement on every variable can still be paired to build a set of potential pairs. A score is then calculated for each potential pair. After that, a linkage status is assigned to each potential pair based on its score.
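
The scoring and status assignment described above can be sketched as follows. The text does not say how the score is computed. This example assumes agreement and disagreement weights in the style of the Fellegi-Sunter model, where m is the probability that a field agrees for a true match and u is the probability that it agrees for a non-match. The m and u values, the two thresholds, the field names and the function names are all hypothetical.

```python
import math

# Hypothetical (m, u) probabilities per field; in practice these would be estimated from data.
M_U = {
    "surname": (0.95, 0.01),
    "birth_date": (0.97, 0.003),
    "postal_code": (0.90, 0.05),
}

# Hypothetical thresholds separating links, possible links and non-links.
UPPER, LOWER = 6.0, 0.0


def pair_score(rec_a, rec_b):
    """Sum of log2 agreement/disagreement weights over the compared fields."""
    score = 0.0
    for field, (m, u) in M_U.items():
        a, b = rec_a.get(field), rec_b.get(field)
        if a is None or b is None:
            continue  # a missing value contributes no evidence either way
        if a == b:
            score += math.log2(m / u)
        else:
            score += math.log2((1 - m) / (1 - u))
    return score


def linkage_status(score):
    """Assign a status to a potential pair based on its score."""
    if score >= UPPER:
        return "link"
    if score <= LOWER:
        return "non-link"
    return "possible link"


# Hypothetical records, invented for this example.
a = {"surname": "smith", "birth_date": "1975-07-01", "postal_code": "M5V 2T6"}
b_typo = {"surname": "smyth", "birth_date": "1975-07-01", "postal_code": "M5V 2T6"}
b_partial = {"surname": "smith", "birth_date": None, "postal_code": "K1A 0T6"}
b_other = {"surname": "jones", "birth_date": "1990-03-09", "postal_code": "M5V 2T6"}

for name, b in [("typo", b_typo), ("partial", b_partial), ("other", b_other)]:
    score = pair_score(a, b)
    print(name, round(score, 2), linkage_status(score))
```

With these assumed values, the pair that differs only by a surname misspelling is still classified as a link, and the pair with a missing date of birth falls into the middle band. In practice, pairs in that band are usually sent for manual review.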

Remark

Numerous factors must be considered when deciding which type of record linkage to use, such as the purpose of the linkage, the type of data, cost, time, confidentiality, the acceptable level of precision and the types of error. In general, deterministic matching is less computationally intensive but involves more manual intervention. Probabilistic linkage is more time-consuming and computationally intensive, and requires specialized software. However, it generally produces more reliable results than deterministic linkage.

