Overview

The purpose of the SDLE program is to facilitate pan-Canadian social and economic statistical research. It is a record linkage environment that:

  • increases the relevance of existing Statistics Canada surveys without collecting new data (including maintaining the relevance of completed longitudinal surveys);
  • substantially increases the use of administrative data;
  • generates new information without additional data collection;
  • maintains the highest privacy and data security standards; and
  • promotes a standardized approach to record linkage processes and methods.

Benefits and public good

Fill data gaps: Studies conducted through the SDLE have the potential to address important information gaps related to the financial, social, economic and general activities and conditions of Canadians.

Reduce response burden: Through record linkage, important data needs in the analysis of social data can be met without incurring the cost or response burden of collecting new data.

Reduce record linkage costs: The SDLE process surrounding the preparation and management of files for record linkage is more efficient and timely through the use of a processing system and the retention of cumulative linkage results.

How it works

The SDLE is a highly secure environment that facilitates the creation of linked population data files for social analysis. It is not a large integrated data base.

At the core of the SDLE is a Derived Record Depository (DRD), essentially a national dynamic relational data base containing only basic personal identifiers. The DRD is created by linking selected Statistics Canada source index files for the purpose of producing a list of unique individuals. These files are brought into the environment, processed and linked only once to the DRD. Each individual in the DRD is assigned an SDLE identifier. Some of the source index files used to build the DRD include tax records, vital statistics registration records (births and deaths), and immigrant data. Updates to these data files are linked to the DRD on an ongoing basis.

Only basic personal identifiers are stored in the DRD. Examples of personal identifiers stored in the DRD include surnames, given names, date of birth, sex, insurance numbers, parents' names, marital status, addresses (including postal codes), telephone numbers, immigration date, emigration date and date of death.

The paired SDLE identifiers and source index file record IDs resulting from the record linkage are stored in a Key Registry. All source index files are linked to the DRD either probabilistically using a generalized software tool (G-Link) or deterministically using SAS scripts.

Deterministic record linkage involves matching records based on unique identifiers shared by both files. On the other hand, probabilistic record linkage works with non-unique identifiers (e.g. names, sex, date of birth and postal code) and estimates the likelihood that records are referring to the same entity.
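The contrast between the two approaches can be sketched in a few lines of Python. This is an illustration only, not a description of G-Link or of the SAS scripts used in the SDLE. The records, the field names, the agreement probabilities in `M_U` and the threshold are all invented for the example, and the probabilistic part uses a simple Fellegi-Sunter-style agreement weight as an assumed scoring method.

```python
import math

# Hypothetical records; all field names and values are invented for illustration.
file_a = [
    {"rec_id": "A1", "sin": "111", "surname": "Tremblay", "dob": "1980-01-15", "postal": "K1A0T6"},
    {"rec_id": "A2", "sin": None, "surname": "Singh", "dob": "1975-06-30", "postal": "M5V2T6"},
]
file_b = [
    {"rec_id": "B1", "sin": "111", "surname": "Tremblay", "dob": "1980-01-15", "postal": "K1A0T6"},
    {"rec_id": "B2", "sin": None, "surname": "Singh", "dob": "1975-06-30", "postal": "M4C1B5"},
    {"rec_id": "B3", "sin": None, "surname": "Roy", "dob": "1990-11-02", "postal": "H2X1Y4"},
]


def deterministic_link(a, b, key):
    """Pair records whose unique identifier `key` is present and equal in both files."""
    index = {r[key]: r["rec_id"] for r in b if r[key] is not None}
    return [(r["rec_id"], index[r[key]]) for r in a
            if r[key] is not None and r[key] in index]


# Assumed (m, u) pairs per field: m = P(fields agree | same person),
# u = P(fields agree | different people). These values are made up.
M_U = {"surname": (0.95, 0.01), "dob": (0.97, 0.001), "postal": (0.90, 0.0005)}


def match_weight(ra, rb):
    """Sum of log-likelihood-ratio weights over the non-unique identifiers."""
    total = 0.0
    for field, (m, u) in M_U.items():
        if ra[field] == rb[field]:
            total += math.log2(m / u)              # agreement raises the weight
        else:
            total += math.log2((1 - m) / (1 - u))  # disagreement lowers it
    return total


def probabilistic_link(a, b, threshold):
    """Link each record in `a` to its best-scoring candidate in `b`, if above threshold."""
    links = []
    for ra in a:
        best = max(b, key=lambda rb: match_weight(ra, rb))
        if match_weight(ra, best) >= threshold:
            links.append((ra["rec_id"], best["rec_id"]))
    return links


print(deterministic_link(file_a, file_b, "sin"))
print(probabilistic_link(file_a, file_b, threshold=10.0))
```

In this toy data, record A2 has no unique identifier, so only the probabilistic step can link it: the surname and date of birth agree with B2 even though the postal code differs, and the combined weight still clears the assumed threshold.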

Once a study requiring linked data has been defined and approved, the associated record IDs (extracted from the Key Registry) are used to find the individual records in the source data files. Selected variables from these sources can then be integrated into a linked analysis file. This approach provides a virtual linkage environment that eliminates the need to build a large integrated data base.
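As a rough sketch of this virtual linkage, the Python below uses a toy Key Registry to pull selected variables from two separate source data files into one analysis file. The registry layout, the source names, the record IDs and the variables are hypothetical, and the sketch assumes at most one record per individual in each source.

```python
# Hypothetical Key Registry: (SDLE identifier, source name, record ID) triples.
key_registry = [
    ("SDLE001", "survey", "S10"),
    ("SDLE001", "tax", "T77"),
    ("SDLE002", "survey", "S11"),
    ("SDLE003", "tax", "T78"),
]

# Hypothetical source data files: analysis variables keyed by record ID,
# with no personal identifiers.
survey_data = {"S10": {"self_rated_health": "good"}, "S11": {"self_rated_health": "fair"}}
tax_data = {"T77": {"total_income": 52000}, "T78": {"total_income": 31000}}


def build_linked_analysis_file(registry, cohort_source, cohort_data, linked_source, linked_data):
    """Join cohort records to a second source through shared SDLE identifiers."""
    # Group record IDs by SDLE identifier: {sdle_id: {source: rec_id}}.
    ids_by_sdle = {}
    for sdle_id, source, rec_id in registry:
        ids_by_sdle.setdefault(sdle_id, {})[source] = rec_id

    rows = []
    for sdle_id in sorted(ids_by_sdle):
        ids = ids_by_sdle[sdle_id]
        if cohort_source not in ids:
            continue  # individual is not part of the study cohort
        row = dict(cohort_data[ids[cohort_source]])
        linked_id = ids.get(linked_source)
        if linked_id is not None:
            row.update(linked_data[linked_id])  # add variables from the linked source
        rows.append(row)  # the SDLE identifier itself is not written to the output
    return rows


analysis_file = build_linked_analysis_file(key_registry, "survey", survey_data, "tax", tax_data)
print(analysis_file)
```

Only the record IDs travel between the registry and the source data files, so no combined data base is ever materialized; the analysis file exists only for the cohort requested.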

Figure 1. Social Data Linkage Environment overview diagram

Description for Figure 1: Social Data Linkage Environment overview diagram

This figure is a visual model that summarizes the text of this overview page.

  • Within the secure data environment at Statistics Canada, source files are separated into Source Data Files (record IDs and analysis variables without personal identifiers) and Source Index Files (record IDs and personal identifiers without analysis variables).
  • The Source Index Files are accessed within the record linkage production environment and linked to the Derived Record Depository (national longitudinal file of personal identifiers). The linked SDLE and record IDs are stored in the Key Registry (record IDs used as keys to find only those records needed for study).
  • The Source Data Files are accessed within the linked analysis file production environment that uses keys from the Key Registry to create analysis files for approved studies only and with no personal identifiers.
  • The SDLE program is governed by Statistics Canada senior management. The Chief Statistician reviews each record linkage proposal, and an analysis file is created only if the Chief Statistician approves the study.
  • The output of this process is an Analytical Product (non-confidential aggregate data).

Data sources

The DRD contains only record IDs and identifiers without analysis data. The principal source index files used to build the DRD (i.e. add individual records) and to update it (i.e. provide additional information for existing records) include:

  • T1 Personal Master Files (tax);
  • Canadian Child Tax Benefits (CCTB) files;
  • Canadian Vital Statistics – Birth database;
  • Landing File; and
  • Canadian Vital Statistics – Death database.

Other sources will be used to create linked analysis files for approved projects (some of which may also be used to update the DRD). See DRD linkage status.

In the future, additional files could be linked to the DRD. These could be data already residing in Statistics Canada or external files brought in for specific approved research projects.

Statistics Canada has responsibility for securely storing and processing data. Because SDLE research projects involve the use of linked micro-records, approval by the Chief Statistician of Canada on a study-by-study basis is required in accordance with the Directive on Microdata Linkage. Summaries of approved record linkages are published on the Statistics Canada website.

Linked analysis files

When a research project requiring linked data from the SDLE has been approved and linked in the SDLE production environment, the record IDs for the specified cohort and the associated record IDs of the file(s) to be linked to the cohort are drawn from the Key Registry. These record IDs are used to bring selected variables from the separate source data files together to create a linked analysis file.

Depending on the complexity of the source data file(s), decisions about how to structure the linked analysis file may be needed (e.g. working with multiple reference periods or with event-based files, etc.). Furthermore, the quality of the linked data must be assessed. Data that are linked in the SDLE will go through two kinds of validation:

  • Assessment of the record linkage: What is the match rate (%) with the DRD? Are the links valid (are there false positive links or missed links)?
  • Assessment of the linked analysis file: Do the linked data appear to make sense from a subject-matter point of view? Is there any bias caused by the linkage process? Do the data adequately represent the study population of interest?
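The linkage measures named in the first point can be computed as in the Python sketch below. It assumes a hypothetical review sample in which the true links are known; the function names, the record IDs and the choice of denominators are illustrative assumptions, not the SDLE's documented definitions.

```python
def match_rate(n_source_records, linked_ids):
    """Percentage of source records that were linked to the DRD."""
    return 100.0 * len(set(linked_ids)) / n_source_records


def link_error_rates(links, true_links):
    """Return (false positive rate, missed link rate) against a set of known true links.

    Assumed definitions: false positives are measured as a share of the links made,
    and missed links as a share of the true links.
    """
    links, true_links = set(links), set(true_links)
    false_positives = links - true_links
    missed = true_links - links
    fp_rate = len(false_positives) / len(links) if links else 0.0
    missed_rate = len(missed) / len(true_links) if true_links else 0.0
    return fp_rate, missed_rate


# Hypothetical sample: 5 source records, 3 of them linked, one of those wrongly.
links = [("R1", "D1"), ("R2", "D2"), ("R3", "D9")]
true_links = [("R1", "D1"), ("R2", "D2"), ("R3", "D3"), ("R4", "D4")]

print(match_rate(5, [source_id for source_id, _ in links]))
print(link_error_rates(links, true_links))
```

A high match rate alone does not show that the links are valid, which is why the error rates are reported alongside it.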

These file structuring decisions and data quality measures are documented and must be taken into account in the final analysis.

Services

In addition to maintaining the SDLE and conducting new record linkages, the SDLE team provides support to clients as required including:

  • assessing project feasibility;
  • advising on data sources, analytical limitations, and validation;
  • liaising with subject-matter experts;
  • assisting with approval steps;
  • building custom linked analysis files; and
  • providing training and outreach.

Statistics Canada makes custom services, such as the SDLE, available to Canadian organizations on a cost-recovery basis. Cost recovery means that clients pay for the direct and indirect costs of doing the work. Custom services are not funded by the budget that Parliament allocates to Statistics Canada. Costs reflect the requirements of each client and vary with the complexity of the proposal.

For more information, contact us by email at STATCAN.SDLE-ECDS.STATCAN@canada.ca.

Confidentiality and privacy

Linked analysis files are deemed sensitive statistical information and subject to the confidentiality requirements of the Statistics Act. To reduce the risk of privacy intrusiveness and to minimize the risk of disclosure, source files in SDLE are separated into source index files and source data files. As well, the record linkage production environment that uses the source index files is separated from the data integration and analysis environment that uses the source data files. That is, Statistics Canada employees performing the record linkages in SDLE have access to only the basic personal identifiers needed for linkage. Employees who build the analytical files for research have access only to the data stripped of personal identifiers. Anonymous keys are used to integrate the data from the various sources into a linked analysis data file. Further, only Statistics Canada employees who have an approved need to access the data for their analytical work are allowed access to the linked analysis file. The privacy impact assessment conducted by Statistics Canada found these processes acceptable to reduce the risk of privacy intrusiveness and to minimize the risk of disclosure.
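The separation of a source file into a source index file and a source data file can be illustrated with a minimal Python sketch. The field names, the set of fields treated as personal identifiers and the sample record are assumptions made for the example; the actual SDLE processing is not described at this level of detail.

```python
# Assumed set of fields treated as personal identifiers in this example.
IDENTIFIER_FIELDS = {"surname", "given_name", "date_of_birth"}


def split_source_file(records, id_field="record_id"):
    """Split records into an index file (identifiers) and a data file (analysis variables).

    Both outputs keep the record ID, which is the only field they share.
    """
    index_file, data_file = [], []
    for rec in records:
        index_file.append(
            {k: v for k, v in rec.items() if k == id_field or k in IDENTIFIER_FIELDS}
        )
        data_file.append(
            {k: v for k, v in rec.items() if k == id_field or k not in IDENTIFIER_FIELDS}
        )
    return index_file, data_file


# Hypothetical source record.
source = [
    {"record_id": "S10", "surname": "Tremblay", "given_name": "Marie",
     "date_of_birth": "1980-01-15", "self_rated_health": "good"},
]

index_file, data_file = split_source_file(source)
print(index_file)
print(data_file)
```

With this split, staff performing linkage would see only the index file and staff building analysis files would see only the data file, with the record ID acting as the key between the two.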

Definitions

  1. Derived Record Depository (DRD) is a national longitudinal data base of individuals derived from a number of Statistics Canada data files and containing only basic personal identifiers.
  2. Key Registry stores the paired SDLE identifiers and source index file record IDs identified through record linkage.
  3. Source index files contain personal identifiers without analysis variables.
  4. Source data files contain analysis variables without personal identifiers.