Linking 2006 Census and hospital data in Canada
by Michelle Rotermann, Claudia Sanmartin, Richard Trudeau and Hélène St-Jean
Record linkage, the process of matching records across or within databases, is commonly used in health research to fill data gapsNote 1-7 and create a dataset with broad applications.Note 6-11 Most health-related linkages in Canada have relied on health insurance numbers (HINs) from provincial health registries, which are unique to individuals.Note 2,Note 3,Note 11,Note 12 However, HINs are not available in most databases (for instance, mortality, census, tax), and access to provincial registries is limited.
In the absence of a unique identifier and/or registry, an alternative approach, hierarchical deterministic exact matching, can be used to link health administrative databases and other data sources.Note 13 This involves matching different combinations of co-occurring person-level information.Note 14-18 Statistics Canada demonstrated the validity of this approach by linking census and hospitalization data in two provinces (Ontario and Manitoba).Note 19 Using birth date, sex and postal code to link files yielded results similar to those produced using HINs taken from the provincial health insurance registries.Note 19
This study presents the results of a hierarchical exact matching approach to link the 2006 Census of Population to the 2006/2007-to-2008/2009 Discharge Abstract Database (DAD), which contains hospital data for all provinces and territories (excluding Quebec). The purpose is to determine if the Census–DAD linkage performed similarly in different jurisdictions, and if linkage and coverage rates declined as time passed since the census. The linkage was approved by Statistics Canada’s Policy Committee.Note 20 Use of the linked data is governed by the Directive on Record Linkage.Note 21
2006 Census of Population
The 2006 Census collected information using short- and long-form questionnaires. Most households (80%) received the short form, which contained eight basic questions, including the birth date, sex and marital status of all household members. The remaining households (20%) received the long form, which contained an additional 53 questions on topics such as education, ethnicity, mobility, income and employment.Note 22 In some regions, all households were asked to complete the long form: Nunavut, Northwest Territories (excluding Yellowknife), Yukon (excluding Whitehorse), and other Indian Reserves and settlements.Note 22 The census represents 95% to 97% of the population in the provinces and 93% to 94% of the population in the territories.Note 23
For the purposes of record linkage, the complete census file (23.4 million records), which contains both short- and long-form records, was used. The long-form records (4.65 million) constitute the study cohort used for validation.
To make inferences about the Canadian population based on information from the long-form questionnaire, data are often weighted.Note 24 Sampling weights account for the survey design and the under- or overrepresentation of people with certain characteristics.Note 25 The census weights were not adjusted for linkage eligibility.
Quality standards for census collection and processing were rigorous.Note 22 As well, census information underwent quality verification, including comparisons with alternative data sources. Inconsistent or missing responses were imputed to ensure the internal consistency of information provided by each household. The overall imputation rate was 2.9%; imputation rates for “age” and “sex” were less than 1.5%.
The accuracy of address information, including postal codes, was central to the success of data collection. In areas where census forms were delivered by mail, Statistics Canada validated and updated address information before Census Day (May 16, 2006). In areas where enumerators delivered the forms, addresses were listed and verified at delivery.
T1 Personal Master Files
The T1 Personal Master File (T1PMF) is an annual file derived from tax returns. It contains name, date of birth, sex and postal code, which can be used for record linkage. To obtain postal code information for census respondents whose postal code was missing, incomplete, or had changed, T1PMFs for 2005 to 2009 were linked deterministically to the 2006 Census (short-form) file, based on sex, birth date, and partial family and first given names; income information was not retained. Approximately 90% of census records linked to at least one T1PMF year. For individuals who did not file taxes annually and/or were not required to file (for example, children), postal codes were assigned based on information about other household members who were tax-filers.
Discharge Abstract Database (DAD)
The DAD contains approximately 3 million hospital records with demographic, administrative and clinical data, as well as HINs, for all acute-care and some psychiatric, chronic rehabilitation and day-surgery hospital discharges occurring in a fiscal year (April 1 to March 31) in all provinces and territories except Quebec.Note 26,Note 27 Re-abstraction studies that compare information on the original DAD record with corresponding fields in patient charts repeatedly find that the non-clinical data elements, including date of birth, sex and postal code, are very reliable, with differences observed in these fields amounting to less than 2%.Note 27
For the purposes of record linkage, DAD data from 2005/2006 to 2008/2009 were used in the pre-processing phase of this study; data from 2006/2007 to 2008/2009 were used in the data linkage phase (2006/2007 n = 3,186,079; 2007/2008 n = 3,204,838; 2008/2009 n = 3,232,396). Because records pertain to hospitalizations, not to people, individuals who were hospitalized several times can be represented more than once in the DAD.
Record linkage involved three steps: data processing, record linkage and validation.
Before linkage, the data were processed to improve the quality of the variables that would be used (date of birth, postal code and sex) and to establish the unique set of linkage keys in each data file to minimize false links. This processing identified data errors or omissions that could result in false links.
A total of 23,397,153 census records were available for linkageNote 28 (Figure 1). Quebec records were excluded because Statistics Canada does not have access to the corresponding hospitalization data. Census records with an invalid or incomplete birth date were excluded. Records with missing information for sex were assigned a sex, and a duplicate record was created with the same birth date and postal code, but the opposite sex. Records with differing postal code information in the original and post-processed fields were also duplicated: one record with the original postal code; the other with the post-processed code. Original and duplicate census records were associated using a census group identifier, enabling the identification and removal of the duplicates once the linkage was completed. Lastly, a series of exclusions was applied to the census records to establish the final set of valid and unique linkage keys. The following keys were excluded: duplicate keys with the same date of birth, postal code and sex (for example, same-sex twins living in the same location); invalid linkage keys identified as a result of DAD data processing (described below); and keys with birthdates after May 16, 2006 (Census Day). In total, 23,369,308 valid and unique census keys were in-scope for data linkage, representing 96% of census respondents (excluding Quebec).
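The census key preparation described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the procedure Statistics Canada ran: the function name `build_census_keys` and the record fields (`group_id`, `birth_date`, `sex`, `postal_original`, `postal_processed`) are hypothetical, and the exclusion of keys flagged during DAD processing is omitted.

```python
from collections import Counter
from datetime import date

CENSUS_DAY = date(2006, 5, 16)


def build_census_keys(records):
    """Return {linkage_key: group_id} for keys that are valid and unique.

    Each record is a dict with hypothetical fields: group_id, birth_date
    (datetime.date or None), sex ("M", "F" or None), postal_original and
    postal_processed (strings or None).
    """
    candidates = []
    for rec in records:
        dob = rec["birth_date"]
        # Exclude missing birth dates and birth dates after Census Day.
        if dob is None or dob > CENSUS_DAY:
            continue
        # Missing sex: keep one candidate key per sex.
        sexes = [rec["sex"]] if rec["sex"] in ("M", "F") else ["M", "F"]
        # Differing postal codes: keep one candidate key per distinct code.
        postals = {rec["postal_original"], rec["postal_processed"]} - {None}
        for sex in sexes:
            for postal in sorted(postals):
                candidates.append(((dob, sex, postal), rec["group_id"]))
    # Drop keys shared by more than one record (for example, same-sex twins
    # living at the same location).
    counts = Counter(key for key, _ in candidates)
    return {key: gid for key, gid in candidates if counts[key] == 1}
```

Carrying the group identifier with every candidate key is what allows the duplicates created for missing sex or changed postal codes to be identified and removed once one of them links.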
DAD records for fiscal years 2005/2006 to 2008/2009 pertaining to Canadian residents hospitalized in provinces and territories outside Quebec were eligible for pre-processing (12,824,006) (Figure 2). Records with invalid or missing birth date, sex or postal code were excluded. The remaining DAD records underwent additional processing to improve consistency and accuracy of HINs and to ensure a one-to-one correspondence with the linkage keys. A common adjustment was the replacement of a temporary (typically the mother’s) or missing HIN on an infant’s DAD record with a permanent HIN that had subsequently been assigned and that appeared on subsequent hospitalization records.
More than 12.7 million DAD records with valid linking information were available, representing 7,686,518 unique and valid linkage keys (Figure 2). The large number of duplicate linkage keys was expected, given that individuals could be admitted to hospital more than once. The following keys were excluded: linkage keys associated with multiple HINs in the same province; invalid keys identified as a result of census data processing described above; and keys with birthdates after Census Day. In total, 6,172,706 valid and unique linkage keys for fiscal years 2006/2007 to 2008/2009 were in-scope for data linkage.
The hierarchical deterministic exact matching approach to link census and DAD data involved an iterative process in which linkage keys composed of date of birth, sex and postal code were compared across files. The use of multiple keys applied consecutively maximizes the discriminating power of the linking information and minimizes the impact of errors and missing values.Note 13
The iterative approach applied 28 rules. Early iterations observed stringent rules; subsequent passes tolerated some divergence (Table 1). For example, the first iteration required an exact match between the census birth date, sex and postal code and the DAD linkage keys. Iterations 2 to 4 required an exact match on birth date, sex and postal code obtained from the T1PMF and the DAD linkage keys. Postal codes from the 2005, 2006 and 2007 tax files were used when the census file was linked to the 2006/2007 DAD data. The 2006-to-2008 and 2007-to-2009 tax files were used when the census was linked to the 2007/2008 and 2008/2009 DAD data, respectively. Iterations 5 to 10 relaxed the rules for postal code, allowing one of the six characters of the census-reported postal code to be dropped. The process was repeated using the T1PMF postal codes (iterations 11 to 28).
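The relaxed postal code rules can be illustrated with a small helper that drops one character at a time. This is a sketch only: the text does not say in which order the six positions were dropped, so the left-to-right order here is an assumption, and the names `mask_postal` and `relaxed_keys` are hypothetical.

```python
def mask_postal(postal_code, position):
    """Return the postal code with the character at `position` (0 to 5)
    replaced by a placeholder, so that codes differing only at that
    position compare as equal."""
    code = postal_code.replace(" ", "").upper()
    if len(code) != 6:
        raise ValueError("expected a six-character postal code")
    if not 0 <= position < 6:
        raise ValueError("position must be between 0 and 5")
    return code[:position] + "*" + code[position + 1:]


def relaxed_keys(birth_date, sex, postal_code):
    """Build the six relaxed linkage keys, one per dropped position.
    The ordering of positions is an assumption, not taken from the study."""
    return [(birth_date, sex, mask_postal(postal_code, i)) for i in range(6)]
```

Applying the same masking to both files lets a relaxed pass reuse an exact comparison of keys, with birth date and sex still required to agree.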
At each iteration, only unique linkage keys from one file were compared to the unique keys in the other. When the iteration was completed, linked keys were removed from future iterations to ensure that keys were linked only once. Linked keys that had been added to deal with missing sex and/or postal code and sharing the same census group identifier as the linked keys were also removed. For census keys linking to a DAD key, a deterministic linkage to the full set of DAD discharge records was conducted using the linkage key, related HIN, and issuing province. Data processing and record linkage were conducted using SAS 9.2.
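The pass-by-pass logic can be sketched as follows. The study used SAS 9.2; Python is used here only for illustration. The function names, the record fields (`dob`, `sex`, `postal`, `tax_postal`) and the three example passes are hypothetical stand-ins for the 28 rules in Table 1. The removal of duplicate keys that share a census group identifier is left out for brevity.

```python
from collections import Counter


def unique_index(records, key_func):
    """Map key -> record id, keeping only keys that occur exactly once.
    Records for which key_func returns None are skipped."""
    keyed = []
    for rec_id, rec in records.items():
        key = key_func(rec)
        if key is not None:
            keyed.append((key, rec_id))
    counts = Counter(key for key, _ in keyed)
    return {key: rec_id for key, rec_id in keyed if counts[key] == 1}


def hierarchical_link(census, dad, passes):
    """Apply the passes in order; return {dad_id: (census_id, pass_number)}.
    Linked records are removed so that each can link only once."""
    census_left, dad_left = dict(census), dict(dad)
    links = {}
    for number, (census_key, dad_key) in enumerate(passes, start=1):
        census_index = unique_index(census_left, census_key)
        dad_index = unique_index(dad_left, dad_key)
        for key in census_index.keys() & dad_index.keys():
            c_id, d_id = census_index[key], dad_index[key]
            links[d_id] = (c_id, number)
            del census_left[c_id]
            del dad_left[d_id]
    return links


def exact(rec):
    return (rec["dob"], rec["sex"], rec["postal"])


def tax_postal(rec):
    if rec.get("tax_postal") is None:
        return None
    return (rec["dob"], rec["sex"], rec["tax_postal"])


def first_five(rec):
    return (rec["dob"], rec["sex"], rec["postal"][:5])


# Strictest rule first, then progressively relaxed rules.
EXAMPLE_PASSES = [(exact, exact), (tax_postal, exact), (first_five, first_five)]
```

Recording the pass number with each link makes it possible to report linkage rates by iteration, as the results do.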
Two types of linkage rates are reported. First, the percentage of DAD keys linking to census keys is reported for each iteration of the linkage. The final linkage rate is not expected to be 100%, owing to differences in the populations represented in the census and DAD, census undercoverage of specific subpopulations, and use of hospital services by individuals who entered Canada after Census Day (for example, new immigrants). Lower linkage rates are also expected among people who were institutionalized (for example, residents of long-term care facilities), since residents would share the same postal code, thereby reducing the uniqueness of the linkage keys.
Second, the percentage of census records eligible for linkage that linked to the DAD (2006/2007 to 2008/2009) is reported by province/territory and by selected socio-demographic characteristics. These rates were based on long-form census respondents (the validation cohort) who were eligible for record linkage (4,652,683 records; 94% of all long-form respondents) and reflect the prevalence of being hospitalized at least once in the respective fiscal year. Linkage rates were expected to be higher among groups, such as seniors, who are more likely to be hospitalized than non-seniors. Variation in linkage rates across provinces/territories may also reflect differences in health care delivery.Note 29
When only a fraction of records are expected to link, evaluating the quality of a data linkage is challenging. This is typical of health-based data linkages, where, for example, limited numbers of individuals are expected to be hospitalized or to die during the follow-up period. In such situations, it is not obvious if unlinked records represent missed links or if the event of interest did not occur. The quality of linkages in this context has been assessed by comparing results of different approaches that used the same dataNote 16,Note 17,Note 30-33 and by comparing outcome rates and percentage distributions of variables available in linked and unlinked data.Note 30,Note 34,Note 35
Annual national (excluding Quebec) and jurisdiction-specific unweighted and weighted coverage rates were calculated by dividing the number of acute-care hospital discharges among long-form census respondents in each jurisdiction according to the linked census-DAD data (numerator) by the number of acute-care hospital discharges reported in the unlinked 2006/2007, 2007/2008 and 2008/2009 DAD data (denominator). To more closely match the target population of the linked data, where possible, DAD records pertaining to populations not captured by the long-form census were removed from the denominator: residents of seniors’ homes, people born after Census Day, stillbirths, and non-Canadians.
Unweighted coverage rates should approach the percentage of the population completing the long-form census (about 20% nationally, varying from 16% for Newfoundland and Labrador to 63% to 69% for Nunavut).Note 22,Note 24 Weighted coverage rates should approach, but not equal, 100%, owing to differences in the populations covered by the linked census-DAD data and the unlinked DAD data. For example, the institutionalized population, who are high users of hospital services,Note 36,Note 37 are represented in the unlinked DAD data, but not in the linked census-DAD data.
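The coverage calculation can be sketched as below. It assumes that the weighted numerator is the sum of the census sampling weights attached to the linked discharges, which the text implies but does not spell out. The function name and inputs are hypothetical.

```python
def coverage_rates(linked_discharge_weights, dad_discharge_count):
    """Return (unweighted, weighted) coverage rates as percentages.

    linked_discharge_weights: one census sampling weight per acute-care
    discharge found in the linked census-DAD data (assumed input).
    dad_discharge_count: number of acute-care discharges in the unlinked
    DAD data after out-of-scope records have been removed.
    """
    weights = list(linked_discharge_weights)
    if dad_discharge_count <= 0:
        raise ValueError("denominator must be positive")
    unweighted = 100.0 * len(weights) / dad_discharge_count
    weighted = 100.0 * sum(weights) / dad_discharge_count
    return unweighted, weighted
```

With four linked discharges whose weights are 5, 5, 4 and 6, and 25 discharges in the denominator, the unweighted rate is 16% and the weighted rate is 80%.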
Linkage rates were compared by selected socioeconomic characteristics from the census to determine if rates were higher among people more likely to be hospitalized, such as those in lower-income groups and Aboriginal people.Note 36,Note 38,Note 39
Income quintiles were derived at the economic family level or directly for unattached individuals.Note 23 Total after-tax income from all sources and from all family members/individuals was summed, adjusted for family size, and divided into quintiles. To minimize regional income differences, quintiles were estimated separately for each province/territory, and then pooled.
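A minimal sketch of province-specific quintiles follows. The text does not state how income was adjusted for family size, so dividing by the square root of family size is an assumption made for illustration, as is the use of simple unweighted ranks. The function name and field names are hypothetical.

```python
import math
from collections import defaultdict


def assign_quintiles(families):
    """Return {family_id: quintile}, with 1 the lowest and 5 the highest.

    families: list of dicts with hypothetical fields id, province,
    income (total after-tax) and size (number of family members).
    """
    by_province = defaultdict(list)
    for fam in families:
        # Assumed equivalence scale: square root of family size.
        adjusted = fam["income"] / math.sqrt(fam["size"])
        by_province[fam["province"]].append((adjusted, fam["id"]))
    quintiles = {}
    # Rank within each province/territory, then pool the assigned quintiles.
    for members in by_province.values():
        members.sort()
        n = len(members)
        for rank, (_, fam_id) in enumerate(members):
            quintiles[fam_id] = rank * 5 // n + 1
    return quintiles
```

Because ranks are computed within each jurisdiction, two families with the same adjusted income can fall in different quintiles if they live in different provinces or territories.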
Highest level of education of people aged 18 or older was dichotomized as: at least secondary school graduation, or less than secondary graduation. People younger than 18, most of whom would be too young to have graduated, were excluded.
Information on Aboriginal status was derived from the question: “Is this person an Aboriginal person, that is, North American Indian, Métis or Inuit (Eskimo)?” Respondents marked all that applied. Responses were grouped into six categories: North American Indian (only), Métis (only), Inuit (only), other Aboriginal (multiple or indeterminate), Aboriginal (composite of four preceding categories), or non-Aboriginal.
Country of birth, citizenship, and immigration status were combined into an immigrant status variable: immigrant, non-immigrant or non-permanent resident. Immigrants were further subdivided into long-term (arrived at least 10 years before the 2006 Census) and recent (arrived in the 9 years before the 2006 Census).
A one-year residential mobility variable was created to reflect address changes: same address, moved within Canada, or moved from outside Canada. This was derived by comparing each respondent’s municipality and province of residence on Census Day and one year earlier.
A rural/urban variable reflected residence location and community size. Farm residences and non-farm residences in areas with a population of less than 1,000 were considered rural/farm. Population centres were categorized as small (1,000 to 29,999), medium (30,000 to 99,999) and large (100,000 or more).
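The size thresholds above translate directly into a classification rule. This is a sketch: the function name, its inputs and the category labels are hypothetical.

```python
def classify_residence(is_farm, area_population):
    """Classify a residence using the thresholds given in the text."""
    if is_farm or area_population < 1000:
        return "rural/farm"
    if area_population < 30000:
        return "small population centre"
    if area_population < 100000:
        return "medium population centre"
    return "large population centre"
```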
Respecting respondent privacy
Statistics Canada ensures respondent privacy during the linkage and subsequent use of linked files. Only employees directly involved in the process have access to the unique identifying information required for linkage (such as names and HINs) and do not access health-related information. When the data linkage is completed, an analytical file is created from which identifying information is removed. This de-identified file is accessed by analysts for validation and analysis.
Overall, 80%, or 1.66 million of eligible 2006/2007 DAD keys, were linked to the 2006 Census (Table 1). Results were similar for the eligible 2007/2008 (78% or 1.60 million) and 2008/2009 (77% or 1.57 million) DAD keys. The majority of links between the DAD and the 2006 Census (72% for 2006/2007 DAD to 59% for 2008/2009 DAD, or 1.50 to 1.22 million, respectively) were achieved in the first iteration, which required an exact match on birth date, sex and postal code. The number of links achieved using postal codes from tax files (iterations 2 to 4) ranged from 79,000 (4%) with the 2006/2007 DAD to 265,000 (13%) with the 2007/2008 DAD. Iterations 5 to 28 added an additional 85,000 to 88,000 (4%) links, depending on the DAD year.
The percentage of DAD keys linking to census keys tended to be consistent across provinces and age groups, but some exceptions were evident (Table 2). For example, the percentages of DAD keys that linked were comparatively low in Alberta (77% with 2006/2007 DAD to 73% with 2008/2009), British Columbia (78% to 75%) and the territories (73% to 72%). Lower rates were also observed for infants younger than age 1 (73% to 68%) and for 15- to 24-year-olds (70% to 61%).
The percentage of long-form census respondents who linked to the DAD (that is, they were hospitalized) ranged from 5.6% (2006/2007) to 5.2% (2008/2009) (Table 3). Linkage rates reflected expected differential use of hospital services. The rate was higher among females than males. Infants younger than 1 on Census Day (May 16, 2006) and seniors were more likely than other age groups to link to 2006/2007 DAD records. In subsequent DAD years, seniors’ hospital use remained comparatively high, but the rate among children who had been younger than age 1 on Census Day fell to that of children who had been aged 1 to 4 on Census Day. Other groups with higher linkage rates were those in low-income quintiles (6%) and those who identified as Aboriginal (7%). Linkage rates tended to be higher among rural than among urban populations.
Coverage rates for all-cause hospitalizations for 2006/2007 to 2008/2009 were 17% (unweighted) and 80% to 78% (weighted) (Table 4), but varied by jurisdiction. For example, unweighted coverage rates ranged from 16% to 22% in the provinces, and from 29% to 69% in the territories. Weighted coverage rates ranged from 75% to 84% in the provinces, and from 62% to 72% in the territories.
Over the study period, unweighted and weighted coverage rates were similar by sex, but not by age group. The weighted 2006/2007 rates for infants younger than age 1 and 15- to 24-year-olds were 6 to 10 percentage points below the all-ages total. Weighted coverage rates based on the 2007/2008 and 2008/2009 linked files also reflected this pattern, but because age was defined by the census rather than by hospitalization, undercoverage of youth in the later DAD files was apparent in the next-oldest age groups.
Based on a hierarchical deterministic exact matching approach, about 80% of linkage keys identified in the hospitalization data were linked to the 2006 Census. This was similar to several other Canadian studies, which reported match rates of 75% among records expected to link.Note 5,Note 40,Note 41
The hierarchical approach identified matches that would have been missed by a deterministic exact match approach conducted in a single pass.Note 13 The majority of links in the present study were made in the first iteration; an additional 8% to 17% were made in subsequent iterations. The use of updated postal codes from tax data to account for mobility contributed to achieving additional links to hospital data, particularly for later years, and overcame a limitation typical of most census-follow-up studies.Note 42 Linkage rates, coverage and quality of the data linkage remained consistent throughout the three-year study period.
Linkage rates to the DAD among the census long-form cohort were 5% to 6%, representing the percentage of census respondents who experienced at least one hospitalization during the three years. Linkage rates were higher among specific groups: seniors, people in low income quintiles, and Aboriginal people. This is consistent with previous research,Note 36,Note 38,Note 39 and provides evidence of the validity of the linkage and the suitability of the linked data for health analysis.
Coverage analysis revealed that the linked census-DAD file represents the majority of hospital events that occurred during the period (weighted results). Furthermore, the pattern of hospitalizations by patient characteristics was similar to results from the DAD alone, with some exceptions.
Unweighted coverage rates varied geographically, with higher rates in the territories, Manitoba and Saskatchewan. This is attributable to the sampling strategy of the census, whereby up to 100% of individuals in remote areas and on Indian Reserves and Settlements were asked to complete the long-form questionnaire. When weights were applied, Manitoba and Saskatchewan’s coverage rates were closer to rates for the other provinces. However, weighted rates for the territories were lower than those for the provinces; linked data may underestimate hospitalizations of territorial residents because of census undercoverage, higher rates of mobility, and/or a tendency to be hospitalized outside their jurisdiction of residence.Note 22,Note 24
As expected, the analysis revealed potential undercoverage of hospitalizations of specific age groups. For infants younger than age 1, lower linkage and coverage rates may be related to the availability of HINs on hospital birth records.Note 43 Despite attempts to correct this during data processing, some likely remained unresolved. Coverage rates were also lower among 15- to 24-year-olds, possibly reflecting census undercoverage of populations with less stable living arrangements and/or incomplete coverage of some Aboriginal populations.Note 22,Note 44 Because some of these populations have relatively high hospitalization rates, missed links may have a greater impact on coverage than the absolute numbers would suggest.Note 38
Same-sex twins residing at the same address on Census Day and same-sex twins hospitalized with the same postal code would have been dropped from the linked data file because of their non-unique linkage keys. However, twins and higher-order births represent about 3% of all births annually, and same-sex higher-order births are even rarer.Note 45
The linked census-DAD file has a number of important limitations.
First, to derive a set of unique linkage keys, removal of specific keys from both the census and the DAD was necessary. In census short forms, 97.6% of keys were retained; in census long forms, about 94% of keys were retained. The lower eligibility rate among long-form census respondents was most evident among those with lower socioeconomic status, people identifying as Aboriginal, rural/farm residents, and residents of Nunavut and British Columbia. Factors such as inaccurate or incomplete recording of dates of birth and/or other demographic details contribute to this situation.Note 5,Note 15,Note 34,Note 46
Second, the coverage analysis compares hospitalizations identified in the census-DAD linked data with those in the DAD data alone. However, the underlying populations differ. While attempts were made to remove DAD records for people not represented in the census (long-form), some may have been missed.
Finally, Quebec was excluded from this study because hospital data for this province are not available to Statistics Canada. In addition, hospitalizations of non-Quebec residents that occurred in Quebec would not be captured in the linked data.
The nationally representative sample and the statistical power provided by the size and population coverage of the linked census-DAD file offer new opportunities for research. Analysis of annual linkage and coverage rates suggests that the file’s completeness and quality remained consistent over time. Investigators who use these linked data should consider the potential impact of the linkage methodology, differences in linkage eligibility, linkage and coverage rates, and population exclusions.