The 2011 National Household Survey—the complete statistical story
by Wayne R. Smith, Chief Statistician of Canada
I am frequently asked for Statistics Canada’s assessment of the data quality from the voluntary 2011 National Household Survey (NHS) that replaced the mandatory long-form census.
- What were the consequences of the move from a mandatory to a voluntary survey?
- Was there greater sampling error in the NHS?
- What is the quality of data at low levels of geography?
- Were Aboriginal peoples, recent immigrants or Canadians in low-income groups under-represented?
Unfortunately, the whole statistical story has not been fully conveyed. With this blog, I hope to provide a comprehensive statement on this topic.
Statistics Canada has always stated that a mandatory survey will inevitably produce data of better overall quality than a voluntary survey of the same size, all other things being equal. The 2011 National Household Survey achieved a collection response rate of 68.6% and a weighted response rate of 77.2%—significantly lower than the 2006 Census long form that achieved a response rate of 93.8%.
To offset the data quality risks associated with the move to a voluntary survey, the agency took many mitigation measures and invested a great deal of effort in assessing and reporting on the quality of the resulting estimates. Where Statistics Canada deemed that estimates were not of sufficient quality (that is, not fit for use), it did not release them; where it deemed that estimates should be used with caution, it communicated this to users. The remaining estimates, the vast majority, were deemed fit for use and released without caveats. Based on this work, we can say that the National Household Survey produced a rich and robust database of information.
What were the challenges?
With the move from a mandatory approach to a voluntary one, we anticipated a significant reduction in survey response rates. We also knew that this reduction would bring with it three principal challenges: variability of response rates at lower geographic levels, sampling error and non-response bias.
Let me explain what Statistics Canada did to address these three issues.
Variability of response rates at lower levels of geography
Statistics Canada did not publish community-level data from the National Household Survey for approximately 1,100 communities, representing 3% of the Canadian population, because their response rates were unacceptably low by Statistics Canada’s standards and therefore posed data quality risks. This compares with fewer than 160 communities whose data were not released as a result of data quality issues in 2006. The 2011 NHS data for these small communities remain available on request.
Mitigating the risk of sampling error
We knew that if the initial sample size for the 2011 National Household Survey remained the same as in 2006, an increase in sampling error would result. So, to mitigate this risk, Statistics Canada increased the sampling rate in the 2011 National Household Survey from one in five households, for the Census long form in 2006, to one in three households. The sample selection was random and based on a well-defined methodological design. The number of households responding to the voluntary 2011 National Household Survey was 2,657,461, containing 6,719,688 people. This was 9% higher than the number of households responding to the mandatory 2006 Census long form (2,443,507 households, containing 6,136,517 people).
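As a rough check on the figures quoted above, the increase in responding households and people can be recomputed directly. The last step is an illustration only: it assumes simple random sampling with complete response, under which sampling error scales with one over the square root of the sample size. That assumption is a textbook simplification, not a description of the NHS design.

```python
import math

# Counts quoted in the text above.
households_2011 = 2_657_461
households_2006 = 2_443_507
people_2011 = 6_719_688
people_2006 = 6_136_517


def pct_increase(new, old):
    """Percentage increase from old to new."""
    return 100.0 * (new - old) / old


household_increase = pct_increase(households_2011, households_2006)
people_increase = pct_increase(people_2011, people_2006)

# ASSUMPTION: simple random sampling, so standard error ~ 1 / sqrt(n).
# A ratio below 1 means the 2011 standard error would be smaller than in 2006.
se_ratio = math.sqrt(households_2006 / households_2011)

print(f"Responding households: +{household_increase:.1f}%")
print(f"People in responding households: +{people_increase:.1f}%")
print(f"Illustrative 2011/2006 standard error ratio: {se_ratio:.3f}")
```

Under that simplification, a larger responding sample offsets the loss of precision from a lower response rate, but it says nothing about non-response bias, which is a separate issue.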
Because more households responded in 2011 than in 2006, there was no material increase in sampling error at higher levels of aggregation. At lower levels of aggregation, however, the response rate and sampling error varied. For this reason, Statistics Canada calculated coefficients of variation (CVs, a measure of sampling error) for the 2011 NHS for a selection of variables at various levels of geography, and published them on its website alongside the CVs for the same variables in the 2006 Census long form.
At the national level, for the nine variables for which a comparison was provided, the coefficient of variation for the 2011 NHS was lower for seven variables and higher for two, but, generally, the coefficients were very small and very similar in magnitude. Lower CVs generally reflect higher accuracy and data quality. For geographies with smaller populations, the CVs were higher and more varied. However, they were similar in magnitude to those of 2006 for the regions for which results were released, with 2011 coefficients sometimes being lower and sometimes higher.
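For readers unfamiliar with the measure, a coefficient of variation is the standard error of an estimate divided by the estimate itself, usually expressed as a percentage. The sketch below computes it for an estimated proportion under simple random sampling. The function name and the example figures are hypothetical, and the published NHS CVs come from a more elaborate variance estimation than this textbook formula.

```python
import math


def cv_of_proportion(p_hat, n, population_size=None):
    """CV (in percent) of an estimated proportion.

    ASSUMPTION: simple random sampling without replacement. If
    population_size is given, the finite population correction is applied.
    """
    if not 0 < p_hat <= 1:
        raise ValueError("p_hat must be in (0, 1]")
    variance = p_hat * (1 - p_hat) / n
    if population_size is not None:
        variance *= 1 - n / population_size
    return 100 * math.sqrt(variance) / p_hat


# Hypothetical example: the same 20% estimate from samples of different sizes.
for n in (1_000, 4_000):
    print(f"n = {n}: CV = {cv_of_proportion(0.2, n):.2f}%")
```

Quadrupling the sample size halves the CV, which is why estimates for small geographies, resting on fewer responses, show higher and more variable CVs.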
It is important to remember, in comparing data from previous census long forms to the 2011 NHS, that the data from the previous census long forms were also based on a sample and therefore subject to sampling error.
Mitigating the risk of non-response bias
A major concern with the move to a voluntary survey was the potential, under lower response rates, for non-response bias. This occurs when some individuals in the population are less likely to respond than others and also differ from respondents in the characteristics being measured.
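A small numerical sketch may help show how this bias arises. All figures below are invented for illustration and do not come from the NHS: two groups differ both in how often they respond and in the characteristic being measured, so the unadjusted respondent average drifts away from the true population value.

```python
def true_mean(groups):
    """Population mean; groups is a list of (share, response_rate, mean)."""
    return sum(share * mean for share, _, mean in groups)


def respondent_mean(groups):
    """Mean among respondents only, with no weighting adjustment."""
    responding = sum(share * rate for share, rate, _ in groups)
    total = sum(share * rate * mean for share, rate, mean in groups)
    return total / responding


# HYPOTHETICAL population: (population share, response rate, rate of interest).
groups = [
    (0.8, 0.75, 0.10),  # larger group, responds more often
    (0.2, 0.50, 0.40),  # smaller group, responds less often
]

bias = respondent_mean(groups) - true_mean(groups)
print(f"True rate:       {true_mean(groups):.3f}")
print(f"Respondent rate: {respondent_mean(groups):.3f}")
print(f"Bias:            {bias:+.3f}")
```

If both groups responded at the same rate, or had the same underlying rate, the bias would vanish. A lower response rate alone does not create bias; differential response does.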
To assess this risk and determine strategies to mitigate it, Statistics Canada ran a simulation using 2006 Census long-form returns. This helped us to determine where bias was most likely to occur in a situation of reduced response. Some critics erroneously cited the agency’s study as proof that these risks would inevitably occur. However, they did not consider the mitigation strategies that Statistics Canada had implemented in response to this study. In fact, Statistics Canada used the results of the simulation to inform and guide the design of collection approaches, processing and estimation to ensure these risks were minimized.
To adjust for non-response bias, Statistics Canada deployed a unique, powerful process not available to other survey research organizations. For every NHS sampled dwelling, Statistics Canada had the corresponding 2011 Census record that gave basic demographic and language characteristics for each household member. Furthermore, the census was linked to tax files, immigrants’ landing files and the Indian Register, all of which provided additional information on both NHS respondents and non-respondents. This information was used to run estimation models for weighting areas of approximately 6,000 people each, to ensure that the NHS estimates reflected the true population profiles as closely as possible. Statistics Canada was able to measure and confirm that this process improved the quality of the estimates; however, we could not completely eliminate the volatility inherent in estimates for smaller populations.
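The general idea of using known population information to correct for uneven response can be sketched with simple post-stratification. This is a stand-in technique chosen for illustration, not the estimation models Statistics Canada actually ran, and the data, group labels and function names are all hypothetical. Respondent weights are rescaled so that each group's weighted total matches a count known from an outside source such as a census.

```python
from collections import defaultdict


def poststratified_weights(respondents, population_counts):
    """Rescale design weights so each group's total matches its known count."""
    totals = defaultdict(float)
    for r in respondents:
        totals[r["group"]] += r["design_weight"]
    return [
        r["design_weight"] * population_counts[r["group"]] / totals[r["group"]]
        for r in respondents
    ]


def weighted_mean(values, weights):
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


# HYPOTHETICAL respondents: group B is under-represented (2 of 12 respondents,
# against 20% of the population). y is the characteristic of interest (0 or 1).
respondents = (
    [{"group": "A", "design_weight": 3.0, "y": y} for y in [1] + [0] * 9]
    + [{"group": "B", "design_weight": 3.0, "y": y} for y in [1, 0]]
)
population_counts = {"A": 800, "B": 200}  # assumed known, e.g. from a census

y = [r["y"] for r in respondents]
design = [r["design_weight"] for r in respondents]
weights = poststratified_weights(respondents, population_counts)

unadjusted = weighted_mean(y, design)
adjusted = weighted_mean(y, weights)
print(f"Unadjusted estimate: {unadjusted:.3f}")
print(f"Adjusted estimate:   {adjusted:.3f}")
```

The adjustment restores each group's correct share of the total, but it can only correct for differences captured by the auxiliary variables, and small groups with few respondents still yield volatile estimates.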
Finally, having generated the estimates from the NHS, Statistics Canada set out to validate them. Subject-matter specialists assessed the validity by comparing the NHS estimates with internal and external benchmarks. This information was used to assess the potential for bias in relation to the level of non-response, and guided the establishment of the quality thresholds for release. When quality issues were found, the estimates were either not released, or released with an accompanying cautionary note.
The comparison to external benchmarks was conducted at the lowest level of geographic detail that the available alternate data sources allowed. At finer levels of detail, the validation mostly relied on the execution of the sound processes described above, as well as the knowledge and experience of subject-matter specialists. The same approach has always been used to validate the census long-form survey estimates at finer levels of geographic detail.
2011 National Household Survey—dispelling the myths
The results of this validation process, with some specific exceptions, confirmed the good quality of the estimates and provided evidence that some of the concerns raised before NHS collection had not materialized. The most common concerns related to the potential underestimation of three important population groups: new immigrants to Canada, Aboriginal peoples and Canadians in low-income groups. Since the publication of the NHS results, many have continued to claim that these undercounts occurred. The published facts do not support all of these claims. Let’s take them one by one.
We know the actual number of immigrants admitted to Canada between censuses from Citizenship and Immigration Canada administrative records. In evaluating our data, Statistics Canada routinely compares this number to the estimated number of recent immigrants based on NHS data. The comparison of NHS estimates to this reliable source showed that recent immigrants were no more underestimated in the NHS than in previous censuses.
For Aboriginal peoples, the risk of non-response bias is minimal for the Aboriginal population living on reserves and in the territories, given that they are relatively homogeneous populations (with respect to Aboriginal identity) and that they were weighted independently from the population living off reserve in the provinces. For Inuit and off-reserve Aboriginal peoples, we made use at the microdata level of two independent data sources: the Indian Register, which covers the registered Indian population of Canada, and the 2006 Census. We also took advantage of the language variables on the 2011 Census.
The certification analyses, as highlighted in our published reference guide, revealed very little evidence of bias for the estimates of the Aboriginal population off reserve, but indicated that the population of Inuit living in the provinces may have been overestimated.
The 2011 NHS provides valuable information about the composition, characteristics and distribution of income received by Canadians. As with all other data from the NHS, the quality of the 2011 NHS income estimates was evaluated prior to publication. The low-income rate was one indicator that showed different trends when compared with previous censuses and other income sources. The results of Statistics Canada’s data evaluation highlighted possible overestimation of the prevalence of low income.
Statistics Canada, therefore, cautioned users that low-income results from the NHS should not be compared with those from the earlier censuses. Low-income rates from the NHS can still be used to identify which groups are at higher risk of poverty, which is important information for policy development. With its large sample size, the NHS can also be used to estimate low-income rates at the provincial, sub-provincial and census metropolitan area (CMA) levels, allowing users to make comparisons across Canada for particular subgroups.
2011 Census results, lessons learned and going forward
Where does all this leave us? While the 2011 National Household Survey does not entirely rise to the data quality of the 2006 Census long form, the estimates that have been published are, nonetheless, robust and entirely usable. Where data quality was judged insufficient, the estimates were not published. Where issues of potential bias or lack of comparability with previous censuses were identified, they were communicated at the time of release of the information.
Statistics Canada recognizes that there could be flaws in the NHS that have not yet been identified. As Chief Statistician of Canada, I invite anyone who has discovered such issues to share them, along with the supporting evidence. The science of statistics is dynamic. Learning more about these issues and questions will help us improve the census program going forward.