Sampling and non-sampling errors

Beyond the conceptual differences, several kinds of error can also help explain differences in the output of the programs that generate income data. They are usually classified into two broad types: sampling errors and non-sampling errors.

Sampling errors occur because inferences about the entire population are based on information obtained from only a sample of that population. Because SLID and the long-form Census are sample surveys, their estimates are subject to this type of error. The coefficient of variation is a measure of the extent to which an estimate could vary if a different sample had been used, and it gives an indication of the confidence that can be placed in a particular estimate. This data quality measure will be used later in this paper to help explain why some of SLID's estimates, which are based on a smaller sample, might differ from those of the other programs generating income data. While the Census is also subject to this type of error, reliable estimates can be made for much smaller populations because the sampling rate is much higher for the Census (20%).1
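
As a rough illustration, the coefficient of variation is simply the standard error of an estimate expressed as a percentage of the estimate itself, so it shrinks as the sample grows. The sketch below uses hypothetical figures and the simple-random-sampling formula for the standard error; it is not the variance-estimation method actually used by SLID or the Census, which must account for their complex survey designs.

```python
import math

def coefficient_of_variation(estimate: float, standard_error: float) -> float:
    """Standard error expressed as a percentage of the estimate."""
    return 100.0 * standard_error / estimate

# Hypothetical figures: a mean-income estimate of $30,000 and a population
# standard deviation of $25,000.  Under simple random sampling the standard
# error is sd / sqrt(n), so a larger sample yields a smaller CV.
for n in (1_000, 25_000):
    se = 25_000 / math.sqrt(n)
    print(n, round(coefficient_of_variation(30_000, se), 1))
# 1000   2.6
# 25000  0.5
```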

Non-sampling errors can be further divided into coverage errors, measurement errors (respondent, interviewer, questionnaire, collection method…), non-response errors and processing errors. Coverage errors are generally not well measured for income and are usually inferred from data confrontation exercises such as this one. Section 3 will review the population exclusions and other known coverage differences between the sources.

The issues raised by the various collection methods or mixed response modes, and the different types of measurement error that can arise, will be addressed in Section 4.

Non-response can be an issue for surveys. It is not always possible to contact household members and convince them to respond, and even when a household does respond, it may not provide valid answers to every question. In both cases adjustments are made to the data, but error may result because the quality of the adjustments often depends on the non-respondents being similar to the respondents. For the 2005 income year, SLID had a response rate of 73.3%, while for the Census it was close to 97%. Still for 2005, because of item non-response, all income components were imputed for 2.7% of SLID's respondents and at least some components were imputed for another 23.5%.2 In the case of the Census, income was totally imputed for 9.3% and partially imputed for 29.3%.
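
To make the dependence on that assumption concrete, the following is a minimal sketch of one common style of total non-response adjustment, a response-class reweighting in which respondents' design weights are inflated by the inverse of the response rate in their class. The classes, weights and response flags are illustrative only; this is not claimed to be the actual SLID or Census procedure.

```python
from collections import defaultdict

households = [
    # (response class, design weight, responded?) -- illustrative data only
    ("urban", 50.0, True), ("urban", 50.0, False), ("urban", 50.0, True),
    ("rural", 80.0, True), ("rural", 80.0, True),  ("rural", 80.0, False),
]

# Total design weight of all sampled households and of respondents, by class
weight_all = defaultdict(float)
weight_resp = defaultdict(float)
for cls, w, responded in households:
    weight_all[cls] += w
    if responded:
        weight_resp[cls] += w

# Respondents' weights are scaled up so they also represent the
# non-respondents of their class -- unbiased only if non-respondents
# resemble the respondents within each class.
adjusted = [
    (cls, w * weight_all[cls] / weight_resp[cls])
    for cls, w, responded in households if responded
]
print(adjusted)
# [('urban', 75.0), ('urban', 75.0), ('rural', 120.0), ('rural', 120.0)]
```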

In administrative data, and the personal tax returns in particular, filing rates for specific populations may depend on a variety of factors (amount owed, financial activity during the year, personal interest, eligibility requirements of support programs, etc.), and this too can result in differences in the estimates generated by the programs producing income data.

The systems and procedures used to process the data in each of the programs are different and may have design variations that affect the data in particular ways. Where such discrepancies have been identified, they will be mentioned in Section 5. Beyond the design variations, most processing errors in these data sources are thought to be detected and corrected before the data are released to the public. However, due to the complexity of the processing systems and the modifications made to them each year, some errors may remain undetected, and these are therefore quite difficult to quantify.

More detail on the quality and methods of individual statistical programs is accessible through the Surveys and statistical programs by subject section on Statistics Canada's website.


Notes

  1. The sampling error for one-year estimates of individual income based on the LAD would be of a similar magnitude, as its sampling rate is also one in five.
  2. Data Quality in the 2005 Survey of Labour and Income Dynamics, C. Duddek, Income Research Paper Series, Statistics Canada catalogue no. 75F0002-No.003, May 2007.