Data quality, concepts and methodology: Data quality



Sampling errors

Sampling errors occur because inferences about the entire population are based on information obtained from only a sample of the population. The sample design, the variability of the data, and the sample size determine the size of the sampling error. In addition, for a given sample design, different methods of estimation will result in different sampling errors.

The design for the 2009 Survey of Household Spending was a stratified multi-stage sampling scheme. Sampling errors for multi-stage sampling are usually higher than for a simple random sample of the same size. However, the operational advantages outweigh this disadvantage, and stratifying the sample improves the precision of estimates.

Data variability is the difference between members of the population with respect to spending on a specific item or the presence of a specific dwelling characteristic or piece of household equipment. In general, the greater these differences are, the larger the sampling error will be. In addition, the larger the sample size, the smaller the sampling error.

Standard error and coefficient of variation

A common measure of sampling error is the standard error (SE). Standard error is the degree of variation in the estimates as a result of selecting one particular sample rather than another of the same size and design. It has been shown that the "true" value of the characteristic of interest lies within a range of +/- 1 standard error of the estimate for 68% of all samples, and +/- 2 standard errors for 95% of all samples.

The coefficient of variation (CV) is the standard error expressed as a percentage of the estimate. It is used to indicate the degree of uncertainty associated with an estimate. For example, if the estimate of the number of households having a given dwelling characteristic is 10,000 households and the corresponding CV is 5%, then the standard error is 500 households. The true value is therefore between 9,500 and 10,500 households 68% of the time, and between 9,000 and 11,000 households 95% of the time.
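The arithmetic in this example can be sketched in a few lines of Python. The function name `cv_interval` and its parameters are illustrative and not part of any Statistics Canada tool; the only inputs are the estimate and CV quoted above.

```python
def cv_interval(estimate, cv_percent, n_se=1):
    """Return (low, high) bounds n_se standard errors either side of the estimate."""
    # The CV is the standard error expressed as a percentage of the estimate,
    # so the standard error is recovered by reversing that percentage.
    se = estimate * cv_percent / 100.0
    return estimate - n_se * se, estimate + n_se * se


# Figures from the text: estimate of 10,000 households with a CV of 5%.
print(cv_interval(10_000, 5, n_se=1))  # (9500.0, 10500.0), the 68% range
print(cv_interval(10_000, 5, n_se=2))  # (9000.0, 11000.0), the 95% range
```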

Standard errors for the 2009 Survey of Household Spending were estimated using the "bootstrap" method. This method is suitable for variance estimation of non-smooth statistics such as quintiles. For more information on standard errors and coefficients of variation, refer to the Statistics Canada publication, Methodology of the Canadian Labour Force Survey, Catalogue no. 71-526-X.
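As a rough illustration of the bootstrap idea only, the sketch below resamples a made-up simple random sample with replacement and takes the standard deviation of the replicate estimates as the standard error of a quintile boundary. This is a simplification: it does not reproduce the survey's own procedure, which this section does not describe and which would have to reflect the stratified multi-stage design. All names and data are hypothetical.

```python
import random
import statistics


def first_quintile_cutoff(values):
    """Upper boundary of the lowest quintile (the 20th percentile)."""
    # With n=5, statistics.quantiles returns the four cut points between quintiles.
    return statistics.quantiles(values, n=5)[0]


def bootstrap_se(values, statistic, n_replicates=500, seed=12345):
    """Estimate the standard error of `statistic` by simple bootstrap resampling."""
    rng = random.Random(seed)
    replicates = []
    for _ in range(n_replicates):
        # Draw a resample of the same size, with replacement.
        resample = rng.choices(values, k=len(values))
        replicates.append(statistic(resample))
    # The spread of the replicate estimates approximates the sampling error.
    return statistics.stdev(replicates)


# Made-up, positively skewed "spending" values for 200 households.
data_rng = random.Random(0)
spending = [data_rng.lognormvariate(7.0, 0.8) for _ in range(200)]

estimate = first_quintile_cutoff(spending)
se = bootstrap_se(spending, first_quintile_cutoff)
cv_percent = 100 * se / estimate
print(f"estimate={estimate:.1f}  SE={se:.1f}  CV={cv_percent:.1f}%")
```

Resampling avoids the need for an analytic variance formula, which is why the approach is convenient for statistics such as quintiles.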

Users should note that the variance of the estimates for the 2009 survey is comparable to that for 2008, which was higher than in 2007. Because the sample size was reduced, CVs are generally larger than for years prior to 2008.

Data suppression

For reliability reasons, estimates with CVs greater than 33% are suppressed. Since CVs are not calculated for all estimates, data suppression for the Survey of Household Spending is based on a relationship between the CV and the number of households reporting expenditure on an item. Analysis of past survey results indicates that CVs usually reach 33% when the number of households reporting an item drops to about 30. Therefore, data are suppressed for spending on items reported by fewer than 30 households.

However, data for suppressed items do contribute to summary level variables. For example, the expenditure for a particular category of clothing might be suppressed but this amount forms part of the total estimate for clothing expenditure.
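The suppression rule and its effect on summary totals can be sketched as follows. The threshold of 30 reporting households comes from the text; the item names, amounts and the function `publishable` are made up for illustration.

```python
MIN_REPORTING_HOUSEHOLDS = 30  # threshold stated in the text


def publishable(estimate, n_reporting):
    """Return the estimate, or None if it must be suppressed."""
    return estimate if n_reporting >= MIN_REPORTING_HOUSEHOLDS else None


# Hypothetical items in one category: (estimate, households reporting the item).
items = {
    "item A": (412.0, 85),
    "item B": (37.5, 12),  # fewer than 30 reporting households
}

published = {name: publishable(est, n) for name, (est, n) in items.items()}

# Suppressed items still contribute to the summary-level total.
category_total = sum(est for est, _ in items.values())

print(published)
print(category_total)
```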

Because of a reduced sample size, there are more cells suppressed than in 2007 and previous years, particularly for smaller areas such as metropolitan areas. Data for Quebec City, Ottawa and Victoria are suppressed for this reason in 2009.

Non-sampling errors

Non-sampling errors occur because certain factors make it difficult to obtain accurate responses or responses that retain their accuracy throughout processing. Unlike sampling error, non-sampling error is not readily quantified. Four sources of non-sampling error can be identified: coverage error, response error, non-response error, and processing error.

Coverage error

Coverage error results from inadequate representation of the intended population. This error may occur during sample design or selection, or during data collection and processing.

Response error

Response error may be due to many factors, including faulty design of the questionnaire, interviewers' or respondents' misinterpretation of questions, or respondents' faulty reporting.

Several features of the survey help respondents recall their expenditures as accurately as possible. First, the survey period is the calendar year because it is probably more clearly defined in people's minds than any other period of similar length. Second, expenditure on food can be estimated as either a weekly or a monthly expense, depending on the respondent's purchasing habits. Third, expenses on smaller items purchased at regular intervals are usually estimated on the basis of amount and frequency of purchase. Purchases of large items (automobiles, for example) are recalled fairly easily, as are expenditures on rent, property taxes, and monthly payments on mortgages. However, even with these items, the accuracy of data depends on the respondent's ability to remember and willingness to consult records.

In the Survey of Household Spending, the difference between receipts and disbursements is calculated as a check on respondents' recall. This important quality control tool involves balancing receipts (income and other money received by the household) against disbursements (total expenditure plus the variable "Money flows - assets, loans, and other debts") for each questionnaire. If the difference is greater than 30% of the larger of receipts or disbursements, the record is considered unusable and is excluded.
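The balance check described above amounts to a simple comparison, sketched here. The function name `is_usable` and the dollar amounts are hypothetical; only the 30% tolerance comes from the text.

```python
def is_usable(receipts, disbursements, tolerance=0.30):
    """Apply the receipts/disbursements balance check to one questionnaire."""
    difference = abs(receipts - disbursements)
    larger = max(receipts, disbursements)
    # The record is rejected only when the difference exceeds 30% of the larger amount.
    return difference <= tolerance * larger


# Difference of 10,000 against a limit of 18,000 (30% of 60,000): record kept.
print(is_usable(receipts=60_000, disbursements=50_000))

# Difference of 20,000 against the same limit of 18,000: record rejected.
print(is_usable(receipts=40_000, disbursements=60_000))
```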

In 2007, in order to reduce respondent burden, new screening questions were added to the questionnaire for some categories. This first series of screening questions was ambiguous and required changes. The changes made in the 2008 survey seemed to resolve the problem, so the questions remained the same in 2009.

Users should therefore be aware that for some expense categories,¹ decreases between 2006 and 2007 and increases between 2007 and 2008 or 2007 and 2009 are likely the result of the way the question was asked in 2007. Therefore, these changes should be discounted. The exception appears to be "Maps", where the decrease has persisted and may represent a real change in purchase patterns because of new GPS technology.

Non-response error

Non-response error occurs in sample surveys because not all potential respondents provide complete information.

Total non-response occurs when the interviewer is unable to contact the respondent, no member of the household is able to provide information, or the respondent refuses to participate in the survey. Total non-response is handled by adjusting the basic survey weight for responding households to compensate for non-responding households. For the 2009 Survey of Household Spending, the overall response rate of usable questionnaires is 64.5%. See "Table 1" for provincial and territorial response rates.
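One common way to make such an adjustment, assumed here purely for illustration, is to multiply each responding household's basic weight by the ratio of the total weight of all sampled households to the total weight of respondents, so that the weighted total is preserved. The source does not spell out the actual adjustment classes or factors used for the survey; the function and data below are hypothetical.

```python
def adjust_for_nonresponse(households):
    """Return adjusted weights for responding households in one adjustment class.

    `households` is a list of dicts with a basic 'weight' and a 'responded' flag.
    """
    total_weight = sum(h["weight"] for h in households)
    responding_weight = sum(h["weight"] for h in households if h["responded"])
    if responding_weight == 0:
        raise ValueError("no responding households in this adjustment class")
    # Respondents are weighted up to also represent the non-respondents.
    factor = total_weight / responding_weight
    return [h["weight"] * factor for h in households if h["responded"]]


# Made-up sample: four households, one of which did not respond.
sample = [
    {"weight": 100.0, "responded": True},
    {"weight": 100.0, "responded": False},
    {"weight": 150.0, "responded": True},
    {"weight": 150.0, "responded": True},
]

adjusted = adjust_for_nonresponse(sample)
print(adjusted)       # [125.0, 187.5, 187.5]
print(sum(adjusted))  # 500.0, the same as the total basic weight
```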

In most cases, partial non-response occurs when the respondent does not understand or misinterprets a question, refuses to answer a question, or is unable to recall the requested information. Imputing missing values compensates for this partial non-response.

The importance of the non-response error is unknown, but in general this error is significant when a group of people with particular common characteristics refuses to cooperate and those characteristics are important determinants of survey results.

Text table 1

Response rates, 2009

Processing error

Processing errors may occur in any of the data processing stages, for example, during data entry, editing, weighting, and tabulation. See "Data processing and quality control" for a description of the steps taken to reduce processing error.

The effect of large values

For any sample, estimates can be affected by the presence or absence of extreme values from the population. These extreme values are most likely to arise from positively skewed populations. The nature of the subject matter of the SHS lends itself to such extreme values. Estimates of totals, averages and standard errors may be greatly influenced by the presence or absence of these extremes.
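A small made-up example shows how strongly a single extreme value can move both an average and its standard error. The standard error here uses the simple random sampling formula (standard deviation divided by the square root of the sample size), which is an assumption for illustration and not the survey's own variance estimator.

```python
import statistics


def mean_and_se(values):
    """Sample mean and its standard error under simple random sampling."""
    mean = statistics.mean(values)
    se = statistics.stdev(values) / len(values) ** 0.5
    return mean, se


# Ten hypothetical household expenditures on one item.
typical = [200, 220, 250, 260, 280, 300, 310, 330, 350, 400]

# The same sample with the last household replaced by an extreme spender.
with_extreme = typical[:-1] + [25_000]

mean_typical, se_typical = mean_and_se(typical)
mean_extreme, se_extreme = mean_and_se(with_extreme)

print(f"without extreme: mean={mean_typical:.0f}  SE={se_typical:.0f}")
print(f"with extreme:    mean={mean_extreme:.0f}  SE={se_extreme:.0f}")
```

Whether such a household happens to fall in the sample is a matter of chance, which is why estimates for skewed spending items can vary noticeably from one sample to the next.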

Comparability over time

Conducted since 1997, the Survey of Household Spending integrates most of the content found in the Family Expenditure Survey and the Household Facilities and Equipment Survey. Many variables from these two surveys are comparable to those in the Survey of Household Spending. However, some differences related to the methodology, to data quality and to definitions must be considered before making comparisons.

For more information, refer to Note to Former Users of Data from the Family Expenditure Survey and Note to Former Users of Data from the Household Facilities and Equipment Survey, Catalogue no. 62F0026M. Both documents are available free of charge on the Statistics Canada web site (www.statcan.gc.ca).

Historical data from the 1997 to the 2003 surveys of household spending have been re-weighted using the weighting methodology described in the section "Weighting". Historical comparisons between data from those years and data from recent years of the Survey of Household Spending should generally be made with re-weighted data, although the differences between survey estimates from the old and new methodologies appear to be minimal at a summary level. Certain populations or variables, however, may be more strongly affected.

Starting with the 1997 Survey of Household Spending, "Tenant's maintenance, repair and alterations" and "Insurance premiums" were reduced by the proportion of rent charged to business. This may affect comparisons with data from previous years.

For the 2001 and 2005 reference years, extra questions were included for use in the weighting of the Consumer Price Index. This change may affect some historical comparisons. For example, in 2001 and 2005, questions were added under "Personal care" to collect extra information about hair care products, makeup, fragrances, deodorants and oral hygiene products. As a result of these extra questions, respondents may have given more precise information and the increase in the estimated expenditures for "Personal care" in those years may have been caused by an improvement in respondent recall. The effect of additional questions on estimates is difficult to quantify. However, in 2002, when the extra questions were removed, the estimate for personal care spending decreased again. For the 2006 SHS and subsequent years the extra questions of 2005 were retained.

The section of the questionnaire which covers "Repairs and improvements of owned principal residences" was extensively revised in 2004. From 1997 to 2003, this section had three broad questions: "Additions, renovations and other alterations"; "Replacement or new installation of built-in equipment, appliances and fixtures"; and "Repairs and maintenance". Since the 2004 Survey of Household Spending, the expenses for "Repairs and maintenance" and "Improvements and alteration" are reported separately for each category.

Beginning with the 2006 survey, computer-assisted personal interviews (CAPI) replaced the previous paper questionnaire. Household members, dwelling characteristics, and household facilities and equipment are all reported as of the time of the interview, instead of as of December 31 as in previous years. Household spending was collected for the reference year for all members of the household as of the time of the interview, eliminating the distinction between part-year and full-year members and households.

The CANSIM tables prior to 2006 were based on full-year households only. To maintain comparability with the 2006 and later data, the data for 1997 to 2005 have been revised to include both full-year and part-year households.
