Section 7: Data quality

Non-sampling errors

Errors that are not related to sampling may occur at almost every phase of a survey operation. Interviewers may misunderstand instructions, respondents may make errors in answering questions, the answers may be incorrectly entered, and errors may be introduced in the processing and tabulation of the data. These are all examples of non-sampling errors.

Over a large number of observations, randomly occurring errors will have little effect on estimates derived from the survey. However, errors occurring systematically will contribute to biases in the survey estimates. Quality assurance measures are implemented at each step of the data collection and processing cycle to monitor the quality of the data. These measures include the use of highly skilled interviewers, extensive training of interviewers with respect to the survey procedures and questionnaire, observation of interviewers to detect problems of questionnaire design or misunderstanding of instructions, edits to ensure data entry errors are minimized, and coding and edit quality checks to verify the processing logic.

Sampling errors

The Labour Force Survey collects information from a sample of households. Somewhat different figures might have been obtained if a complete census had been taken using the same questionnaires, interviewers, supervisors, processing methods, etc. The difference between the estimates obtained from a sample and those that a complete count taken under similar conditions would give is called the sampling error, also referred to as the precision of the estimate or sampling variability.

Three related methods can be used to interpret and evaluate sampling error or the precision of the estimates: the standard error, and two other methods also based on the standard error, namely coefficients of variation and confidence intervals. These methods can be used to conduct hypothesis tests.

Approximate measures of sampling error accompany Labour Force Survey products and users are urged to make use of them while analyzing the data. All seasonally adjusted CANSIM tables include standard errors for monthly estimates, month-to-month estimates of change, and year-to-year estimates of change. An economic region table includes the standard error of the estimate and of the year-over-year change. Tables 7.1, 7.2, and 7.3 can be used to obtain approximate CVs for most other released estimates.

With the release of occupation estimates based on the 2011 National Occupation Classification (NOC 2011) in January 2016, CVs and standard errors are available for all CANSIM tables with occupation data upon request. For these measures or for measures for other series, please contact Statistics Canada's Statistical Information Service (toll-free 1-800-263-1136; international 1-514-283-8300; STATCAN.infostats-infostats.STATCAN@canada.ca).

Another option is to obtain direct access to Labour Force Survey data and bootstrap weights through the Research Data Centres (RDC). See Access to microdata in Section 9 for more information.

Interpretation using standard error

The standard error is a numerical measure of the sampling error that quantifies how different the estimates from all potential samples would be from one another, assuming the same sampling plan. On its own, it is a value that can be difficult to interpret, but it is used in developing other more intuitive measures, including coefficients of variation and confidence intervals. Standard errors are also useful in data analysis with hypothesis tests.

Although the concept of standard error is based on the idea of selecting several samples, in practice only one sample is drawn and the standard error is estimated based on the information collected from the units in that sample.

The standard error depends on the sample size, the response rate, the size of the population, the variability of the characteristic of interest in the population, and the sample design and estimation methods. Typically, among similar estimates, the one based on the larger sample will have the smaller standard error.

Interpretation using coefficient of variation

Coefficients of variation (CVs) are widely used in practice to report the sampling error of survey estimates. One feature of CVs is that they are a relative measure, meaning that the quality of estimates of varying sizes can be compared. To obtain the CV, the standard error is divided by the estimate.

Small CVs are desirable because they indicate that the sampling variability is small relative to the estimate. Since the CV is the standard error expressed as a percentage of the estimate, the smaller the estimate, the larger the CV (all other things being equal). For example, when the unemployment rate is high, the CV may be small. If the unemployment rate falls due to improved economic conditions, then the corresponding CV will become larger.
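The relationship between the CV, the standard error and the estimate can be sketched in a few lines of Python. This is only an illustration: the function name and the figures are hypothetical and are not taken from LFS tables, and the standard error is held fixed purely to show the effect of a smaller estimate.

```python
def coefficient_of_variation(estimate: float, standard_error: float) -> float:
    """Return the CV, i.e. the standard error as a percentage of the estimate."""
    if estimate == 0:
        raise ValueError("the CV is undefined for an estimate of zero")
    return 100.0 * standard_error / estimate


# Hypothetical figures: the same standard error applied to two estimates.
cv_larger_estimate = coefficient_of_variation(40_000, 2_000)   # 5.0
cv_smaller_estimate = coefficient_of_variation(20_000, 2_000)  # 10.0
```

With the standard error unchanged, halving the estimate doubles the CV, which is the behaviour described above.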

Interpretation using confidence intervals

Confidence intervals provide another way of looking at the variability inherent in estimates of sample surveys. A confidence interval is a range of values that has a probability, known as the level of confidence, of containing the actual value. In other words, a 95% confidence interval means that if a large number of samples were drawn and a confidence interval was constructed for each sample, 95% of the constructed confidence intervals should contain the actual value. To illustrate how to calculate the confidence interval, let us say that one month the published estimate for total employment rose by 60,000 and the associated standard error for the movement estimate is 25,000. We can say that:

A 95% confidence interval can be constructed by adding and subtracting 50,000 (two standard errors) from 60,000. This means that there are approximately nineteen chances in twenty (95%) that the range (10,000 to 110,000) contains the real value of the change between the two months.
If one standard error (25,000) is added and subtracted from 60,000, a 68% confidence interval is constructed. This means that there are approximately two chances in three (68%) that the range (35,000 to 85,000) contains the real value of the change between the two months.
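The two intervals above can be reproduced with a short Python sketch. The function name and its parameters are illustrative only; the figures are the ones used in the example.

```python
def confidence_interval(estimate, standard_error, n_standard_errors=2):
    """Return (lower, upper) bounds: the estimate plus or minus a number of standard errors."""
    margin = n_standard_errors * standard_error
    return estimate - margin, estimate + margin


# Employment change of 60,000 with a standard error of 25,000.
ci_95 = confidence_interval(60_000, 25_000, n_standard_errors=2)  # (10000, 110000)
ci_68 = confidence_interval(60_000, 25_000, n_standard_errors=1)  # (35000, 85000)
```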

Conducting hypothesis tests

Standard errors may also be used to perform hypothesis testing, a procedure for distinguishing between population parameters using sample estimates. The larger the observed change between two estimates relative to its standard error, the better the chance that we are observing a real change, as opposed to a change due to sampling variability.

One simple way to conduct a hypothesis test is with a confidence interval. If the 95% confidence interval of an observed estimate of change does not contain zero then the change is considered statistically significant at the 5% level of significance. The level of significance is the probability of concluding that there is a change, when in fact the actual change is zero. If the confidence interval of the estimate does contain zero, it is less likely to reflect a real change and more likely to be due to sampling variability.

To illustrate, let us say that between two months, the published estimate for total employment is an increase of 60,000 and the associated standard error for the movement estimate is 25,000. Since the 95% confidence interval (10,000 to 110,000) does not contain zero, this change in employment is considered significant at the 5% level of significance.
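As a rough sketch, the test described above amounts to checking whether zero lies inside the confidence interval of the change. The function below is hypothetical, uses two standard errors for the 95% interval as in the example, and adds a second, invented change of 40,000 purely for contrast.

```python
def is_significant_change(change, standard_error, n_standard_errors=2):
    """Return True when the confidence interval of the change excludes zero."""
    lower = change - n_standard_errors * standard_error
    upper = change + n_standard_errors * standard_error
    return not (lower <= 0 <= upper)


# Example from the text: interval (10,000 to 110,000) excludes zero.
published_change = is_significant_change(60_000, 25_000)     # True
# Invented contrast case: interval (-10,000 to 90,000) contains zero.
hypothetical_change = is_significant_change(40_000, 25_000)  # False
```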

Using approximate sampling variability tables

In practice, standard errors are not provided with all published estimates so approximate CV tables are provided to allow users to obtain CVs. Standard errors can be calculated from a CV by multiplying the CV by the estimate. Standard errors can then be used to produce confidence intervals and perform hypothesis tests, as described above.
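The conversion from a CV back to a standard error might look like the following Python sketch. The function name is an assumption; the figures (an estimate of 34,700 with an approximate CV of 5%) are the Newfoundland and Labrador values used later in this section.

```python
def standard_error_from_cv(estimate: float, cv_percent: float) -> float:
    """Return the standard error implied by a CV expressed as a percentage."""
    return cv_percent * estimate / 100.0


# 34,700 unemployed with an approximate CV of 5%.
se_unemployed = standard_error_from_cv(34_700, 5.0)  # 1735.0
```

The resulting standard error can then be used to build a confidence interval or to carry out a hypothesis test in the usual way.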

Three tables are available: 7.1 for monthly totals for Canada and the provinces, 7.2 for annual averages for Canada and the provinces, and 7.3 for three-month moving averages and annual averages for the territories.

These tables are supplied as a rough guide to the sampling variability. The sampling variability is modeled so that, given an estimate, approximately 75% of the actual CVs will be less than or equal to the CVs derived from the table. There will, however, be 25% of the actual CVs that will be somewhat higher than the ones given in the table.

Table 7.1

Coefficient of variation (CV) for estimates of monthly totals, Canada and provinces

Table 7.2

Coefficient of variation (CV) for estimates of annual averages, Canada and provinces

Table 7.3

Coefficient of variation (CV) for estimates of three-month moving averages and annual averages, territories

Variability of monthly estimates for Canada and the provinces

To look up an approximate measure of the CV of an estimate of a monthly total, please consult Table 7.1, which gives the size of the estimate as a function of the geography and the CV. The rows give the geographic area of the estimate, while the columns indicate the resulting level of accuracy in terms of the CV, given the size of the estimate. To determine the CV for an estimate of size X in area A, look across the row for area A and find the first estimate that is less than or equal to X. The title of that column gives the approximate CV. For example, to determine the CV for an estimate of 34.7 thousand unemployed in Newfoundland and Labrador in November 2015, we find the closest but smaller estimate of 27.2 thousand, which is in the 5% column. Therefore, the estimate of 34,700 unemployed in Newfoundland and Labrador has a CV of roughly 5%.
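The look-up procedure can be sketched in Python as a scan across one row of the table. Everything here is an assumption made for illustration: the function name, the representation of a row as (CV, minimum estimate) pairs, and the row itself. Only the 27.2 thousand entry in the 5% column comes from the example above; the other thresholds are placeholders and are not actual Table 7.1 values.

```python
from typing import Optional, Sequence, Tuple


def lookup_cv(estimate_thousands: float,
              row: Sequence[Tuple[float, float]]) -> Optional[float]:
    """Scan a table row and return the CV (%) of the first column whose
    estimate is less than or equal to the estimate of interest.

    `row` holds (cv_percent, estimate_in_thousands) pairs ordered from the
    smallest CV to the largest. Returns None when the estimate is smaller
    than every entry in the row.
    """
    for cv_percent, threshold in row:
        if threshold <= estimate_thousands:
            return cv_percent
    return None


# Illustrative row only. The 5% entry (27.2) is quoted in the text;
# the 1%, 2.5% and 10% entries are PLACEHOLDERS, not published values.
illustrative_row = [(1.0, 250.0), (2.5, 90.0), (5.0, 27.2), (10.0, 7.0)]

cv_unemployed = lookup_cv(34.7, illustrative_row)  # 5.0
```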

The CV values given in Table 7.1 are derived from a model based on LFS sample data for the 48-month period from January 2012 through December 2015 inclusive. It is important to bear in mind that these values are approximations.

Table 7.1 can be used with either seasonally adjusted estimates, or with estimates that have not been seasonally adjusted. Studies have shown that LFS standard errors for seasonally adjusted data are close to those for unadjusted data, particularly when estimates are for larger populations and domains.

Variability of annual estimates for Canada and the provinces

To look up an approximate measure of the CV of an estimate of an annual average, please consult Table 7.2, which gives the size of the estimate as a function of the geography and the CV. The rows give the geographic level of the estimate, while the columns indicate the resulting level of accuracy in terms of the CV, given the size of the estimate. To determine the CV for an estimate of size X in area A, look across the row for area A and find the first estimate that is less than or equal to X. The title of that column gives the approximate CV. For example, to determine the CV for an annual average estimate of 34.7 thousand unemployed in Newfoundland and Labrador in 2015, we find the closest but smaller estimate of 29.3 thousand, which is in the 2.5% column. Therefore, the estimate of 34,700 unemployed in Newfoundland and Labrador has a CV of roughly 2.5%.

The CV values given in Table 7.2 are derived from a model based on LFS sample data for the 5-year period from 2011 to 2015. It is important to bear in mind that these values are approximations.

Sampling variability tables for the territories

The CV values for three-month moving averages given in Table 7.3 for the Yukon, Northwest Territories and Nunavut are derived from a model based on LFS sample data for the 48-month period of January 2012 through December 2015 inclusive. The CV values for annual averages given in the same table are derived from a model based on LFS sample data for the 5-year period of 2011 to 2015.

Variability of rates

Estimates that are rates and percentages are subject to sampling variability that is related to the variability of the numerator and the denominator of the ratio. The various rates given are treated differently because some of the denominators are calibrated figures that have no sampling variability associated with them.

Unemployment rate

The unemployment rate is the ratio of X, the total number of unemployed in a group, to Y, which is the total number of participants in the labour force in the same group. Here the group may be a province or CMA and/or it may be an age-sex group.

The CV for the unemployment rate can be estimated with the following formula:

[CV(X/Y)]² = [CV(X)]² + [CV(Y)]² - 2p[CV(X)][CV(Y)]

where CV(X) would be the CV for the total number of unemployed in a specific geographic or demographic subgroup and CV(Y) would be the CV for the total number of participants in the labour force in the same subgroup. The correlation coefficient, denoted p, measures the amount of linear association between X and Y (respectively, the number of unemployed and the number of participants in the labour force in the same subgroup). The value of p ranges between -1 and 1. For example, a strong positive linear association would indicate that unemployment counts generally increase as the total number of participants in the labour force increases. Note that we can expect a larger CV for the unemployment rate when p is negative, since in this case, the third term on the right side of the equation above becomes positive.

When p is not available, the most conservative approach is to take p = -1. The right side of the equation then becomes [CV(X) + CV(Y)]², which leads to the simplified formula:

CV(X/Y) = CV(X) + CV(Y)

Note that this will likely lead to an overestimation of the CV(X/Y).

For example, in November 2015, there were 34,700 unemployed persons in Newfoundland and Labrador and 268,900 participants in the labour force, giving an unemployment rate of 12.9%. Table 7.1 gives the CVs for the two counts as 5.0% and 1.0% respectively. An approximation of the CV for the unemployment rate of 12.9% using the above formula would be:

5.0% + 1.0% = 6.0%

Note that, in the case of this particular estimate, the above approximation is only slightly above the 5.8% CV that is estimated using complex computer-intensive variance estimation methods.
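The calculation for the unemployment rate can be written as a small Python function. This is a sketch under the formula given above; the function name is hypothetical, CVs are expressed in percent, and p defaults to the conservative value of -1. The p = 0 case is shown only as an illustrative comparison.

```python
import math


def cv_of_ratio(cv_x: float, cv_y: float, p: float = -1.0) -> float:
    """Approximate CV (%) of the ratio X/Y from the CVs (%) of X and Y.

    Uses [CV(X/Y)]^2 = CV(X)^2 + CV(Y)^2 - 2p CV(X) CV(Y),
    where p is the correlation between X and Y.
    """
    if not -1.0 <= p <= 1.0:
        raise ValueError("p must lie between -1 and 1")
    squared = cv_x ** 2 + cv_y ** 2 - 2.0 * p * cv_x * cv_y
    # Guard against tiny negative values caused by floating-point rounding.
    return math.sqrt(max(squared, 0.0))


# Newfoundland and Labrador, November 2015: CVs of 5.0% and 1.0%.
cv_rate_conservative = cv_of_ratio(5.0, 1.0)         # p = -1 gives 6.0
cv_rate_uncorrelated = cv_of_ratio(5.0, 1.0, p=0.0)  # smaller than 6.0
```

Any value of p above -1 gives a CV below the conservative 6.0%, which is why the simplified formula tends to overestimate.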

Participation rate and employment rate

The participation rate represents the number of persons in the labour force expressed as a percentage of the total population size. The employment rate is the total number of employed divided by the total population size. For both the above rates, the numerator and the denominator represent the same geographic and demographic group.

For Canada, the provinces, CMAs and some age-sex groups, the LFS population estimates are not subject to sampling variability because they are calibrated to independent sources. Therefore, in the case of the participation rate and the employment rate of these geographic and demographic groups, the CV is equal to that of the contributing numerator.

Some subgroups of Canada, such as industry and occupation groups, are not calibrated to independent sources. For example, there is no official independent source for a monthly count of persons in the agriculture industry in Manitoba. To determine the CV of rates in the case of such subgroups, the variability of both the numerator and the denominator has to be taken into account because the denominator is no longer a controlled total and is subject to sampling variability. Therefore, for participation rates and employment rates of subgroups, the CV can be determined in the same way as for the unemployment rate. The totals in the numerator and denominator for the relevant rate should reflect the same subgroup.

Variability of estimates of change

The difference between estimates from two time periods gives an estimate of change that is also subject to sampling variability. Users are typically interested in determining whether this change is statistically significant. An estimate of year-to-year or month-to-month change is based on two samples which may have some households in common. Hence, the sampling variability of change depends on the sampling variability of the estimates for both periods and on the correlation, p, between the periods.

The value of p ranges between -1 and 1, with 1 being the perfect positive linear association. One can generally use the sample overlap to approximate the correlation coefficient as follows:

For the provinces: use p = 5/6 for month-to-month changes, and p = 0 for year-to-year changes.
Empirical studies at Statistics Canada have shown that for the provinces, a value of p equal to 5/6 is a good approximation for estimates of employment, but for estimates of unemployment, a p of 0.45 would yield a better approximation for month-to-month changes.

Typically, the CV of the estimate of change is not a useful measure for analysis but can be used to derive more useful statistics. As described in the sub-section entitled Conducting hypothesis tests, a hypothesis test can be conducted by using confidence intervals based on the standard error of the estimate. The standard error can be derived from the CV by multiplying the CV by the estimate of change (Y₂-Y₁). The CV for an estimate of change can be calculated from the CVs of the estimates from the two time periods using the following formula:

CV(Y₂ - Y₁) = √( [Y₁ CV(Y₁)]² + [Y₂ CV(Y₂)]² - 2p [Y₁ CV(Y₁)] [Y₂ CV(Y₂)] ) / (Y₂ - Y₁)    (1)

where Y₁ and Y₂ are the estimates for the two periods. The value of p is the correlation coefficient between Y₁ and Y₂.

Note: If the estimate of change (Y₂-Y₁) is negative then the calculated CV will be negative, but generally the equivalent positive value is reported.

When comparing the annual averages of two years, the CV of the annual estimates (Table 7.2) should be used. For month-to-month change, seasonally adjusted estimates should be used in conjunction with the CVs of the monthly estimates from Table 7.1. Note that the above formula gives an approximate estimate of the sampling variability associated with an estimate of change.
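As a hedged sketch, the variability of a change can be computed in Python by converting each period's CV to a standard error (CV multiplied by the estimate), combining the two standard errors with the correlation p, and dividing by the change. The function name and the figures in the example are hypothetical; only the value p = 5/6 for month-to-month provincial employment changes comes from the text.

```python
import math


def change_variability(y1, cv1_percent, y2, cv2_percent, p):
    """Return (standard error, CV in %) of the change y2 - y1.

    The standard error of each period is its CV times its estimate, and
    Var(y2 - y1) = se1^2 + se2^2 - 2 p se1 se2.
    """
    change = y2 - y1
    if change == 0:
        raise ValueError("the CV is undefined when the change is zero")
    se1 = cv1_percent * y1 / 100.0
    se2 = cv2_percent * y2 / 100.0
    variance = se1 ** 2 + se2 ** 2 - 2.0 * p * se1 * se2
    se_change = math.sqrt(max(variance, 0.0))
    return se_change, 100.0 * se_change / change


# Hypothetical employment estimates for two consecutive months, both with
# a CV of 2%, using p = 5/6 for a provincial month-to-month change.
se_change, cv_change = change_variability(100_000, 2.0, 104_000, 2.0, p=5 / 6)
```

The standard error returned here is the quantity needed to build a confidence interval around the change; the CV is negative when the change is negative, as noted above.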

Guidelines on data reliability

Household surveys within Statistics Canada generally use the following guidelines and reliability categories in interpreting CV values for data accuracy and in the dissemination of statistical information.

Category 1 - If the CV is ≤ 16.5% - no release restrictions: data are of sufficient accuracy that no special warnings to users or other restrictions are required.

Category 2 - If the CV is > 16.5% and ≤ 33.3% - release with caveats: data are potentially useful for some purposes but should be accompanied by a warning to users regarding their accuracy.

Category 3 - If the CV > 33.3% - not recommended for release: data contain a level of error that makes them so potentially misleading that they should not be released in most circumstances. If users insist on inclusion of Category 3 data in a non-standard product, even after being advised of their accuracy, the data should be accompanied by a disclaimer. The user should acknowledge the warnings given and undertake not to disseminate, present or report the data, directly or indirectly, without this disclaimer.
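The three categories can be expressed as a simple Python function. The function name is illustrative; the thresholds are those listed above, and the absolute value is taken so that a negative CV of a change is classified by its size.

```python
def reliability_category(cv_percent: float) -> int:
    """Return the reliability category (1, 2 or 3) for a CV given in percent."""
    cv = abs(cv_percent)
    if cv <= 16.5:
        return 1  # no release restrictions
    if cv <= 33.3:
        return 2  # release with caveats
    return 3      # not recommended for release


category_for_5_percent = reliability_category(5.0)    # 1
category_for_20_percent = reliability_category(20.0)  # 2
category_for_40_percent = reliability_category(40.0)  # 3
```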

Confidentiality release criteria

Statistics Canada is prohibited by law from releasing any data which would divulge information obtained under the Statistics Act that relates to any identifiable person, business or organization without the prior knowledge or the consent in writing of that person, business or organization. Various confidentiality rules are applied to all data that are released or published to prevent the publication or disclosure of any information deemed confidential. If necessary, data are suppressed to prevent direct or residual disclosure of identifiable data.

The LFS produces a wide range of outputs that contain estimates for various labour force characteristics. Most of these outputs are estimates in the form of tabular cross-classifications. Estimates are rounded to the nearest hundred and a series of suppression rules are used so that any estimate below a minimum level is not released.

The LFS suppresses estimates below the levels presented in Table 7.4.
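A minimal Python sketch of rounding and suppression is given below. It rests on several assumptions that the text does not confirm: the minimum size is passed in by the caller (the value of 500 in the example is a placeholder, not a Table 7.4 figure), the unrounded estimate is compared with the minimum, and ties are rounded up.

```python
import math
from typing import Optional


def releasable_estimate(estimate: float, minimum_size: float) -> Optional[int]:
    """Return the estimate rounded to the nearest hundred, or None if suppressed.

    Assumptions: suppression is checked on the unrounded estimate, and
    values exactly halfway between two hundreds are rounded up.
    """
    if estimate < minimum_size:
        return None  # suppressed: below the minimum size for release
    return math.floor(estimate / 100.0 + 0.5) * 100


# Placeholder minimum size of 500 (hypothetical, not from Table 7.4).
released = releasable_estimate(34_662, minimum_size=500)  # 34700
suppressed = releasable_estimate(320, minimum_size=500)   # None
```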

Table 7.4

Minimum size for release, Canada, provinces and territories