# Section 7: Data quality

## Non-sampling errors

Errors that are not related to sampling may occur at almost every phase of a survey operation. Interviewers may misunderstand instructions, respondents may make errors in answering questions, the answers may be incorrectly entered, and errors may be introduced in the processing and tabulation of the data. These are all examples of non-sampling errors.

Over a large number of observations, randomly occurring errors will have little effect on estimates derived from the survey. However, errors occurring systematically will contribute to biases in the survey estimates. Quality assurance measures are implemented at each step of the data collection and processing cycle to monitor the quality of the data. These measures include the use of highly skilled interviewers, extensive training of interviewers with respect to the survey procedures and questionnaire, observation of interviewers to detect problems of questionnaire design or misunderstanding of instructions, edits to ensure data entry errors are minimized, and coding and edit quality checks to verify the processing logic.

## Sampling errors

The Labour Force Survey collects information from a sample of households. Somewhat different figures might have been obtained if a complete census had been taken using the same questionnaires, interviewers, supervisors, processing methods, etc. The difference between the estimates obtained from the sample and those that would be obtained from a complete count taken under similar conditions is called the sampling error of the estimate, or sampling variability. Approximate measures of sampling error accompany Labour Force Survey products and users are urged to make use of them while analysing the data.

Three related methods can be used to interpret and evaluate the precision of the estimates: the standard error, and two other methods also based on the standard error, namely confidence intervals and coefficients of variation.

### Interpretation using standard error

The sampling error, or standard error, is a measure that quantifies how different repeated sample estimates might be from one another. Using the same sampling plan, if a large number of samples were to be drawn from the same population, then about 68% of the samples would produce a sample estimate that is within one standard error of the census value and in about 95% of the samples the estimate would be within two standard errors of the census value. Although the concept of sampling error is based on the idea of selecting several samples, in practice only one sample is drawn and the standard error is estimated based on the information collected from the units in that sample.

The same principles apply to estimates of change, that is, the change between two estimates, such as month-to-month level changes. Approximately two-thirds (68%) of the time, a change greater than the sampling error indicates a real change. The larger the change compared to the standard error, the better the chance that we are observing a real change, as opposed to a change due to sampling variability. At the 95% confidence level, the change in the estimate must be greater than twice the sampling error to conclude that the change is real.

To illustrate, let us say that between two months, the published estimate for total employment increases by 40,000 and the associated standard error for the movement estimate is 28,800. Since the increase is larger than the standard error, the confidence is at least two out of three (68%) that the increase of 40,000 in employment is a real change. To reach a 95% confidence level, the standard error has to be doubled. Because the increase of 40,000 in employment is smaller than twice the standard error (57,600), it is impossible to state with a 95% confidence level that there was an increase in employment.
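The comparison in this illustration can be sketched in a few lines of Python. The function name and the returned labels are illustrative choices, not part of the survey methodology; the thresholds of one and two standard errors are the ones described above.

```python
def change_confidence(change, standard_error):
    """Classify a movement by how many standard errors it spans.

    Uses the rule of thumb from the text: more than one standard error
    corresponds to roughly 68% confidence, more than two to roughly 95%.
    """
    ratio = abs(change) / standard_error
    if ratio > 2:
        return "95%"
    if ratio > 1:
        return "68%"
    return "not significant"


# Example from the text: employment rises by 40,000, standard error 28,800.
print(change_confidence(40_000, 28_800))  # 68%
```

The increase of 40,000 exceeds one standard error (28,800) but not two (57,600), so the sketch reports the 68% level only.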

Movements in estimates that are smaller than the sampling error are less likely to reflect a real change and more likely to be due to sampling variability. While the above is true for monthly movements, one can have more confidence in a series of consecutive movements in the same direction, even though some of the monthly movements may be smaller than the sampling error.

### Interpretation using confidence intervals

Confidence intervals provide another way of looking at the variability inherent in estimates of sample surveys. To illustrate how to calculate the confidence interval, let us say that one month the published estimate for total employment rose by 16,000 to reach 17,800,000. The associated standard error for the movement estimate is 28,800. Using the standard error to build the confidence intervals, we can say that:

- There are approximately two chances in three (68%) that the real value of the movement between the two months falls within the range -12,800 to +44,800 (16,000 + or – one standard error).
- There are approximately nine chances in ten (90%) that the real value of the movement between the two months falls within the range -30,100 to +62,100 (16,000 + or – 1.6 times the standard error).
- There are approximately nineteen chances in twenty (95%) that the real value of the movement between the two months falls within the range -41,600 to +73,600 (16,000 + or – two standard errors).
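The three intervals above can be reproduced with a short Python sketch. The function name is an illustrative choice; the multipliers 1, 1.6 and 2 are the ones used in the text.

```python
def confidence_interval(estimate, standard_error, multiplier):
    """Return (lower, upper) bounds: estimate +/- multiplier * standard error."""
    half_width = multiplier * standard_error
    return estimate - half_width, estimate + half_width


# Movement of 16,000 with a standard error of 28,800, as in the text.
for label, k in (("68%", 1.0), ("90%", 1.6), ("95%", 2.0)):
    low, high = confidence_interval(16_000, 28_800, k)
    print(f"{label}: {low:+,.0f} to {high:+,.0f}")
```

The 90% bounds come out at -30,080 and +62,080 before rounding; the figures quoted above are rounded to the nearest hundred.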

### Interpretation using coefficient of variation

Sampling variability can also be expressed relative to the estimate itself. The standard error as a percentage of the estimate is called the coefficient of variation (CV) or the relative standard error. The CV is used to give an indication of the uncertainty associated with the estimates. For example, if the CV is 7%, then in 68% of the samples the census value will lie within plus or minus 7% (or one CV) of the estimate and in 95% of the samples the census value will lie within plus or minus 14% (or two times the CV) of the estimate.
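As a minimal sketch of how a CV translates back into a standard error and a range around an estimate, consider the following Python helpers. The function names and the estimate of 100,000 are hypothetical and chosen only to keep the arithmetic easy to follow.

```python
def standard_error_from_cv(estimate, cv_percent):
    """Standard error implied by a CV expressed as a percentage of the estimate."""
    return estimate * cv_percent / 100.0


def range_from_cv(estimate, cv_percent, multiplier=1):
    """Range of estimate +/- multiplier CVs (1 for about 68%, 2 for about 95%)."""
    half_width = multiplier * standard_error_from_cv(estimate, cv_percent)
    return estimate - half_width, estimate + half_width


# Hypothetical estimate of 100,000 with a CV of 7%.
print(range_from_cv(100_000, 7))      # about 68%: plus or minus 7%
print(range_from_cv(100_000, 7, 2))   # about 95%: plus or minus 14%
```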

Small CVs are desirable because they indicate that the sampling variability is small relative to the estimate. The CV depends on the size of the estimate, the sample size the estimate is based on, the distribution of the characteristic being measured in the sample, and the use of auxiliary information in the estimation procedure. The size of the estimate is important because the CV is the sampling error expressed as a percentage of the estimate; the smaller the estimate, the larger the CV (all other things being equal). For example, when the unemployment rate is high, the CV may be small. If the unemployment rate falls due to improved economic conditions, then the corresponding CV will become larger. Typically, of similar estimates, the one with larger sample size will yield the smaller CV. This is because the sampling error is smaller.

Also, estimates referring to characteristics that are more clustered will have a higher CV. For example, persons employed in forestry, fishing, mining, oil and gas extraction in Canada are more clustered geographically than employed women aged 55 and older in Quebec. The latter will have a smaller sampling variability although the estimates are of approximately the same size.

Finally, estimates referring to age and sex are usually more reliable than other similar estimates because the LFS sample is calibrated to post-censal population projections of various age and sex groupings. As an example, persons employed part-time in Saskatchewan will have a larger sampling variability than employed men aged 25 to 54 years in New Brunswick even though the estimates are of similar size.

### Variability of monthly estimates for Canada and the provinces

To look up an approximate measure of the CV of an estimate of a monthly total, please consult Table 7.1, which gives the size of the estimate as a function of the geography and the CV. The rows give the geographic area of the estimate, while the columns indicate the resulting level of accuracy in terms of the CV, given the size of the estimate. To determine the CV for an estimate of size X in area A, look across the row for area A and find the first estimate that is less than or equal to X. The title of that column gives the approximate CV. For example, to determine the sampling error for an estimate of 35.1 thousand unemployed in Newfoundland and Labrador in January 2015, we find the closest smaller estimate in that row, 26.9 thousand, which corresponds to a CV of 5%. Therefore, the estimate of 35,100 unemployed in Newfoundland and Labrador has a CV of roughly 5%.
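The lookup rule can be sketched as a small Python function. The row representation (pairs of CV and minimum estimate size, ordered by increasing CV) is an assumption about how the table might be encoded, and the row used in the example is made up for a hypothetical "area A"; it does not reproduce values from Table 7.1.

```python
def approximate_cv(estimate, row):
    """Return the CV of the first column whose threshold is <= the estimate.

    row: list of (cv_percent, minimum_estimate) pairs, ordered by increasing
    CV and therefore by decreasing estimate size.
    Returns None if the estimate is smaller than every threshold in the row.
    """
    for cv_percent, minimum_estimate in row:
        if minimum_estimate <= estimate:
            return cv_percent
    return None


# Made-up row for a hypothetical area A (estimates in thousands).
area_a_row = [(1.0, 400.0), (2.5, 100.0), (5.0, 25.0), (10.0, 6.0)]

print(approximate_cv(30.0, area_a_row))  # 5.0
```

An estimate of 30.0 thousand is smaller than the 1% and 2.5% thresholds, so the first threshold it reaches is 25.0, giving a CV of roughly 5%.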

Table 7.1 is supplied as a rough guide to the sampling variability. The sampling variability is modeled so that, given an estimate, approximately 75% of the actual CVs will be less than or equal to the CVs derived from the table. There will, however, be 25% of the actual CVs that will be somewhat higher than the ones given in the table.

The CV values given in Table 7.1 are derived from a model based on LFS sample data for the 48-month period from January 2011 through December 2014 inclusive. It is important to bear in mind that these values are approximations.

Table 7.1 can be used with either seasonally adjusted estimates, or with estimates that have not been seasonally adjusted. Studies have shown that LFS standard errors for seasonally adjusted data are close to those for unadjusted data, particularly when estimates are for larger populations and domains.

### Variability of annual estimates for Canada and the provinces

To look up an approximate measure of the CV of an estimate of an annual average, please consult Table 7.2, which gives the size of the estimate as a function of the geography and the CV. The rows give the geographic level of the estimate, while the columns indicate the resulting level of accuracy in terms of the CV, given the size of the estimate. To determine the CV for an estimate of size X in area A, look across the row for area A and find the first estimate that is less than or equal to X. The title of that column gives the approximate CV. For example, to determine the sampling error for an annual average estimate of 32.3 thousand unemployed in Newfoundland and Labrador in 2014, we find the closest smaller estimate in that row, 29.2 thousand, which corresponds to a CV of 2.5%. Therefore, the estimate of 32,300 unemployed in Newfoundland and Labrador has a CV of roughly 2.5%.

Table 7.2 is supplied as a rough guide to the sampling variability. The sampling variability is modeled so that, given an estimate, approximately 75% of the actual CVs will be less than or equal to the CVs derived from the table. There will, however, be 25% of the actual CVs that will be somewhat higher than the ones given in the table.

The CV values given in Table 7.2 are derived from a model based on LFS sample data for the 5-year period from 2010 to 2014. It is important to bear in mind that these values are approximations.

### Sampling variability tables for the territories

The CV values for three-month moving averages given in Table 7.3 for the Yukon, Northwest Territories and Nunavut are derived from a model based on LFS sample data for the 48-month period of January 2011 through December 2014 inclusive. The CV values for annual averages given in the same table are derived from a model based on LFS sample data for the 5-year period of 2010 to 2014.

For more accurate measures of variability, please contact Statistics Canada's Statistical Information Service (toll-free 1-800-263-1136; international 1-514-283-8300; infostats@statcan.gc.ca).

### Variability of rates

Estimates that are rates and percentages are subject to sampling variability that is related to the variability of the numerator and the denominator of the ratio. The various rates given are treated differently because some of the denominators are calibrated figures that have no sampling variability associated with them.

### Unemployment rate

The unemployment rate is the ratio of X, the total number of unemployed in a group, to Y, which is the total number of participants in the labour force in the same group. Here the group may be a province or CMA and/or it may be an age-sex group. For example, in January 2015, there were more than 35,000 unemployed persons in Newfoundland and Labrador and 260,300 participants in the labour force, giving an unemployment rate of 13.5%.

The CV for the unemployment rate can be estimated with the following formula:

$$[CV(X/Y)]^{2} = [CV(X)]^{2} + [CV(Y)]^{2} - 2p\,[CV(X)]\,[CV(Y)]$$

where CV(X) would be the CV for the total number of unemployed in a specific geographic or demographic subgroup and CV(Y) would be the CV for the total number of participants in the labour force in the same subgroup. The correlation coefficient, denoted p, measures the amount of linear association between X and Y (respectively, the number of unemployed and the number of participants in the labour force in the same subgroup). The value of p ranges between -1 and 1. For example, a strong positive linear association would indicate that unemployment counts generally increase as the total number of participants in the labour force increases. Note that we can expect a larger CV for the unemployment rate when p is negative, since in this case, the third term on the right side of the equation above becomes positive.

When p is not available, the most conservative approach is to take p = -1, which leads to the simplified formula:

CV(X/Y) = CV(X) + CV(Y)

Note that this will likely lead to an overestimation of the CV(X/Y).
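The simplification can be checked by substituting p = -1 into the general formula, which turns the right-hand side into a perfect square:

$$[CV(X/Y)]^{2} = [CV(X)]^{2} + [CV(Y)]^{2} + 2\,[CV(X)]\,[CV(Y)] = [CV(X) + CV(Y)]^{2}$$

Since CVs of totals are non-negative, taking the square root gives CV(X/Y) = CV(X) + CV(Y). For any larger value of p the third term is smaller, which is why the sum acts as an upper bound.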

In the previous example, the CVs of the monthly estimates for the unemployment count and the total number of participants in the labour force in Newfoundland and Labrador are respectively 5.0% and 1.0% from Table 7.1. An approximation of the CV for the unemployment rate of 13.5% using the above formula would be:

5.0% + 1.0% = 6.0%

Note that, in the case of this particular estimate, the above approximation is only slightly above the 5.9% CV that is estimated using complex computer-intensive variance estimation methods.
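A minimal Python sketch of the rate formula is shown below. The function name is illustrative, and the default of p = -1 reflects the conservative choice described above. CVs can be passed in percent or in decimals as long as both use the same unit.

```python
import math


def cv_of_ratio(cv_x, cv_y, p=-1.0):
    """Approximate CV of the ratio X/Y from the CVs of X and Y.

    p is the correlation between X and Y; p = -1 is the conservative
    default used when the correlation is not available.
    """
    squared = cv_x**2 + cv_y**2 - 2.0 * p * cv_x * cv_y
    # Guard against tiny negative values caused by floating-point rounding.
    return math.sqrt(max(squared, 0.0))


# Newfoundland and Labrador example from the text: CVs of 5.0% and 1.0%.
print(cv_of_ratio(5.0, 1.0))         # conservative value, p = -1
print(cv_of_ratio(5.0, 1.0, p=0.0))  # if X and Y were uncorrelated
```

With p = -1 the result is 6.0%, matching the simplified formula; any larger value of p gives a smaller CV.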

### Participation rate and employment rate

The participation rate represents the number of persons in the labour force expressed as a percentage of the total population size. The employment rate is the total number of employed divided by the total population size. For both the above rates, the numerator and the denominator represent the same geographic and demographic group.

For Canada, the provinces, CMAs and some age-sex groups, the LFS population estimates are not subject to sampling variability because they are calibrated to independent sources. Therefore, in the case of the participation rate and the employment rate of these geographic and demographic groups, the CV is equal to that of the contributing numerator.

Subgroups of Canada, the provinces and age-sex groups are called domains; for example, persons employed in agriculture in Manitoba are a domain. To determine the CV of rates in the case of domains, the variability of both the numerator and the denominator has to be taken into account because the denominator is no longer a controlled total and is subject to sampling variability. Therefore, for participation rates and employment rates of domains, the CV can be determined in the same way as for the unemployment rate. The totals in the numerator and denominator for the relevant rate should reflect the same domain or subgroup.

### Variability of estimates of change

The difference of estimates from two time periods gives an estimate of change that is also subject to sampling variability. An estimate of year-to-year or month-to-month change is based on two samples which may have some households in common. Hence, the CV of change depends on the CV of the estimates for both periods and on the correlation p between the periods.

The value of p ranges between -1 and 1, with 1 being the perfect positive linear association. One can generally use the sample overlap to approximate the correlation coefficient as follows:

- For the provinces: use p = 5/6 for month-to-month changes, and p = 0 for year-to-year changes.
- Empirical studies at Statistics Canada have shown that for the provinces, a value of p equal to 5/6 is a good approximation for estimates of employment, but for estimates of unemployment, a p of 0.45 would yield a better approximation for month-to-month changes.

While the CV and the standard error are related measures, the former is used to assess the variability of the estimate levels, and the latter is used to assess the variability of the difference between these estimates. The CV of the change between estimates, from which the standard error of the change is obtained, can be derived from the following formula:

$$CV(y_{2} - y_{1}) = \frac{\sqrt{[y_{1}\,CV(y_{1})]^{2} + [y_{2}\,CV(y_{2})]^{2} - 2p\,[y_{1}\,CV(y_{1})]\,[y_{2}\,CV(y_{2})]}}{y_{2} - y_{1}}$$

where $y_{1}$ and $y_{2}$ are the estimates for the two periods. The value of p is the correlation coefficient between $y_{1}$ and $y_{2}$.

When multiplying the CV obtained from this formula by the estimate change ($y_{2} - y_{1}$), we obtain the standard error (the CVs should be expressed in decimals for this calculation).

With the standard error, we can see which changes (differences between estimates) are statistically significant and which are not. If the standard error of $y_{2} - y_{1}$ is larger in magnitude than the value of $y_{2} - y_{1}$, then the latter is not statistically significant.

Note: For the change between estimates ($y_{2} - y_{1}$), the CVs can be very high and sometimes negative (which is expected when $y_{2} - y_{1}$ is negative). The quality of a negative CV is the same as that of an equal, but positive, CV value.

When comparing the annual averages of two years, the CV of the annual estimates (Table 7.2) should be used. For month-to-month change, seasonally adjusted estimates should be used in conjunction with the CVs of the monthly estimates from Table 7.1. Note that the above formula gives an approximate estimate of the sampling variability associated with an estimate of change.
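The calculation can be sketched in Python as follows. This sketch assumes the usual variance-of-a-difference relationship, in which the standard error of each estimate is the estimate multiplied by its CV (in decimals) and the covariance term is scaled by the correlation p. The function names and the example figures are hypothetical; the correlation constants are the approximations quoted above for the provinces.

```python
import math

# Approximate correlations quoted in the text for the provinces.
P_MONTHLY_EMPLOYMENT = 5 / 6
P_MONTHLY_UNEMPLOYMENT = 0.45
P_YEARLY = 0.0


def standard_error_of_change(y1, cv1, y2, cv2, p):
    """Standard error of y2 - y1; CVs are decimals (0.05 means 5%)."""
    se1 = y1 * cv1
    se2 = y2 * cv2
    variance = se1**2 + se2**2 - 2.0 * p * se1 * se2
    # Guard against tiny negative values caused by floating-point rounding.
    return math.sqrt(max(variance, 0.0))


def cv_of_change(y1, cv1, y2, cv2, p):
    """CV of the change; negative when y2 - y1 is negative, undefined if zero."""
    return standard_error_of_change(y1, cv1, y2, cv2, p) / (y2 - y1)


# Hypothetical figures: each estimate has a standard error of 10.
se = standard_error_of_change(100, 0.10, 200, 0.05, p=0.5)
change = 200 - 100
print(se, abs(change) > se)
```

In this made-up example the change of 100 is well above its standard error, so it would be treated as significant under the rule described above. A higher correlation between the two periods reduces the standard error of the change.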

### Guidelines on data reliability

Household surveys within Statistics Canada generally use the following guidelines and reliability categories in interpreting CV values for data accuracy and in the dissemination of statistical information.

Category 1 - If the CV is ≤ 16.5% - no release restrictions: data are of sufficient accuracy that no special warnings to users or other restrictions are required.

Category 2 - If the CV is > 16.5% and ≤ 33.3% - release with caveats: data are potentially useful for some purposes but should be accompanied by a warning to users regarding their accuracy.

Category 3 - If the CV is > 33.3% - not recommended for release: data contain a level of error that makes them so potentially misleading that they should not be released in most circumstances. If users insist on inclusion of Category 3 data in a non-standard product, even after being advised of their accuracy, the data should be accompanied by a disclaimer. The user should acknowledge the warnings given and undertake not to disseminate, present or report the data, directly or indirectly, without this disclaimer.
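The three categories can be expressed as a small Python function. The function name is illustrative. The absolute value is taken because, as noted for estimates of change, a negative CV has the same quality as a positive CV of equal size.

```python
def reliability_category(cv_percent):
    """Map a CV (in percent) to reliability category 1, 2 or 3."""
    cv = abs(cv_percent)
    if cv <= 16.5:
        return 1  # no release restrictions
    if cv <= 33.3:
        return 2  # release with caveats
    return 3      # not recommended for release


print(reliability_category(5.0))    # 1
print(reliability_category(20.0))   # 2
print(reliability_category(-40.0))  # 3
```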

## Confidentiality release criteria

Statistics Canada is prohibited by law from releasing any data which would divulge information obtained under the Statistics Act that relates to any identifiable person, business or organization without the prior knowledge or the consent in writing of that person, business or organization. Various confidentiality rules are applied to all data that are released or published to prevent the publication or disclosure of any information deemed confidential. If necessary, data are suppressed to prevent direct or residual disclosure of identifiable data.

The LFS produces a wide range of outputs that contain estimates for various labour force characteristics. Most of these outputs are estimates in the form of tabular cross-classifications. Estimates are rounded to the nearest hundred and a series of suppression rules are used so that any estimate below a minimum level is not released.

The LFS suppresses estimates below the levels presented in Table 7.4.
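The rounding and suppression steps can be sketched in Python as follows. This is only an illustration under stated assumptions: `minimum_level` is a hypothetical parameter standing in for the relevant Table 7.4 threshold, the value of 1,500 in the example is made up, and both the handling of exact halves (rounded upward here) and the choice to compare the unrounded estimate with the threshold are assumptions rather than documented LFS practice.

```python
import math


def publishable_estimate(estimate, minimum_level):
    """Return the estimate rounded to the nearest hundred, or None if suppressed.

    minimum_level is a hypothetical stand-in for the Table 7.4 threshold.
    Assumes a non-negative estimate and rounds exact halves upward.
    """
    if estimate < minimum_level:
        return None  # suppressed: below the minimum release level
    return int(math.floor((estimate + 50) / 100)) * 100


# Made-up threshold of 1,500 for illustration only.
print(publishable_estimate(35_149, 1_500))  # 35100
print(publishable_estimate(1_449, 1_500))   # None
```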