Data quality, concepts and methodology: Data quality

All data, from whatever source, are subject to error, and the Industrial Water Survey is no exception. There are two general categories of error in surveys. The first is sampling error, which arises because a sample, or subset, of the target population is used to represent the whole population. The size of the sampling error is quantifiable. The second category, non-sampling error, is not as easily quantified. It covers all the other kinds of error that arise in surveys, for example incomplete or inaccurate lists of the general population, respondent misinterpretation of questions, provision of erroneous information, failure to respond and information processing errors.

Typically, sampling error is measured by the expected variability of the estimate around the true value, expressed as a percentage of the estimate. This measure is referred to as the coefficient of variation (CV), also known as the relative standard error: the standard error of the estimate divided by the estimate itself. Coefficients of variation of the final estimates were computed for the Industrial Water Survey and are indicated on the statistical tables. The quality of the estimates was classified as follows:

A. Excellent: CV is 0.01% to 4.99%
B. Very good: CV is 5.00% to 9.99%
C. Good: CV is 10.00% to 14.99%
D. Acceptable: CV is 15.00% to 24.99%
E. Use caution: CV is 25.00% to 34.99%
F. Unreliable: CV is > 34.99% (data are suppressed)
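
The classification above can be sketched in a few lines of Python. This is an illustration only, not the code used for the survey. The function names are hypothetical, and two boundary choices are assumptions: a CV that falls between two published bounds (for example 4.995%) is placed in the lower band, and a CV below 0.01% is rated A.

```python
def coefficient_of_variation(estimate: float, standard_error: float) -> float:
    """Return the CV: the standard error as a percentage of the estimate."""
    if estimate == 0:
        raise ValueError("CV is undefined for an estimate of zero")
    return abs(standard_error / estimate) * 100.0


# Exclusive upper bounds (in percent) for ratings A to E, from the list above.
# Treating the bounds as exclusive is an assumption about rounding.
_GRADE_BOUNDS = [
    (5.0, "A"),   # Excellent
    (10.0, "B"),  # Very good
    (15.0, "C"),  # Good
    (25.0, "D"),  # Acceptable
    (35.0, "E"),  # Use caution
]


def quality_grade(cv_percent: float) -> str:
    """Map a CV in percent to its letter rating; 'F' means suppressed."""
    for upper_bound, grade in _GRADE_BOUNDS:
        if cv_percent < upper_bound:
            return grade
    return "F"
```

For instance, an estimate of 200 with a standard error of 8 has a CV of about 4%, which would be rated A under these assumptions.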

As mentioned in the previous section on "data collection and processing", every attempt was made to eliminate the non-sampling error through collection and data validation techniques.

Response rates

The response rate for the manufacturing and mining components of the survey was 70% in reference year 2005 while it was 88% for the thermal-electric component. The total water intake variable and the total water discharge variable were considered mandatory. Without these two variables, a record was considered to be a "total non-response" to the survey. At the end of the collection cycle, the sample was re-weighted to account for the "total non-response" units.

Error detection

Many factors affect the accuracy of data produced in a survey. For example, respondents may have made errors in interpreting questions, answers may have been incorrectly entered on the questionnaires, and errors may have been introduced during the data capture or tabulation process. Every effort was made to reduce the occurrence of such errors in the survey.

Returned data were first checked using an automated edit-check program (BLAISE) immediately after capture. This first procedure verifies that all mandatory cells have been filled in, that certain values lie within acceptable ranges, that questionnaire flow patterns have been respected, and that totals equal the sum of their components. Collection officers evaluate the edit failures and concentrate follow-up efforts accordingly.
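
As a rough sketch of what such edit checks involve, the Python function below tests a single record for missing mandatory fields, out-of-range values and totals that do not match their components. It does not reproduce the actual BLAISE edits: the function name, field names and tolerance are hypothetical, and questionnaire flow checks are left out for brevity.

```python
def edit_failures(record, mandatory, ranges, totals, tolerance=1e-6):
    """Return a list of edit-failure messages for one questionnaire record.

    record    -- dict of field name to value (None means not reported)
    mandatory -- field names that must be filled in
    ranges    -- dict of field name to (low, high) acceptable bounds
    totals    -- dict of total field name to list of component field names
    """
    failures = []

    # Check 1: every mandatory cell has been filled in.
    for field in mandatory:
        if record.get(field) is None:
            failures.append(f"missing mandatory field: {field}")

    # Check 2: reported values lie within their acceptable range.
    for field, (low, high) in ranges.items():
        value = record.get(field)
        if value is not None and not (low <= value <= high):
            failures.append(f"out of range: {field}")

    # Check 3: each total equals the sum of its components.
    for total_field, parts in totals.items():
        total = record.get(total_field)
        values = [record.get(part) for part in parts]
        if total is None or any(v is None for v in values):
            continue  # cannot compare when something is unreported
        if abs(total - sum(values)) > tolerance:
            failures.append(f"total mismatch: {total_field}")

    return failures
```

A record that returns a non-empty list would be passed to a collection officer for evaluation and possible follow-up.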

Further data checking is performed by subject matter officers, who compare historical data (if available) with returned data to determine whether differences between survey cycles are reasonable. If not, collection officers are asked to confirm the responses with the respondents. Subject matter officers also research companies (annual reports, websites, etc.) to verify the information submitted by respondents.

Imputation

Statistical imputation was used for partial-response records. Four methods of imputation were used for the Industrial Water Survey: deterministic imputation (only one possible value for the field to impute), imputation by ratio, donor imputation (using a "nearest neighbour" approach to find a valid record that is most similar to the record requiring imputation) and manual imputation. The criteria for ratio and donor imputation were various combinations of industry group and geographical location (province, region, or Canada).
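
To make the ratio and donor methods concrete, here is a minimal Python sketch. It assumes that the complete records passed in already belong to the same imputation class (for example the same industry group and province) and that similarity is measured on a single auxiliary variable. The survey documentation does not specify the matching variable or distance measure, so both are assumptions, as are the function names.

```python
def ratio_impute(target_aux, complete_records):
    """Impute a missing value as (class ratio) x (auxiliary value).

    complete_records -- list of (aux, y) pairs from records in the same
    imputation class that reported both the auxiliary variable and y.
    """
    sum_aux = sum(aux for aux, _ in complete_records)
    if sum_aux == 0:
        raise ValueError("cannot compute a ratio when the auxiliary total is zero")
    ratio = sum(y for _, y in complete_records) / sum_aux
    return ratio * target_aux


def nearest_neighbour_donor(target_aux, complete_records):
    """Return the (aux, y) pair whose auxiliary value is closest to the target.

    The donor's y value would then be copied to the record being imputed.
    """
    if not complete_records:
        raise ValueError("no donors available in this imputation class")
    return min(complete_records, key=lambda rec: abs(rec[0] - target_aux))
```

If no suitable donor exists within a narrow class, the class could be widened, for example from province to region to Canada, in line with the combinations of criteria described above.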

Estimation

The response values for sampled units were multiplied by a sampling weight in order to produce estimates for the entire surveyed population. The sampling weight was calculated using a number of factors, including the probability of the unit being selected in the sample. A raising factor (weight adjustment) was used in the estimation process to account for respondents who could not be contacted or who refused to complete the survey.
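
The weighting and non-response adjustment can be illustrated with a short Python sketch. It makes simplifying assumptions that are not stated in the survey documentation: the design weight is taken as given (for example the inverse of the selection probability), and the raising factor is computed within a single adjustment class as the total design weight of all sampled units divided by the total design weight of responding units. The dictionary keys and function names are hypothetical.

```python
def nonresponse_adjusted_weights(units):
    """Inflate respondents' design weights to cover total non-respondents.

    units -- list of dicts with keys 'design_weight' (float) and
    'responded' (bool), all from one adjustment class.
    Non-respondents receive a final weight of zero.
    """
    total_weight = sum(u["design_weight"] for u in units)
    respondent_weight = sum(u["design_weight"] for u in units if u["responded"])
    if respondent_weight == 0:
        raise ValueError("no respondents in this adjustment class")
    raising_factor = total_weight / respondent_weight
    return [
        u["design_weight"] * raising_factor if u["responded"] else 0.0
        for u in units
    ]


def estimate_total(units, variable):
    """Estimate the population total of `variable` from responding units."""
    weights = nonresponse_adjusted_weights(units)
    return sum(
        weight * unit[variable]
        for weight, unit in zip(weights, units)
        if unit["responded"]
    )
```

With this approach the adjusted weights of respondents sum to the same total as the original design weights of the full sample, so the responding units stand in for the units that did not respond.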

Quality evaluation

It has been almost ten years since this survey was last conducted. In addition to the extended lapse of time between survey years, the use of different industrial classification systems and the different sampling strategies between the survey years make historical comparisons difficult. A historical comparison of the 2005 results with the 1996 results was done mainly to see if the largest industrial and regional users were the same.
