Data quality, concepts and methodology: Data quality


Sampling error arises from the fact that the information obtained from a sample of the population is applied to the entire population. Since the Survey of Drinking Water Plants is a census, the sampling error is zero.

Data response error may be due to questionnaire design, the characteristics of a question, inability or unwillingness of the respondent to provide correct information, misinterpretation of the questions or conceptual problems. These errors are controlled through careful questionnaire design and testing and the use of simple concepts and consistency checks.

Processing errors may occur at various stages of processing such as data entry, editing and tabulation. Measures have been taken to minimize these errors.

Non-response errors result when plants refuse to answer, are unable to respond or are too late in reporting. Missing data items are imputed for partial non-responses (that is, when mandatory questions are answered and some other questions are left unanswered).

Total non-response (that is, when mandatory questions are left unanswered) is dealt with by adjusting the weights assigned to the responding units, such that one responding unit might also represent other non-responding units with similar characteristics (that is, province, drainage region, source water type, size of population served). The error in the estimates due to total non-response is called total non-response error. The pattern of total non-response, the estimation method, the number of respondents and the variability associated with each measured variable determine the total non-response error. If the total non-respondents are assumed to be randomly "selected" from the population, then the respondents may be treated statistically as a random sample. Under this assumption, a possible measure of total non-response error is the coefficient of variation (CV). It represents the variability of the estimate as a proportion of the estimate. Estimates are rated according to their CV as follows:

A: Excellent data quality (CV is 0.01% to 4.99%)
B: Very good data quality (CV is 5.00% to 9.99%)
C: Good data quality (CV is 10.00% to 14.99%)
D: Acceptable data quality (CV is 15.00% to 24.99%)
E: Use with caution (CV is 25.00% to 49.99%)
F: Too unreliable to be published (CV is >49.99%; data are suppressed)
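
The rating scheme above can be sketched in a few lines of Python. This is an illustration only, not the survey's production code: the function names are hypothetical, the CV is assumed to be the standard error expressed as a percentage of the estimate, and the CV is assumed to be rounded to two decimals before grading (the source does not say how a CV of exactly zero is rated; the sketch assigns it to A).

```python
def cv_percent(estimate, standard_error):
    """Coefficient of variation: standard error as a percentage of the estimate."""
    if estimate == 0:
        raise ValueError("CV is undefined for a zero estimate")
    return abs(100.0 * standard_error / estimate)


def quality_grade(cv):
    """Map a CV (in percent) to the letter ratings listed above."""
    cv = round(cv, 2)  # assumption: ratings are applied to CVs rounded to 2 decimals
    # Inclusive upper bound of each band, taken from the rating list.
    bands = [(4.99, "A"), (9.99, "B"), (14.99, "C"), (24.99, "D"), (49.99, "E")]
    for upper, grade in bands:
        if cv <= upper:
            return grade
    return "F"  # too unreliable to be published; data are suppressed


if __name__ == "__main__":
    cv = cv_percent(estimate=200.0, standard_error=10.0)
    print(cv, quality_grade(cv))
```

An estimate rated F under this scheme would be suppressed rather than published.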

Response rates

The response rate for the survey was 54% in reference year 2005, 56% in reference year 2006 and 56% in reference year 2007. Overall, 58% of the population responded in one or more of the reference years 2005, 2006 and 2007. Note: There was low response to the survey from drinking water plants in Nunavut. Nunavut is not included in the response rate, nor is it included in any of the data tables.

Error detection

Many factors affect the accuracy of data produced in a survey. For example, respondents may have made errors in interpreting questions, answers may have been incorrectly entered on the questionnaires, and errors may have been introduced during the data capture or tabulation process. Every effort was made to reduce the occurrence of such errors in the survey. Providing an electronic PDF questionnaire was intended to improve data quality in two ways. First, entry errors were reduced by building edits directly into the form. Second, PDF forms printed out by respondents improved the quality of the data captured by the imaging process.

Returned data are first entered and checked using capture and edit software. This procedure verifies that all mandatory cells have been filled in, that certain values lie within acceptable ranges, that questionnaire flow patterns have been respected, and that totals equal the sum of their components. Collection officers evaluate the edit failures and concentrate follow-up efforts accordingly. Phone follow-ups were performed to verify information in cases where edit checks had failed.
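
As a rough illustration of the kinds of edit checks described above, the sketch below tests one record for missing mandatory fields, out-of-range values and totals that do not equal the sum of their components. It is not the capture and edit software used for the survey; the function name, field names and limits are hypothetical, and questionnaire flow checks are left out for brevity.

```python
def run_edit_checks(record, mandatory, ranges, totals, tolerance=1e-6):
    """Return a list of edit-failure messages for one questionnaire record."""
    failures = []

    # 1. Every mandatory cell must be filled in.
    for field in mandatory:
        if record.get(field) is None:
            failures.append(f"missing mandatory field: {field}")

    # 2. Reported values must lie within acceptable ranges.
    for field, (low, high) in ranges.items():
        value = record.get(field)
        if value is not None and not (low <= value <= high):
            failures.append(f"out of range: {field}={value}")

    # 3. Totals must equal the sum of their components.
    for total_field, parts in totals.items():
        values = [record.get(p) for p in parts]
        total = record.get(total_field)
        if total is None or any(v is None for v in values):
            continue  # incomplete cells are reported by check 1, if mandatory
        if abs(total - sum(values)) > tolerance:
            failures.append(f"total mismatch: {total_field}={total}, parts sum to {sum(values)}")

    return failures


if __name__ == "__main__":
    # Hypothetical record: the total does not match its components.
    record = {"plant_id": 17, "surface_water": 60.0, "ground_water": 30.0, "total_water": 100.0}
    print(run_edit_checks(
        record,
        mandatory=["plant_id", "total_water"],
        ranges={"total_water": (0.0, 1e6)},
        totals={"total_water": ["surface_water", "ground_water"]},
    ))
```

Records that return a non-empty list would be the ones passed to collection officers for follow-up.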

Further data checking is performed by subject matter officers, who review returned data that have been statistically identified as outliers and compare returned data from 2005, 2006 and 2007 to determine whether differences between years are reasonable. In some instances, collection officers are asked to confirm responses with the respondents. Subject matter officers also research drinking water plants (annual reports, websites, etc.) in an effort to verify information submitted by respondents.

Imputation

Statistical imputation is used for records with incomplete questionnaire responses. Six methods of imputation were used for the Survey of Drinking Water Plants: deterministic imputation (only one possible value for the field to impute), imputation by linear regression, trend imputation, imputation by ratio, donor imputation (using a "nearest neighbour" approach to find a valid record that is most similar to the record requiring imputation in terms of treated water volume and other characteristics) and manual imputation. The criteria for ratio and donor imputation were various combinations of water treatment type, source water type and geographical location (province, region, or Canada). No imputation was conducted on water quality variables.
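
Two of the methods named above, ratio imputation and nearest-neighbour donor imputation, can be sketched as follows. This is a simplified illustration, not the survey's imputation system: the function and field names are hypothetical, the records are assumed to already belong to a single imputation class (for example, one treatment type and source water type within a province), and the donor search matches on a single variable only.

```python
def ratio_impute(records, target, auxiliary):
    """Fill missing `target` values with auxiliary * (sum of target / sum of auxiliary),
    where the sums are taken over records that reported `target`."""
    reported = [r for r in records if r[target] is not None]
    ratio = sum(r[target] for r in reported) / sum(r[auxiliary] for r in reported)
    return [
        dict(r, **{target: r[auxiliary] * ratio}) if r[target] is None else dict(r)
        for r in records
    ]


def nearest_neighbour_impute(records, target, match_on):
    """Fill missing `target` values with the value reported by the donor whose
    `match_on` value (e.g. treated water volume) is closest to the recipient's."""
    donors = [r for r in records if r[target] is not None]
    result = []
    for r in records:
        if r[target] is None:
            donor = min(donors, key=lambda d: abs(d[match_on] - r[match_on]))
            r = dict(r, **{target: donor[target]})
        result.append(dict(r))
    return result


if __name__ == "__main__":
    # Hypothetical plants: treated volume is known for all, cost is missing for one.
    plants = [
        {"id": 1, "volume": 100.0, "cost": 10.0},
        {"id": 2, "volume": 300.0, "cost": 30.0},
        {"id": 3, "volume": 120.0, "cost": None},
    ]
    print(ratio_impute(plants, "cost", "volume")[2])
    print(nearest_neighbour_impute(plants, "cost", "volume")[2])
```

Ratio imputation scales the recipient's own auxiliary value, while donor imputation copies a value actually reported by a similar plant; the two can therefore give different results for the same record.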

Estimation

In the estimation process, the response values are multiplied by a weight adjustment factor to account for plants in the population that could not be contacted or that refused to participate in the survey. No estimation was conducted on water quality variables.
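
One common way to build such a factor is a weighting-class adjustment, sketched below. The source does not give the exact formula, so this is an assumption: because the survey is a census, each plant is given an initial weight of 1, and within each class of similar plants the respondents' weight becomes the number of plants in the population divided by the number of respondents. The function names, the class key and the example data are hypothetical.

```python
from collections import defaultdict


def nonresponse_weights(population, respondents, class_of):
    """Weight per adjustment class = plants in population / responding plants."""
    pop_counts = defaultdict(int)
    resp_counts = defaultdict(int)
    for plant in population:
        pop_counts[class_of(plant)] += 1
    for plant in respondents:
        resp_counts[class_of(plant)] += 1
    # Classes with no respondents get no weight here; in practice they would
    # have to be merged with a similar class (not shown).
    return {c: pop_counts[c] / resp_counts[c] for c in resp_counts}


def weighted_total(respondents, weights, class_of, variable):
    """Estimate a population total as the sum of weight * reported value."""
    return sum(weights[class_of(p)] * p[variable] for p in respondents)


if __name__ == "__main__":
    # Hypothetical population of five plants, three of which responded.
    population = [
        {"id": 1, "province": "ON", "source": "surface", "volume": 10.0},
        {"id": 2, "province": "ON", "source": "surface", "volume": 30.0},
        {"id": 3, "province": "ON", "source": "surface", "volume": None},
        {"id": 4, "province": "ON", "source": "surface", "volume": None},
        {"id": 5, "province": "QC", "source": "ground", "volume": 5.0},
    ]
    respondents = [p for p in population if p["volume"] is not None]
    key = lambda p: (p["province"], p["source"])
    weights = nonresponse_weights(population, respondents, key)
    print(weights)
    print(weighted_total(respondents, weights, key, "volume"))
```

In this example each of the two responding Ontario plants stands in for two plants, so their reported volumes are doubled in the estimated total.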

Quality evaluation

This is the first time this survey has been conducted. In addition to analyzing individual responses for consistency within a questionnaire, both individual responses and weighted estimates of totals were compared to outside sources, where possible.
