Data quality toolkit

Release date: September 27, 2017

The objective of this toolkit is to raise awareness about
data quality assurance practices.

Context
Quality attributes
Data quality assurance practices
Checklists

1. Context

Question: what do you get when you combine "data" and "quality"?

Answer: the Data quality toolkit!

Well, technically, you just get "data quality", but in the context of this webpage you can consider the toolkit something of an extremely useful bonus.

But, talk of bonuses aside, a question remains: what, exactly, is data quality?

To answer this question, we must first answer two more:

What are data?
What is quality?

Data are made up of numbers, letters and symbols. When organized into sets, phrases, or patterns, data become information. We use information to identify needs, measure impacts and inform decisions. If the data underlying that information are incorrect in some respect, then our conclusions and decisions could also be wrong or misleading.

Quality is gauged in terms of various attributes, discussed in the following section, which vary depending on the point of view from which quality is assessed.

With respect to the data producer, measures of quality include reproducibility of the process, timeliness and punctuality in the delivery of data and metadata, willingness and availability to support users of the data, and perception of authority and trustworthiness. With respect to the actual data and metadata, quality measures include relevance and usefulness, coverage, granularity, accuracy and reliability, and standardization or conformance.

All of this leads us to data quality, a concept shaped by the two outlined above, which in turn provides two ways to determine whether data are likely to be correct:

describe what was done during the gathering and processing of the data to ensure that the data are correct
observe measurable characteristics of the data

Following good data quality assurance practices does not guarantee that the data are correct, but it does reduce the likelihood of errors. Completing a data quality assessment is a way of measuring the extent to which the data are protected against errors, and sharing that assessment with data users gives them confidence in the quality of the data.

2. Quality attributes

Quality attributes related to the data producer^{Definition of data producer}

Quality assurance practices^{Examples of quality assurance practices}: The extent to which targeted and documented quality assurance practices were followed in the gathering and processing of the data, both through commitment of the data producer at an organizational level and implementation of monitoring and reporting practices at the working level.

Reproducibility of the process: The extent to which the data production process is reproducible or repeatable. Examples of non-reproducible processes would be ad-hoc processes or instances where intermediate steps or data files were not archived and cannot be recreated.

Timeliness and punctuality: Timeliness refers to the delay between the end of the reference period to which the data pertain, and when they are available to users. Ideally this delay is brief, and the data and metadata should be available at the same time. Punctuality refers to how reliably the data and metadata are available at the expected time, as scheduled or promised by the data producer.
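
As a rough illustration, both measures can be expressed as simple date differences. The sketch below is only one way to do this; the function names and the release dates are hypothetical and are not taken from any Statistics Canada system.

```python
from datetime import date


def timeliness_days(reference_period_end: date, released: date) -> int:
    """Days between the end of the reference period and the actual release."""
    return (released - reference_period_end).days


def punctuality_days(scheduled: date, released: date) -> int:
    """Days late relative to the announced release date (negative means early)."""
    return (released - scheduled).days


# Hypothetical release: data for the period ending 30 June,
# promised for 15 August and actually published on 18 August.
period_end = date(2017, 6, 30)
scheduled = date(2017, 8, 15)
released = date(2017, 8, 18)

print("timeliness (days):", timeliness_days(period_end, released))
print("punctuality (days late):", punctuality_days(scheduled, released))
```

Tracking both numbers over several releases shows whether the delay is shrinking and whether the published schedule is being kept.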

Contactability: The willingness and accessibility of the data producer to discuss the data with potential users, and even to facilitate usage of the data.

Viability: The extent to which one can expect the data producer to continue producing these data for a reasonable length of time into the future.

Perception of authority, impartiality and trustworthiness: The extent to which the data producer is perceived as authoritative on the subject matter of the data, is immune to undue influence of its stakeholders or other external bodies and is worthy of trust.

Security: The extent to which data security is protected in all holdings and transmissions, and access to data during production is restricted to only those with appropriate training and authority. In particular, access is granted on a "need to know" basis.

Quality attributes related to the data and metadata^{Definition of metadata}

Relevance and usefulness: The extent to which the data pertain to the desired phenomenon. Data would be considered less relevant if they are too old, or do not include information about topics of interest. Usefulness of metadata refers to the extent to which it describes the data in terms of methods, concepts, limitations, assumptions made, and quality assurance practices followed.

Coverage: The extent to which the data represent the entire desired phenomenon. This could be assessed in terms of temporal or geographic coverage, or coverage of population units (e.g., people, households, businesses). Coverage is sometimes referred to as completeness (particularly when referring to metadata).

Granularity: Granularity refers to the unit or level of a single record in the dataset. For example, a highly granular dataset could contain records of people, medical procedures or lakes, while a less granular dataset could contain records aggregated to the level of a province or a year. The more granular or local a dataset, the greater its perceived value, balanced by a greater need to protect the data from unauthorized disclosure. It is usually straightforward to aggregate or roll up from granular data to a less granular level, but rolling down from an aggregate level is not usually possible.
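
To make the one-way nature of aggregation concrete, here is a minimal sketch of rolling up granular records. The records, field names and the roll_up helper are all invented for illustration.

```python
from collections import defaultdict

# Hypothetical granular records: one row per medical procedure.
records = [
    {"province": "ON", "year": 2016, "cost": 120.0},
    {"province": "ON", "year": 2016, "cost": 80.0},
    {"province": "MB", "year": 2016, "cost": 200.0},
    {"province": "ON", "year": 2017, "cost": 50.0},
]


def roll_up(rows, keys, value):
    """Sum `value` over all rows that share the same combination of `keys`."""
    totals = defaultdict(float)
    for row in rows:
        totals[tuple(row[k] for k in keys)] += row[value]
    return dict(totals)


by_province_year = roll_up(records, ["province", "year"], "cost")
by_province = roll_up(records, ["province"], "cost")

print(by_province_year)
print(by_province)
```

Once only the provincial totals are kept, the individual procedure costs cannot be recovered from them, which is why rolling down is not usually possible.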

Accuracy and reliability: Accuracy refers to the extent to which the data correctly describe the phenomenon they are supposed to measure. Reliability is the extent to which the data are accurate consistently over time. Accuracy is often decomposed into precision, which measures how similar repeated measurements of the same thing are, and bias, which measures any systematic departure from reality in the data. Other factors contributing to accuracy and reliability are validity, the extent to which variables in the dataset have values that correspond to expected outcomes, and consistency, the extent to which the data are free of contradiction.
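
The distinction between precision and bias can be sketched with a few repeated measurements of a quantity whose true value is known. The sketch assumes bias is estimated as the mean departure from the true value and precision as the sample standard deviation; the readings are made up.

```python
from statistics import mean, stdev


def bias(measurements, true_value):
    """Average systematic departure of the measurements from the true value."""
    return mean(measurements) - true_value


def precision_spread(measurements):
    """Sample standard deviation: smaller values mean more precise measurements."""
    return stdev(measurements)


# Hypothetical repeated readings of something whose true value is 10.0.
readings = [10.2, 10.1, 10.3, 10.2]
true_value = 10.0

print("bias:", round(bias(readings, true_value), 3))
print("spread:", round(precision_spread(readings), 3))
```

In this example the readings agree closely with one another (good precision) yet all sit above the true value (a positive bias), so precise data are not necessarily accurate.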

Standardization or conformance: The extent to which the data and metadata follow recognized standards in terms of formats and naming conventions, and conform to recognized dissemination standards such as SDMX for statistical products. Other aspects of standardization and conformance are the use of industry-standard software and file formats, and controlled vocabulary for data values where appropriate.

Protection of sensitive information: Unless consent has been explicitly given, it is not acceptable to disclose sensitive information in datasets made available to users beyond those granted specific access. Sensitive information includes, but is not limited to, identifiers that would associate granular data to a person, household or business, or sufficient detail in aggregate data such that one could deduce attributes of a person, household or business. There are various methods for protecting data against disclosure of sensitive information, depending on the nature and granularity of the data. Examples include suppression of sensitive information and introduction of random disturbance to data values. Many disclosure control algorithms provide diagnostics of the level of protection achieved.
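
As an example of suppression, the sketch below hides small counts in an aggregate table. The threshold of 5 and the table contents are hypothetical, and the sketch covers primary suppression only: in practice other cells may also need to be suppressed so that hidden values cannot be recovered from row or column totals.

```python
def suppress_small_cells(table, threshold=5):
    """Replace counts below `threshold` with None so they are not published.

    `threshold` is an illustrative value, not an official rule.
    """
    return {
        cell: (count if count >= threshold else None)
        for cell, count in table.items()
    }


# Hypothetical counts of businesses by province and industry.
counts = {
    ("ON", "farming"): 132,
    ("MB", "farming"): 3,
    ("ON", "mining"): 5,
}

published = suppress_small_cells(counts)
suppressed = [cell for cell, count in published.items() if count is None]
print(published)
print("cells suppressed:", len(suppressed))
```

The number of suppressed cells is a simple diagnostic of how much information was withheld to achieve the protection.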

Combinability or linkability: The extent to which it is possible to integrate two or more sources of data. For example, unique identifiers such as social insurance number (SIN), business number or health insurance number can be matched directly, while somewhat unique identifiers such as combinations of name, sex, date of birth and address can be linked using statistical matching algorithms based on probabilities. The success of integrating datasets is improved when the concept of what is represented by a single record from each dataset is well aligned.

Accessibility: The ease with which users can obtain and use the data and metadata. Highly accessible data and metadata have relevant and appropriate labels, keywords and tags so that they are discoverable electronically; are in commonly used formats and software; are downloadable or available through transparent or navigable processes. Accessibility also involves reducing barriers to access, including cost.

Processability and understandability: The ease with which users can manipulate, interpret, explore, analyze, or otherwise use the data and metadata. An important component of this is the extent to which metadata and other support from the data producer lead to correct use of the data, for example through the inclusion of appropriate quality indicators.

Perception of reliability and credibility: The extent to which the data are perceived to be reliable and the metadata are perceived to be credible.

3. Data quality assurance practices

This is a set of good practices that can be followed by any organization producing data. Data producers can adapt these practices to their own environment, and are encouraged to document the data quality assurance practices that they follow and to share that documentation with their data users. Knowing what data quality assurance practices were followed in the production of data builds confidence that the data themselves are of good quality. These quality assurance practices are a subset of those found in Statistics Canada's Quality Assurance Framework and Quality Guidelines.

Data quality assurance practices for producing registers and databases

Use known unique identifiers (SIN, Business Number, health card number, …), with appropriate protection of sensitive information
Use check-digits on known unique identifiers to ensure valid values
Use drop-down menus, look-up tables or reference lists for variables that should have a fixed codeset^{Definition of codeset}
Use recognized standard formats wherever possible, e.g., ISO 8601 for dates (YYYYMMDD) and time (HH:MM), standard province abbreviations (ON, MB, etc.)
Include built-in edits to alert when outliers or unexpected entries are made
Validate aggregated or tabulated data against other sources
Use a logical, documented naming convention for variables and files
Document inclusion and exclusion rules, procedures to be followed and quality checks
Produce output datasets at regular, predictable intervals (the last day of every month, the last day of the year)
Define and implement a strategy for back-up, storage and retention
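
Several of the practices above (check-digits, fixed codesets and standard date formats) can be combined into edits applied at data entry. The sketch below assumes the identifier uses the Luhn check-digit scheme; the province list is an illustrative subset, the record layout is hypothetical, and the identifiers are test values chosen to pass or fail the check.

```python
from datetime import datetime

# Illustrative subset of province abbreviations, not the full reference list.
PROVINCE_CODES = {"BC", "MB", "ON", "QC"}


def luhn_valid(identifier: str) -> bool:
    """Return True if the digits of `identifier` pass the Luhn check."""
    digits = [int(ch) for ch in identifier if ch.isascii() and ch.isdigit()]
    if not digits:
        return False
    total = 0
    # Walk from the rightmost digit, doubling every second one.
    for position, digit in enumerate(reversed(digits)):
        if position % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0


def valid_basic_date(value: str) -> bool:
    """Return True if `value` is a real calendar date written as YYYYMMDD."""
    if len(value) != 8 or not (value.isascii() and value.isdigit()):
        return False
    try:
        datetime.strptime(value, "%Y%m%d")
    except ValueError:
        return False
    return True


def check_record(record: dict) -> list:
    """Collect the names of the fields that fail an entry-time edit."""
    problems = []
    if not luhn_valid(record["identifier"]):
        problems.append("identifier")
    if record["province"] not in PROVINCE_CODES:
        problems.append("province")
    if not valid_basic_date(record["date"]):
        problems.append("date")
    return problems


good = {"identifier": "046 454 286", "province": "ON", "date": "20170927"}
bad = {"identifier": "046 454 287", "province": "Ont.", "date": "20171301"}
print(check_record(good))
print(check_record(bad))
```

Returning the list of failing fields, rather than a single pass or fail flag, lets the entry screen alert the operator to exactly what needs correcting.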

Data quality assurance practices for survey data (sample or census)

Use statistically sound methods for sampling, weighting and estimation
Ensure all methods are documented and reproducible
Ensure the survey frame is as up to date, complete and accurate as possible
Document frame and sample coverage with respect to time period, geographic coverage and population units
Test questionnaire flow and interpretability
Choose a collection method appropriate for the target population and the subject matter, given cost considerations and other factors
Use a quality control technique such as Statistical Process Control^{Definition of Statistical Process Control} to ensure that collected data are accurate
Make at least one attempt to contact every sampled unit, and track contact attempts
Use editing resources efficiently and effectively; in other words, make data fit for purpose^{Definition of fit for purpose}, not perfect
Validate aggregated or tabulated data against other sources
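
To give a flavour of Statistical Process Control, the sketch below derives control limits from a baseline period and flags later batches that fall outside them. It is a simplified version that uses the sample standard deviation and three-sigma limits; the error rates are invented and the function names are hypothetical.

```python
from statistics import mean, stdev


def control_limits(baseline, sigmas=3.0):
    """Lower and upper limits: baseline mean plus or minus `sigmas` standard deviations."""
    centre = mean(baseline)
    spread = stdev(baseline)
    return centre - sigmas * spread, centre + sigmas * spread


def out_of_control(values, limits):
    """Return (index, value) pairs for observations outside the limits."""
    lower, upper = limits
    return [(i, v) for i, v in enumerate(values) if v < lower or v > upper]


# Hypothetical keying error rates (percent) per collection batch.
baseline = [2.0, 2.2, 1.9, 2.1, 2.0, 1.8, 2.0]
new_batches = [2.1, 1.9, 3.5, 2.0]

limits = control_limits(baseline)
flagged = out_of_control(new_batches, limits)
print("limits:", tuple(round(x, 2) for x in limits))
print("batches to investigate:", flagged)
```

A flagged batch is a prompt to investigate the collection process, not proof that the data in that batch are wrong.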

Data quality assurance practices for producing scanned data, satellite data or meter data

Use a quality control technique such as Acceptance Sampling^{Definition of acceptance sampling} or "ground truthing"^{Definition of ground truthing} to ensure that data are accurate
Ensure that changes are conveyed to all users, e.g., when a new UPC is introduced for a product
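
Acceptance Sampling can be sketched as a single sampling plan: inspect n records from a batch and accept the batch if at most c errors are found. The plan parameters below are hypothetical, and the probability calculation assumes errors occur independently (a binomial model).

```python
from math import comb


def accept_batch(sample_errors: int, c: int) -> bool:
    """Accept the batch when the sample contains at most `c` errors."""
    return sample_errors <= c


def acceptance_probability(n: int, c: int, error_rate: float) -> float:
    """Chance that a batch with the given true error rate passes the plan."""
    return sum(
        comb(n, k) * error_rate**k * (1 - error_rate) ** (n - k)
        for k in range(c + 1)
    )


# Hypothetical plan: inspect 50 scanned records, tolerate up to 2 errors.
n, c = 50, 2
for rate in (0.01, 0.05, 0.10):
    print(f"true error rate {rate:.0%}: "
          f"P(accept) = {acceptance_probability(n, c, rate):.3f}")
print("batch with 3 errors in the sample accepted?", accept_batch(3, c))
```

Comparing the acceptance probability across error rates shows how well a given plan separates good batches from bad ones before it is put into use.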

Data quality assurance practices for combining data from different sources

Ensure that definitions align for: concepts; populations of interest; units of observation; reference periods
Report all data sources and what contribution they make to the final product
Analyze non-matching or leftover data to see why they did not match
Ensure all methods are documented and reproducible
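
When two sources share a unique identifier, direct matching and the analysis of leftover records can be sketched as follows. The sources, field names and the link_on_key helper are hypothetical, and the sketch assumes the identifier is unique within each source.

```python
def link_on_key(left, right, key):
    """Match two lists of records on a shared unique identifier.

    Returns (matched, left_only, right_only). Assumes `key` is unique
    within each source; duplicates would need to be resolved first.
    """
    left_by_key = {row[key]: row for row in left}
    right_by_key = {row[key]: row for row in right}
    matched = [
        {**left_by_key[k], **right_by_key[k]}
        for k in left_by_key
        if k in right_by_key
    ]
    left_only = [row for k, row in left_by_key.items() if k not in right_by_key]
    right_only = [row for k, row in right_by_key.items() if k not in left_by_key]
    return matched, left_only, right_only


# Hypothetical sources keyed on a made-up business identifier.
register = [
    {"business_id": "A1", "province": "ON"},
    {"business_id": "B2", "province": "MB"},
]
survey = [
    {"business_id": "B2", "revenue": 125000},
    {"business_id": "C3", "revenue": 98000},
]

matched, register_only, survey_only = link_on_key(register, survey, "business_id")
print(len(matched), "matched")
print("unmatched in register:", [row["business_id"] for row in register_only])
print("unmatched in survey:", [row["business_id"] for row in survey_only])
```

Keeping the two leftover lists, rather than silently dropping them, is what makes it possible to examine why those records did not match.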

Data quality assurance practices for metadata (documentation)

Include documentation needs in project planning and resource allocation
Document as you go; don't leave it all to the end
Use templates and standard naming conventions
Describe all concepts: the population covered by the data; any limitations or exceptions in the data; the reference period
Describe all methods used in sampling, data collection, data entry, editing, combining data from various sources, tabulation
Describe data security measures
Describe quality assurance practices followed
Describe measures to protect against the disclosure of sensitive information
Provide summary statistics on key variables (mean, median, mode, range, set of valid values)
Provide a data dictionary or controlled vocabulary set for variables, where appropriate
Use recognized standard formats wherever possible, e.g., ISO 8601 for dates (YYYYMMDD) and time (HH:MM), standard province abbreviations (ON, MB, etc.)
Make documentation available to data users
Use relevant and appropriate labels, keywords and tags so that the data and associated metadata are discoverable electronically
Track and document updates and revisions
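
The summary statistics suggested above can be produced with a few lines of code and included in the documentation. The sketch below uses Python's standard statistics module; the variables and their values are made up.

```python
from statistics import mean, median, multimode


def summarize(values):
    """Summary statistics for a numeric variable."""
    return {
        "count": len(values),
        "mean": mean(values),
        "median": median(values),
        "mode": multimode(values),  # a list, since several values can tie
        "range": (min(values), max(values)),
    }


def valid_values(values):
    """Sorted set of distinct values observed for a coded variable."""
    return sorted(set(values))


# Hypothetical key variables from a small dataset.
ages = [34, 41, 41, 29, 55]
provinces = ["ON", "MB", "ON", "QC"]

print(summarize(ages))
print(valid_values(provinces))
```

Publishing the observed set of values next to the documented codeset also gives users a quick way to spot unexpected codes.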

Data quality assurance practices for data security, accessibility and protecting against the disclosure of sensitive information

In the data production process, restrict access to only those who have appropriate training and authority and a defined need to access the information ("need to know")
In the data production process, protect security of data in all holdings and all transmissions through encryption and other techniques
Adopt the "single source of truth" strategy for minimizing duplication of information and effort, in part through efficient database structures
Use standard formats for names, dates, addresses, and other commonly used variables (international, regional or national standards where appropriate, for example ISO 8601 for dates and time: YYYYMMDD HH:MM using the 24-hour clock)
Use standard software and file formats for files made available to other users
Plan and prepare to share datasets at the lowest possible level of granularity (detail)
Do regular backups
Define and implement a retention and storage strategy
Protect against the disclosure of sensitive information (the identity or attributes of any person or business), by masking values^{Definition of masking values} and other techniques

4. Checklists

Data producer quality self-assessment checklist
Data user quality assessment checklist

Thank you for reading the toolkit! We want to make it better for you. Please take a moment to let us know which parts of it you find useful, what's missing and how we can make it better. We're also happy to answer your questions. Please send an email to the Statistics Canada Quality Secretariat.
