Data quality toolkit

The objective of this toolkit is to raise awareness about
data quality assurance practices.

1. Context

Question: what do you get when you combine "data" and "quality"?

Answer: the Data quality toolkit!

Well, technically, you just get "data quality", but in the context of this webpage you can consider the toolkit something of an extremely useful bonus.

But, talk of bonuses aside, a question remains: what, exactly, is data quality?

To answer this question, we must first answer two more:

  • What are data?
  • What is quality?

Data are made up of numbers, letters and symbols. When organized into sets, phrases, or patterns, data become information. We use information to identify needs, measure impacts and inform decisions. If the data underlying that information are incorrect in some respect, then our conclusions and decisions could also be wrong or misleading.

Quality is gauged in terms of the attributes discussed in the following section, which vary depending on the point of view from which quality is assessed.

With respect to the data producer, measures of quality include reproducibility of the process, timeliness and punctuality in the delivery of data and metadata, willingness and availability to support users of the data, and perception of authority and trustworthiness. With respect to the data and metadata themselves, quality measures include relevance and usefulness, coverage, granularity, accuracy and reliability, and standardization or conformance.

All of this leads us to data quality, a concept shaped by the two just outlined, which in turn provides two ways to determine whether data are likely to be correct:

  • describe what was done during the gathering and processing of the data to ensure that the data are correct
  • observe measurable characteristics of the data

Following good data quality assurance practices does not guarantee that the data are correct, but it does reduce the likelihood of errors. Completing a data quality assessment is a way of measuring the extent to which the data are protected against errors, and sharing that assessment with data users gives them confidence in the quality of the data.

2. Quality attributes

Quality attributes related to the data producer

Quality assurance practices: The extent to which targeted and documented quality assurance practices were followed in the gathering and processing of the data, both through commitment of the data producer at an organizational level and implementation of monitoring and reporting practices at the working level.

Reproducibility of the process: The extent to which the data production process is reproducible or repeatable. Examples of non-reproducible processes would be ad-hoc processes or instances where intermediate steps or data files were not archived and cannot be recreated.

Timeliness and punctuality: Timeliness refers to the delay between the end of the reference period to which the data pertain, and when they are available to users. Ideally this delay is brief, and the data and metadata should be available at the same time. Punctuality refers to how reliably the data and metadata are available at the expected time, as scheduled or promised by the data producer.
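
As a rough illustration of the distinction, the sketch below computes both measures in days. The function names and the dates are hypothetical and are not taken from the toolkit; a data producer might equally measure these in weeks or months.

```python
from datetime import date


def timeliness_days(reference_period_end: date, released_on: date) -> int:
    """Delay between the end of the reference period and the release."""
    return (released_on - reference_period_end).days


def punctuality_days(scheduled_release: date, released_on: date) -> int:
    """Days late (positive) or early (negative) against the announced date."""
    return (released_on - scheduled_release).days


# Hypothetical release: data for March 2024, promised for 15 May,
# actually published on 17 May.
reference_end = date(2024, 3, 31)
scheduled = date(2024, 5, 15)
released = date(2024, 5, 17)

print("timeliness (days):", timeliness_days(reference_end, released))
print("punctuality (days):", punctuality_days(scheduled, released))
```

A release can be timely but not punctual (a short delay that still misses the promised date), or punctual but not timely (a long delay that was announced in advance and respected).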

Contactability: The willingness and accessibility of the data producer to discuss the data with potential users, and even to facilitate usage of the data.

Viability: The extent to which one can expect the data producer to continue producing these data for a reasonable length of time into the future.

Perception of authority, impartiality and trustworthiness: The extent to which the data producer is perceived as authoritative on the subject matter of the data, is immune to undue influence of its stakeholders or other external bodies and is worthy of trust.

Security: The extent to which data security is protected in all holdings and transmissions, and access to data during production is restricted to only those with appropriate training and authority. In particular, access is granted on a "need to know" basis.

Quality attributes related to the data and metadata

Relevance and usefulness: The extent to which the data pertain to the desired phenomenon. Data would be considered less relevant if they are too old, or do not include information about topics of interest. Usefulness of metadata refers to the extent to which it describes the data in terms of methods, concepts, limitations, assumptions made, and quality assurance practices followed.

Coverage: The extent to which the data represent the entire desired phenomenon. This could be assessed in terms of temporal or geographic coverage, or coverage of population units (e.g., people, households, businesses). Coverage is sometimes referred to as completeness (particularly when referring to metadata).

Granularity: Granularity refers to the unit or level of a single record in the dataset. For example, a highly granular dataset could contain records of people, medical procedures or lakes, while a less granular dataset could contain records aggregated to the level of a province or a year. The more granular or local a dataset, the greater its perceived value, balanced by a greater need to protect the data from unauthorized disclosure. It is usually straightforward to aggregate or roll up from granular data to a less granular level, but rolling down from an aggregate level is not usually possible.
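
The one-way nature of aggregation can be sketched as follows. The records, field names and the roll_up helper are hypothetical and serve only to illustrate the idea.

```python
from collections import defaultdict

# Hypothetical granular records: one row per business.
records = [
    {"province": "ON", "year": 2023, "revenue": 120.0},
    {"province": "ON", "year": 2023, "revenue": 80.0},
    {"province": "MB", "year": 2023, "revenue": 50.0},
]


def roll_up(rows, keys, value):
    """Sum `value` over all rows sharing the same combination of `keys`."""
    totals = defaultdict(float)
    for row in rows:
        totals[tuple(row[k] for k in keys)] += row[value]
    return dict(totals)


by_province = roll_up(records, ["province", "year"], "revenue")
print(by_province)
```

Once only the provincial totals are kept, the individual business values cannot be recovered from them, which is why rolling down is not usually possible.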

Accuracy and reliability: Accuracy refers to the extent to which the data correctly describe the phenomenon they are supposed to measure. Reliability is the extent to which the data are accurate consistently over time. Accuracy is often decomposed into precision, which measures how similar repeated measurements of the same thing are, and bias, which measures any systematic departure from reality in the data. Other factors contributing to accuracy and reliability are validity, the extent to which variables in the dataset have values that correspond to expected outcomes, and consistency, the extent to which the data are free of contradiction.
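
A minimal sketch of the precision and bias decomposition, assuming a known reference value is available for comparison (in practice it rarely is, and bias must be estimated by other means). The readings and function names are hypothetical; precision is summarized here by the sample standard deviation, which is one common choice among several.

```python
from statistics import mean, stdev


def bias(measurements, true_value):
    """Systematic departure: average measurement minus the reference value."""
    return mean(measurements) - true_value


def spread(measurements):
    """Precision summarized as the sample standard deviation (smaller is more precise)."""
    return stdev(measurements)


# Hypothetical repeated measurements of a quantity whose reference value is 10.0.
readings = [10.2, 10.4, 10.3, 10.5, 10.1]

print("bias:", round(bias(readings, 10.0), 3))
print("spread:", round(spread(readings), 3))
```

In this example the readings are close to one another (good precision) but all sit above the reference value (positive bias), so they are precise without being accurate.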

Standardization or conformance: The extent to which the data and metadata follow recognized standards in terms of formats and naming conventions, and conform to recognized dissemination standards such as SDMX for statistical products. Other aspects of standardization and conformance are the use of industry-standard software and file formats, and controlled vocabulary for data values where appropriate.

Protection of sensitive information: Unless consent has been explicitly given, it is not acceptable to disclose sensitive information in datasets made available to users beyond those granted specific access. Sensitive information includes, but is not limited to, identifiers that would associate granular data to a person, household or business, or sufficient detail in aggregate data such that one could deduce attributes of a person, household or business. There are various methods for protecting data against disclosure of sensitive information, depending on the nature and granularity of the data. Examples include suppression of sensitive information and introduction of random disturbance to data values. Many disclosure control algorithms provide diagnostics of the level of protection achieved.
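
As one illustration of suppression, the sketch below hides counts in an aggregate table that fall below a minimum cell size. The threshold of 5, the table and the function name are assumptions made for the example. Real disclosure control also needs complementary suppression, so that hidden cells cannot be recovered from row or column totals, and that is not shown here.

```python
def suppress_small_cells(table, threshold=5):
    """Replace counts below `threshold` with None (primary suppression only)."""
    return {
        cell: (count if count >= threshold else None)
        for cell, count in table.items()
    }


# Hypothetical counts of businesses by region.
counts = {"Region A": 12, "Region B": 3, "Region C": 5}

protected = suppress_small_cells(counts)
suppressed_share = sum(v is None for v in protected.values()) / len(protected)

print(protected)
print("share of cells suppressed:", round(suppressed_share, 2))
```

The share of suppressed cells is a very simple diagnostic of how much information was withheld; it says nothing about the level of protection actually achieved.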

Combinability or linkability: The extent to which it is possible to integrate two or more sources of data. For example, unique identifiers such as social insurance number (SIN), business number or health insurance number can be matched directly, while somewhat unique identifiers such as combinations of name, sex, date of birth and address can be linked using statistical matching algorithms based on probabilities. The success of integrating datasets is improved when the concept of what is represented by a single record from each dataset is well aligned.
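
To give a flavour of probability-based linkage, the sketch below scores a pair of records in the style of the Fellegi-Sunter approach, which is one well-known method and not necessarily the one a given producer uses. Each field contributes a positive weight when the two records agree and a negative weight when they disagree. The m-probabilities (chance of agreement for a true match) and u-probabilities (chance of agreement for a non-match) are invented for illustration; in practice they are estimated from the data.

```python
import math

# Hypothetical (m, u) probabilities for each comparison field.
FIELD_PARAMS = {
    "name": (0.90, 0.05),
    "date_of_birth": (0.95, 0.01),
    "sex": (0.98, 0.50),
}


def match_weight(record_a, record_b, params=FIELD_PARAMS):
    """Sum of log2 agreement or disagreement weights over the comparison fields."""
    weight = 0.0
    for field, (m, u) in params.items():
        if record_a.get(field) == record_b.get(field):
            weight += math.log2(m / u)               # agreement supports a match
        else:
            weight += math.log2((1 - m) / (1 - u))   # disagreement counts against it
    return weight


a = {"name": "lee", "date_of_birth": "1980-01-01", "sex": "F"}
b = {"name": "lee", "date_of_birth": "1980-01-01", "sex": "F"}
c = {"name": "roy", "date_of_birth": "1975-06-30", "sex": "M"}

print("a vs b:", round(match_weight(a, b), 2))
print("a vs c:", round(match_weight(a, c), 2))
```

Pairs with a high total weight are treated as likely matches and pairs with a low weight as likely non-matches; the cut-off values are a design decision that depends on the tolerance for false links and missed links.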

Accessibility: The ease with which users can obtain and use the data and metadata. Highly accessible data and metadata have relevant and appropriate labels, keywords and tags so that they are discoverable electronically; are in commonly used formats and software; are downloadable or available through transparent or navigable processes. Accessibility also involves reducing barriers to access, including cost.

Processability and understandability: The ease with which users can manipulate, interpret, explore, analyze, or otherwise use the data and metadata. An important component of this is the extent to which metadata and other support from the data producer lead to correct use of the data, for example through the inclusion of appropriate quality indicators.

Perception of reliability and credibility: The extent to which the data are perceived to be reliable and the metadata are perceived to be credible.

3. Data quality assurance practices

This is a set of good practices that can be followed by any organization producing data. Data producers can adapt these practices to their own environment, and are encouraged to document the data quality assurance practices that they follow and to share that documentation with their data users. Knowing what data quality assurance practices were followed in the production of data builds confidence that the data themselves are of good quality. These quality assurance practices are a subset of those found in Statistics Canada's Quality Assurance Framework and Quality Guidelines.

Data quality assurance practices for producing registers and databases

  • Use known unique identifiers (SIN, Business Number, health card number, …), with appropriate protection of sensitive information
  • Use check-digits on known unique identifiers to ensure valid values
  • Use drop-down menus, look-up tables or reference lists for variables that should have a fixed codeset
  • Use recognized standard formats wherever possible, e.g., ISO 8601 for dates (YYYYMMDD) and times (HH:MM), standard province abbreviations (ON, MB, etc.)
  • Include built-in edits to alert when outliers or unexpected entries are made
  • Validate aggregated or tabulated data against other sources
  • Use a logical, documented naming convention for variables and files
  • Document inclusion and exclusion rules, procedures to be followed and quality checks
  • Produce output datasets at regular, predictable intervals (for example, the last day of every month or the last day of the year)
  • Define and implement a strategy for back-up, storage and retention
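
Several of the practices above lend themselves to automated checks at data entry. The sketch below combines three of them: a check-digit test, a fixed codeset and a date format check. It assumes the identifier uses the Luhn check-digit scheme, which should be confirmed with the body that issues the identifier; the province list is an illustrative subset, and the record layout and function names are hypothetical.

```python
from datetime import datetime

# Illustrative subset of a fixed codeset; a real list would come from a reference table.
PROVINCE_CODES = {"ON", "MB", "QC", "BC"}


def luhn_valid(identifier: str) -> bool:
    """Luhn check; non-digit characters such as spaces are ignored."""
    digits = [int(ch) for ch in identifier if ch.isdigit()]
    if len(digits) < 2:
        return False
    total = 0
    # Walk from the rightmost digit and double every second digit.
    for position, digit in enumerate(reversed(digits)):
        if position % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0


def iso_basic_date_valid(text: str) -> bool:
    """True if `text` is a real calendar date written as YYYYMMDD."""
    if len(text) != 8 or not text.isdigit():
        return False
    try:
        datetime.strptime(text, "%Y%m%d")
    except ValueError:
        return False
    return True


def record_problems(record: dict) -> list:
    """Return a list of human-readable problems found in one record."""
    problems = []
    if not luhn_valid(record["identifier"]):
        problems.append("identifier fails check digit")
    if record["province"] not in PROVINCE_CODES:
        problems.append("province not in codeset")
    if not iso_basic_date_valid(record["date"]):
        problems.append("date is not a valid YYYYMMDD value")
    return problems


good = {"identifier": "79927398713", "province": "ON", "date": "20240229"}
bad = {"identifier": "79927398710", "province": "XX", "date": "20240230"}

print(record_problems(good))
print(record_problems(bad))
```

Returning a list of problems, instead of rejecting the record outright, lets the entry system alert the operator while leaving the final decision to a person.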

Data quality assurance practices for survey data (sample or census)

  • Use statistically sound methods for sampling, weighting and estimation
  • Ensure all methods are documented and reproducible
  • Ensure the survey frame is as up to date, complete and accurate as possible
  • Document frame and sample coverage with respect to time period, geographic coverage and population units
  • Test questionnaire flow and interpretability
  • Choose a collection method appropriate for the target population and the subject matter, given cost considerations and other factors
  • Use a quality control technique such as Statistical Process Control to ensure that collected data are accurate
  • Make at least one attempt to contact every sampled unit, and track contact attempts
  • Use editing resources efficiently and effectively; in other words, make data fit for purpose, not perfect
  • Validate aggregated or tabulated data against other sources
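
The Statistical Process Control practice mentioned above can be sketched with simple control limits. This is a simplified illustration that sets limits at the baseline mean plus or minus three sample standard deviations; classical control charts often estimate the spread differently (for example, from moving ranges). The error rates and function names are hypothetical.

```python
from statistics import mean, stdev


def control_limits(baseline, sigmas=3.0):
    """Lower and upper limits derived from a period known to be in control."""
    centre = mean(baseline)
    spread = stdev(baseline)
    return centre - sigmas * spread, centre + sigmas * spread


def out_of_control(values, limits):
    """Values falling outside the control limits."""
    lower, upper = limits
    return [v for v in values if v < lower or v > upper]


# Hypothetical keying error rates (percent) per batch of collected questionnaires.
baseline_rates = [2.0, 2.2, 1.8, 2.1, 1.9]
new_rates = [2.1, 2.9, 1.7]

limits = control_limits(baseline_rates)
flagged = out_of_control(new_rates, limits)

print("limits:", tuple(round(x, 3) for x in limits))
print("flagged:", flagged)
```

A flagged batch is a signal to investigate the collection process, not proof of an error; the batch may reflect a genuine change in the population being surveyed.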

Data quality assurance practices for producing scanned data, satellite data or meter data

Data quality assurance practices for combining data from different sources

  • Ensure that definitions align for: concepts; populations of interest; units of observation; reference periods
  • Report all data sources and what contribution they make to the final product
  • Analyze non-matching or leftover data to see why they did not match
  • Ensure all methods are documented and reproducible
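
Analyzing leftover records starts with separating them from the matched ones. The sketch below links two sources on a shared key and keeps the unmatched records from each side for review. It assumes the key is unique within the second source; the datasets, field names and function name are hypothetical.

```python
def link_on_key(left, right, key):
    """Return (matched, left_only, right_only) for two lists of dict records."""
    right_index = {row[key]: row for row in right}  # assumes unique keys in `right`
    matched, left_only = [], []
    for row in left:
        partner = right_index.get(row[key])
        if partner is None:
            left_only.append(row)
        else:
            matched.append({**row, **partner})
    left_keys = {row[key] for row in left}
    right_only = [row for row in right if row[key] not in left_keys]
    return matched, left_only, right_only


# Hypothetical sources sharing a business identifier.
register = [{"bn": "A1", "name": "Acme"}, {"bn": "B2", "name": "Borealis"}]
survey = [{"bn": "A1", "employees": 40}, {"bn": "C3", "employees": 7}]

matched, register_only, survey_only = link_on_key(register, survey, "bn")

print("matched:", matched)
print("register only:", register_only)
print("survey only:", survey_only)
```

Reporting the counts of matched and unmatched records from each source is one simple way to document the contribution each source makes to the final product.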

Data quality assurance practices for metadata (documentation)

  • Include documentation needs in project planning and resource allocation
  • Document as you go; don't leave it all to the end
  • Use templates and standard naming conventions
  • Describe all concepts: the population covered by the data; any limitations or exceptions in the data; the reference period
  • Describe all methods used in sampling, data collection, data entry, editing, combining data from various sources, tabulation
  • Describe data security measures
  • Describe quality assurance practices followed
  • Describe measures to protect against the disclosure of sensitive information
  • Provide summary statistics on key variables (mean, median, mode, range, set of valid values)
  • Provide a data dictionary or controlled vocabulary set for variables, where appropriate
  • Use recognized standard formats wherever possible, e.g., ISO 8601 for dates (YYYYMMDD) and times (HH:MM), standard province abbreviations (ON, MB, etc.)
  • Make documentation available to data users
  • Use relevant and appropriate labels, keywords and tags so that the data and associated metadata are discoverable electronically
  • Track and document updates and revisions
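
The summary statistics suggested above can be generated automatically and published alongside the data. The sketch below is one minimal way to do this for a numeric variable; the variable, its values and the function name are hypothetical.

```python
from statistics import mean, median, multimode


def summarize(values):
    """Basic descriptive statistics for one numeric variable."""
    return {
        "count": len(values),
        "mean": mean(values),
        "median": median(values),
        "mode": multimode(values),  # list, since there may be more than one mode
        "min": min(values),
        "max": max(values),
    }


# Hypothetical variable: household size.
household_size = [1, 2, 2, 3, 7]

summary = summarize(household_size)
for name, value in summary.items():
    print(f"{name}: {value}")
```

For a categorical variable, the set of valid values and the frequency of each would serve the same purpose.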

Data quality assurance practices for data security, accessibility and protecting against the disclosure of sensitive information

  • In the data production process, restrict access to only those who have appropriate training and authority and a defined need to access the information ("need to know")
  • In the data production process, protect security of data in all holdings and all transmissions through encryption and other techniques
  • Adopt the "single source of truth" strategy for minimizing duplication of information and effort, in part through efficient database structures
  • Use standard formats for names, dates, addresses, and other commonly used variables (international, regional or national standards where appropriate; for example, ISO 8601 for dates and times: YYYYMMDD HH:MM using the 24-hour clock)
  • Use standard software and file formats for files made available to other users
  • Plan and prepare to share datasets at the lowest possible level of granularity (detail)
  • Do regular backups
  • Define and implement a retention and storage strategy
  • Protect against the disclosure of sensitive information (the identity or attributes of any person or business), by masking values and other techniques
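
Two simple ways of hiding an identifier are sketched below, under the assumption that "masking" here means obscuring all or part of a value. The first keeps only the last few characters visible. The second replaces the identifier with a keyed hash so that records can still be linked without exposing the original value. The function names and the key are hypothetical, and a real key would be generated randomly and stored securely.

```python
import hashlib
import hmac


def mask_identifier(value, visible=3, mask_char="*"):
    """Hide all but the last `visible` characters of an identifier."""
    text = str(value)
    if visible <= 0:
        return mask_char * len(text)
    hidden = max(len(text) - visible, 0)
    return mask_char * hidden + text[hidden:]


def pseudonymize(value, secret_key: bytes) -> str:
    """Keyed SHA-256 hash: the same input and key always give the same token."""
    return hmac.new(secret_key, str(value).encode("utf-8"), hashlib.sha256).hexdigest()


# Hypothetical key for illustration only.
key = b"example-key-not-for-real-use"

print(mask_identifier("123456789"))
print(pseudonymize("123456789", key)[:16], "...")
```

Neither technique protects against disclosure through the remaining attributes of a record, so masking is normally combined with the other measures listed above.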

4. Checklists

Thank you for reading the toolkit! We want to make it better for you. Please take a moment to let us know which parts of it you find useful, what's missing, and how we can make it better. We're also happy to answer your questions. Please send an email to the Statistics Canada Quality Secretariat.
