Data Accuracy and Validation: Methods to ensure the quality of data - Transcript
(The Statistics Canada symbol and Canada wordmark appear on screen with the title: "Data Accuracy and Validation: Methods to ensure the quality of data")
Assessing the accuracy of data is an important part of the analytical process.
Accuracy is one of the six dimensions of data quality used at Statistics Canada. Accuracy refers to how well the data reflects the truth or what actually happened. In this video we will present methods to describe accuracy in terms of validity and correctness. We will also discuss methods to validate and check the accuracy of data values.
Steps of a data journey
(Diagram of the Steps of the data journey: Step 1 - Find, gather, protect; Step 2 - explore, clean, describe; Step 3 - analyze, model; Step 4 - tell the story. The data journey is supported by a foundation of stewardship, metadata, standards and quality.)
This diagram is a visual representation of the steps involved in turning data into knowledge.
Step 2: Explore, clean and describe
(Diagram of the Steps of the data journey with an emphasis on Step 2 - explore, clean, describe.)
Accurate data correctly describes the phenomena they were designed to measure or represent. Before we use data, we should explore it to learn about the variables and concepts, and also to discover whether there are errors, inconsistencies or gaps in the data. This video looks at ways to explore the accuracy of data.
What does it mean for data to be accurate?
What does it mean for data to be accurate? Accurate data is a reflection of reality. In other words, the data values are valid, so not blank or missing, and the values are within a valid range. Accurate data is also correct. First, let's look at the concept of valid data. One method for exploring the validity of data is to do what we call VIMO analysis. VIMO is an acronym for valid, invalid, missing and outlier data values.
(Table of values on screen listing household ID, their respective spending on food and total spending on housing. One of the table cells is occupied by the word "blue" and not a dollar amount.)
On the previous slide, we defined valid data as being not blank or missing, and within a valid range of values. Invalid data, on the other hand, has values that are impossible. An example would be a variable that should have a dollar amount, such as spending on housing, having the value blue. That makes no sense.
(Table of values on screen listing household ID, their respective spending on food and total spending on housing. One of the table cells is empty and does not contain a dollar amount.)
Missing values are where the variable is left blank. For example, we would expect either a 0 or a number for the value of total expenses.
(Table of values on screen listing the name of individuals, their respective occupation and age. One individual is listed as being 103 years old and another as being 301 years old.)
Outlier values are extremely small or extremely large compared to what we would expect. Some outlier values are actually true. For example, a person's age could be 103 years, although this is quite rare. Other times, outlier values are also invalid, such as the value of 301 for a living person's age in years.
One way to do VIMO analysis is to produce a frequency distribution of key variables and look at the proportion of valid, invalid, missing and outlier values. What proportion of valid values is acceptable? Is it 100%? Or something lower? Look at the range of values for key variables, ignoring the missing and invalid values for a moment: is the range and distribution of values realistic? Where the values are invalid or missing, is it easy to tell if they should actually be 0, or are they not applicable? Or should there be some other value? Another way to explore the validity of data is to use data visualization techniques, such as plotting the data on an axis. This is a straightforward way to quickly detect if there are patterns or anomalies in the data. There are software tools to detect outlier values and do data visualization. Remember that not all unusual values are necessarily wrong.
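The VIMO tally described above can be sketched in a few lines of code. This is a minimal illustration, not an official Statistics Canada method: the household spending values, the rule that amounts must be non-negative, and the outlier threshold are all hypothetical assumptions.

```python
def vimo_classify(value, lower=0, upper=100_000):
    """Classify one value as 'valid', 'invalid', 'missing' or 'outlier'.

    The bounds are illustrative: spending below `lower` is impossible,
    and spending above `upper` is treated as an outlier to investigate.
    """
    if value is None or value == "":
        return "missing"
    try:
        amount = float(value)
    except (TypeError, ValueError):
        return "invalid"          # e.g. the text value "blue" in a dollar field
    if amount < lower:
        return "invalid"          # impossible: negative spending
    if amount > upper:
        return "outlier"          # unusually large; may still be true
    return "valid"


def vimo_tally(values):
    """Frequency distribution of VIMO categories for one variable."""
    tally = {"valid": 0, "invalid": 0, "missing": 0, "outlier": 0}
    for value in values:
        tally[vimo_classify(value)] += 1
    return tally


# Hypothetical spending-on-housing values, including the "blue" example.
spending = [1200, 850, "blue", None, 950, 2_500_000]
print(vimo_tally(spending))  # → {'valid': 3, 'invalid': 1, 'missing': 1, 'outlier': 1}
```

Dividing each count by the total then gives the proportions of valid, invalid, missing and outlier values that the narration asks about.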
Example: Detecting invalid values
(Diagram of a bar chart presenting the number of footwear items sold online. The listed types of footwear are, from the left: Winter boots; Rubber boots; Sandals; Running shoes; Umbrellas.)
In this made-up example, we use a bar chart, which is a very simple data visualization method, to look at the frequency distribution of the types of footwear sold online. The heights of the bars look to be all within the same range. However, we notice on the horizontal axis that one of the bars is for umbrellas. You can't wear umbrellas on your feet. This is invalid. Further investigation is needed to figure out if the data in the bar actually represents some other type of footwear and the label umbrellas was erroneously assigned, or if somehow the count of umbrellas got into the chart of footwear sales by accident.
Example: Detecting missing values
(Table on screen presenting a data distribution for Apples (A), Oranges (O) and Bananas (B). The following columns represent the count values at 0 (A=0; O=0; B=1), 3 (A=1; O=0; B=0), 5 (A=0; O=2; B=0), 8 (A=0; O=0; B=2). The last column represents the count of missing values (A=5; O=7; B=6).)
In this example, we created a frequency distribution table of the values for three variables: apples, oranges and bananas. The column on the far right shows how many times values were missing for each of these three variables. Remember that missing values are not the same as values equal to 0. In this example, there are a lot of missing values relative to the number of non-missing values, so we would probably want to try to fill them in before using this data.
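A missing-value count like the one in the far-right column can be produced as follows. This is a minimal sketch: the records and the convention that `None` marks a missing entry (distinct from a genuine 0) are hypothetical assumptions.

```python
# Hypothetical records; None marks a missing value, which is NOT the same as 0.
records = [
    {"apples": 3, "oranges": None, "bananas": 0},
    {"apples": None, "oranges": 5, "bananas": None},
    {"apples": None, "oranges": None, "bananas": 8},
]


def missing_counts(records):
    """Count missing (None) entries for each variable across all records."""
    counts = {}
    for record in records:
        for variable, value in record.items():
            counts.setdefault(variable, 0)
            if value is None:
                counts[variable] += 1
    return counts
```

Comparing each count to the total number of records shows whether, as in the narration, the missing values are too numerous to ignore.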
Example: correcting missing values
(Text on screen: There are many missing values in this table. Some are easy to fill by adding or subtracting; others we cannot fill without making some assumptions or finding additional information.)
(Table on screen presenting data values for the same table presented in the previous slide, where the columns represent the Row, Apples, Oranges and Total fruit (TF). The values are as listed: Row 1 (A=3; O=5; TF=-); Row 2 (A=-; O=5; TF=8); Row 3 (A=-; O=-; TF=0); Row 4 (A=-; O=-; TF=8).)
Following through with the missing values detected on the previous slide, here we see how we could correct them in this table of actual data values. We see where the missing values are. In the first row, it's easy to see that if we have three apples and five oranges, the missing value for the total number of fruit should be 8. Similarly, it's not hard to determine that the missing number of apples in the 2nd row is 3. However, in the 3rd row, the 0 could be correct, in which case the missing values for apples and oranges should also be 0. However, if the 0 total is wrong, then we don't know what the value of any of the three variables should be. In the 4th row, if the total is indeed 8, then we do not have enough information to know what the values for apples and oranges should be. We only know that they're between zero and eight.
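The arithmetic fill described for rows 1 and 2 can be sketched as a small function. This is an illustration under the same assumptions as the slide: each row has "apples", "oranges" and "total" fields with total = apples + oranges, and `None` marks a missing value.

```python
def fill_row(row):
    """Fill exactly one missing value when the other two determine it.

    Assumes row has keys "apples", "oranges" and "total", where
    total = apples + oranges and None marks a missing value.
    """
    a, o, t = row["apples"], row["oranges"], row["total"]
    missing = [key for key, value in row.items() if value is None]
    if missing == ["total"]:
        row["total"] = a + o          # row 1: 3 + 5 = 8
    elif missing == ["apples"]:
        row["apples"] = t - o         # row 2: 8 - 5 = 3
    elif missing == ["oranges"]:
        row["oranges"] = t - a
    # With two or more missing values (rows 3 and 4), nothing can be
    # deduced without extra assumptions, so the row is left unchanged.
    return row
```

As the narration stresses, rows 3 and 4 are deliberately left alone: filling them would require additional information, not arithmetic.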
Example: detecting outlier values
(Scatter plot on screen with random dots where all but one red dot are approximately aligned. Two trend lines are added to represent said linearity.)
(Text on screen: This value (red dot) is further from all the other data values than we would expect.)
In this made-up example, the data points, represented by the green and red dots, have been plotted on a horizontal and vertical axis. Two different methods have been used to estimate the central tendency of the data values; those are represented by the red and blue lines. Most of the data values fall on or near both of the fitted lines. However, the red point is way off the lines. It's an outlier value. Further investigation is needed to determine what makes this data point so different and what should be done with it. Some outlier values are correct even though they are unusual.
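One way to turn "way off the lines" into a rule is to fit a line and flag points whose residuals are large relative to the typical residual. This is a minimal sketch with hypothetical (x, y) points; a simple least-squares fit and a median-based cutoff stand in for whatever estimation methods the slide's two lines used.

```python
from statistics import median


def fit_line(points):
    """Ordinary least-squares slope and intercept for (x, y) pairs."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxy = sum((x - mx) * (y - my) for x, y in points)
    sxx = sum((x - mx) ** 2 for x, _ in points)
    slope = sxy / sxx
    return slope, my - slope * mx


def flag_outliers(points, factor=3.0):
    """Flag points whose vertical distance from the fitted line is more
    than `factor` times the median distance (an illustrative cutoff)."""
    slope, intercept = fit_line(points)
    residuals = [abs(y - (slope * x + intercept)) for x, y in points]
    cutoff = factor * median(residuals)
    return [p for p, r in zip(points, residuals) if r > cutoff]


# Nine points near the line y = x, plus one point far off it.
points = [(x, x) for x in range(1, 10)] + [(10, 30)]
print(flag_outliers(points))  # → [(10, 30)]
```

As the narration notes, a flagged point is a candidate for investigation, not automatically an error.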
Exploring the correctness of data
(Text on screen: Micro-data: for example, a list of people with their occupation and date of birth. Macro-data: less detailed, like zooming out with a camera. For example, macro-data produced from a list of people with their occupation and date of birth could be counts of people by age categories and by occupational groups. Micro-data is more granular than macro-data, at a more detailed level.)
We said earlier that accurate data is both valid and correct. We looked at VIMO analysis as a way to explore the validity of data. Now let's focus on the correctness of data. But first, we need to differentiate between looking at individual data values, or micro-data, and looking at those values summarized up to a higher level, or macro-data. Micro-data is more granular than macro-data; it is at a more detailed level.
Exploring correctness of data
(Text on screen: Example 2: a 12-year-old has a Master's degree in biology, is married and is employed by the University of Manitoba. Does this make sense?)
One method to explore the correctness of data is to compare it to other related information. We could look at the reasonableness of values across a single data record. Are there variables that should make sense together? For example, if there is a total and the parts that make up that total, is the sum correct? Another example is to look at a person's current age and compare that to the highest level of education attained, or marital status, or employment status. Does it make sense?
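These within-record checks can be sketched as a set of simple rules. This is an illustration only: the field names and the age thresholds below are hypothetical assumptions, not official edit rules.

```python
def consistency_errors(record):
    """Return a list of rules this record violates (empty = consistent).

    The rules are illustrative: the age cutoffs are assumptions chosen
    to catch clearly implausible combinations, like the slide's
    12-year-old with a Master's degree.
    """
    errors = []
    age = record.get("age")
    if age is not None:
        if record.get("education") == "Master's degree" and age < 18:
            errors.append("age is implausibly low for a Master's degree")
        if record.get("marital_status") == "married" and age < 16:
            errors.append("age is implausibly low for a married person")
    # Total-vs-parts check: does the sum of the parts match the total?
    parts, total = record.get("parts"), record.get("total")
    if parts is not None and total is not None and sum(parts) != total:
        errors.append("parts do not sum to the total")
    return errors


suspect = {"age": 12, "education": "Master's degree", "marital_status": "married"}
print(consistency_errors(suspect))  # two rules violated
```

Each violated rule is a signal for follow-up, since at least one of the conflicting values must be wrong.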
We could also look for consistency with standards. For example, in Canada, the first letter of the postal code is determined by which province the address is in: all postal codes in Newfoundland and Labrador start with A, all postal codes in Nova Scotia start with B, and so on. If this is not the case, then one of the pieces of information is incorrect.
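The postal-code check can be sketched like this. The mapping below is deliberately partial and illustrative (it covers only a few provinces, starting from the two named in the narration); a real edit would use the complete official mapping.

```python
# Partial, illustrative mapping of province to allowed first letters of
# the postal code; a production check would use the full official list.
FIRST_LETTER_BY_PROVINCE = {
    "Newfoundland and Labrador": {"A"},
    "Nova Scotia": {"B"},
    "Prince Edward Island": {"C"},
    "New Brunswick": {"E"},
}


def postal_code_matches_province(postal_code, province):
    """True/False when the first letter is (in)consistent with the
    province; None when the province is outside our partial mapping."""
    letters = FIRST_LETTER_BY_PROVINCE.get(province)
    if letters is None:
        return None  # cannot check: province not in the partial mapping
    return postal_code[:1].upper() in letters
```

A `False` result does not say which field is wrong, only that the postal code and the province cannot both be correct.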
(Text on screen: To answer these questions it is necessary to have reliable "facts" about the real world.)
Yet another way to explore correctness is to compare what's in the data with what's happening in the real world. You could calculate summary statistics such as totals and averages for car sales across Canada and compare across provinces or through time. Do the numbers make sense? Does the auto industry track these numbers, and how do your numbers compare to theirs?
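A simple through-time comparison might look like the sketch below. The monthly car-sales totals and the doubling/halving threshold are hypothetical assumptions chosen just to show the idea of flagging implausible movements for follow-up.

```python
def flag_suspicious_changes(monthly_totals, max_ratio=2.0):
    """Flag months whose total jumps or drops by more than `max_ratio`
    relative to the previous month (an illustrative plausibility rule).

    `monthly_totals` is a list of (month_label, total) pairs in order.
    """
    flagged = []
    for (prev_month, prev_total), (month, total) in zip(monthly_totals,
                                                        monthly_totals[1:]):
        ratio = total / prev_total if prev_total > 0 else float("inf")
        if ratio > max_ratio or ratio < 1 / max_ratio:
            flagged.append(month)
    return flagged


# Hypothetical monthly car-sales totals: March looks implausibly high.
sales = [("Jan", 100), ("Feb", 110), ("Mar", 500), ("Apr", 105)]
print(flag_suspicious_changes(sales))  # → ['Mar', 'Apr']
```

A flagged month is then compared against an external source, such as industry figures, to decide whether the value is wrong or the real world really did change that much.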
Tips for exploring correctness of data: Part 1
Here are some tips to make the comparisons easier. Before trying to compare data values, put them into a common format. The 12th of June 2018 will look different if the month is listed first in one case and the day is listed first in another. As well as using standard formats, use standard abbreviations, concepts and definitions to the extent possible. For example, in Canada we have a standard two letter code for the names of all the provinces and territories.
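Putting dates into a common format before comparing them can be sketched as follows. The two input styles (month-first numeric, and day-first with a spelled-out month) are illustrative assumptions; real data may need more formats, and ambiguous values like 06/12 vs 12/06 must be resolved from documentation, not guessed.

```python
from datetime import datetime


def to_iso(date_string):
    """Parse a date written in one of two assumed styles into the common
    ISO 8601 format (YYYY-MM-DD) so values can be compared directly."""
    for fmt in ("%m/%d/%Y",     # month listed first, e.g. "06/12/2018"
                "%d %B %Y"):    # day listed first,   e.g. "12 June 2018"
        try:
            return datetime.strptime(date_string, fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"unrecognized date format: {date_string!r}")


# Both spellings of the 12th of June 2018 normalize to the same value.
print(to_iso("06/12/2018"), to_iso("12 June 2018"))  # 2018-06-12 2018-06-12
```

Once every date is in the same format, equality and ordering checks become trivial, which is the point of the tip above.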
Tips for exploring correctness of data: Part 2
Using data visualization is a great way to spot anomalies in data. Before you get started, think about what level of incorrectness you can tolerate in the data: what's adequate for your purpose? Once you find discrepancies, use automation to correct errors in an efficient, consistent and objective manner.
Describing accuracy of data
(Text on screen: Document Clearly: The level of accuracy in terms of validity and correctness of the data once you have finished exploring and cleaning the data. This documentation could be of interest to: Those who will use the data and to those who will be responsible for exploring, cleaning and describing other similar data.)
Before using the data or passing it to stakeholders who will use the data, be sure to describe the accuracy of the data. The documentation describing the data is sometimes referred to as metadata. Document the methods you used to explore the validity and correctness of the data, as well as the methods you used to clean or improve the data. This is what users of the data need to know so they can use it responsibly.
Recap of key points
This video presented the basic concepts of accuracy and data validation. VIMO analysis recommends the use of frequency distributions of key variables to assess the proportion of valid, invalid, missing and outlier values. Data visualization techniques, the use of common formats, and automation help to ensure efficient, correct results. In addition, clear documentation is essential to gain insight into the methods used to explore and validate the data.
(The Canada Wordmark appears.)