Statistics 101: Statistical Bias

Catalogue number: 892000062022005

Release date: October 17, 2022

In this video, we will explain the concept of statistical bias, which occurs when statistics differ systematically from the reality they are trying to measure because of problems with the way the data were produced.

Data journey step

Foundation

Data competency

Data analysis
Data quality evaluation
Identifying problems using data

Audience

Basic

Suggested prerequisites

N/A

Length

10:38

Cost

Free

Watch the video

Statistics 101: Statistical Bias - Transcript

(The Statistics Canada symbol and Canada wordmark appear on screen with the title: "Statistical Bias")

Statistics 101 Statistical Bias

In everyday language, bias refers to how a person's point of view, values or beliefs can influence their judgement or decisions in particular circumstances.

Learning Goals

Before we talk about bias, we will begin with a few words about error. Statistics are measurements that describe our society, economic activity, or other aspects of the world around us. While statistics try to estimate the true value as accurately as possible, they often contain a certain level of error. Statistical bias is the difference between the statistical measure and the true value.

In this video, you will learn the answers to the following questions:

What are some of the different types of error?
What are some of the types of error that lead to statistical bias?
And where can errors that lead to statistical bias occur throughout the data journey?

Steps in the data journey

(Diagram of the Steps of the data journey: Step 1 - define, find, gather; Step 2 - explore, clean, describe; Step 3 - analyze, model; Step 4 - tell the story. The data journey is supported by a foundation of stewardship, metadata, standards and quality.)

This diagram is a visual representation of the data journey from collecting the data to exploring, cleaning, describing and understanding the data, analyzing the data and lastly to communicating with others the story the data tell.

Errors leading to statistical bias can occur at any step throughout the data journey.

What are the different types of error?

When trying to measure and analyze data, some level of error is to be expected, but what exactly do we mean when we say there are different types of error? To accept that errors exist is not necessarily a bad thing, but it is important to understand that not all errors are equal. Two main types of error we will learn about today are random error and systematic error.

Random vs Systematic Error

Random errors introduce variability between separate measurements of the same thing. For example, responses or measurements taken at different times can result in response variability, and drawing another random sample can result in sampling variability. Randomness can also occur in data processing procedures. Nevertheless, in these cases the measurements still tend to cluster around the true value. Therefore, despite some error, they are still accurate.

On the other hand, systematic errors result in non-random variability that skews or pulls the measurement away from the true value. The resulting measurement may be smaller, bigger, higher or lower than the true value and can lead to incorrect conclusions.
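The contrast between the two types of error can be sketched with a small simulation. This is only an illustration under stated assumptions: the true value of 70, the noise level, the 1.5 offset and the function name simulate_measurements are all hypothetical and do not come from the video.

```python
import random
import statistics


def simulate_measurements(true_value, n, noise_sd, offset=0.0, seed=0):
    """Return n simulated measurements of true_value.

    noise_sd controls the random error; offset is a constant systematic error.
    """
    rng = random.Random(seed)
    return [true_value + offset + rng.gauss(0.0, noise_sd) for _ in range(n)]


TRUE_VALUE = 70.0  # hypothetical true value, e.g. a weight in kg

# Random error only: values scatter, but around the true value.
random_only = simulate_measurements(TRUE_VALUE, n=10_000, noise_sd=2.0)

# Same random error plus a hypothetical systematic offset of +1.5.
with_offset = simulate_measurements(TRUE_VALUE, n=10_000, noise_sd=2.0, offset=1.5)

# Average error with random error only: close to zero.
print(round(statistics.mean(random_only) - TRUE_VALUE, 2))
# Average error with a systematic error: close to the offset.
print(round(statistics.mean(with_offset) - TRUE_VALUE, 2))
```

Taking more measurements shrinks the effect of random error on the average, but it does nothing to remove the offset, which is why systematic error is the kind that leads to bias.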

What is statistical bias?

Now that we understand the difference between random and systematic errors, and how systematic errors can lead to inaccurate conclusions, we will refer to such inaccuracies as statistical bias for the rest of this video. When we say statistical bias, what we really mean is a statistic that differs from the reality it is trying to measure because of systematic errors in the way the data were collected, reported or analyzed.

Where to look for statistical bias

Biased statistics can come from any number of data sources, be it survey data, administrative data, big data, etc. There are also many types of errors that can lead to bias. Today, however, we will focus on three particular areas susceptible to systematic errors that can lead to biased statistics. They are: first, data collection; second, measurement; and third, analytics.

Data collection

Beginning with data collection, bias can be a result of systematic errors in the way the data are collected, resulting in data that do not adequately represent the population you are trying to measure.

Some examples of bias include:

coverage bias,
non-response bias,
and self-selection bias.

Coverage bias

Coverage bias occurs when, due to the way the data collection process was designed, it excludes groups that are part of the target population or includes groups that are not. The main sources of coverage errors are:

Undercoverage, meaning a failure to include all members of the population that should be included, and
Overcoverage, meaning the inclusion of members that should not be part of the population.

For example, a survey is trying to measure the daily spending habits of Canadians, but the questionnaire is only available on smartphones. The results of the survey will not include data from people without smartphones. And since the number of people with smartphones is smaller than the target population of all Canadians, there is a coverage bias because part of the population, those without smartphones, is not being covered by the survey.
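The smartphone example can be sketched as a simulation. Every number here is invented for illustration: the 85% smartphone ownership rate, the average daily spending of 60 for owners and 40 for non-owners, and the sample size are assumptions, not figures from the video or from any survey.

```python
import random
import statistics

rng = random.Random(42)

# Hypothetical population: 85% own a smartphone, and owners are ASSUMED to
# spend more per day on average than non-owners.
population = []
for _ in range(50_000):
    owns_phone = rng.random() < 0.85
    mean_spend = 60.0 if owns_phone else 40.0
    daily_spend = max(0.0, rng.gauss(mean_spend, 15.0))  # spending cannot be negative
    population.append((owns_phone, daily_spend))

# The value the survey is trying to measure: average spending of everyone.
true_mean = statistics.mean(spend for _, spend in population)

# The questionnaire only reaches smartphone owners (undercoverage).
frame = [spend for owns_phone, spend in population if owns_phone]
sample = rng.sample(frame, 2_000)
survey_mean = statistics.mean(sample)

print(f"true mean: {true_mean:.2f}, survey estimate: {survey_mean:.2f}")
```

Under these assumptions the survey estimate sits above the true average no matter how large the sample is, because the people who were left out differ from the people who were covered.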

Non-response bias

Non-response bias occurs when respondents differ from those who choose not to respond.

Some causes of non-response bias include a lack of interest in the topic. For example, people may be less likely to respond to a survey if they feel it does not interest them or benefit them personally. Sensitive topics can also lead to non-response if someone feels the questionnaire is asking questions that are too personal or sensitive.
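A minimal sketch of how interest in the topic can produce non-response bias is shown below. It assumes a hypothetical population in which each person has an interest score between 0 and 1, the measured outcome rises with interest, and more interested people are more likely to respond. The scores, the outcome formula and the response rule are all invented for illustration.

```python
import random
import statistics

rng = random.Random(7)

# Hypothetical population of (outcome, responded) pairs.
people = []
for _ in range(50_000):
    interest = rng.random()                            # interest in the topic, 0 to 1
    outcome = 5.0 + 3.0 * interest + rng.gauss(0.0, 1.0)  # ASSUMED link to interest
    responded = rng.random() < interest                # more interest, more likely to respond
    people.append((outcome, responded))

everyone = statistics.mean(o for o, _ in people)
respondents = statistics.mean(o for o, r in people if r)
nonrespondents = statistics.mean(o for o, r in people if not r)

print(f"everyone: {everyone:.2f}")
print(f"respondents only: {respondents:.2f}")
print(f"non-respondents: {nonrespondents:.2f}")
```

In this sketch the average among respondents is higher than the average for everyone, because the people who did not respond differ systematically from those who did.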

Self-selection bias

Self-selection bias occurs when individuals who volunteer to provide data or participate in a study differ from those who do not volunteer. You might even say that self-selection bias is the exact opposite of non-response bias, even though they both contribute to inaccurate conclusions.

Measurement

The next area we will explore in our search for sources (or causes) of statistical bias is measurement. Measurement bias occurs when there are systematic errors in the way the concept of interest is measured or reported.

Some examples include:

recall bias,
social desirability bias,
leading questions and
faulty measurement tools.

Recall bias

Recall bias occurs when respondents do not remember previous events or experiences accurately or omit details. For example, a respondent may have difficulty remembering how much they paid for gas in the past month. Or, if asked about visits to the doctor in the past year, the respondent might include a visit from 15 months ago, or forget a visit from 10 months ago.

Social desirability bias

Social desirability bias occurs when participants, either consciously or subconsciously, respond to questions in an attempt to present a more positive self-image. For example, someone might over-report what they consider a "good" behavior, like the amount of exercise they do in a day or the amount of fruits and vegetables they eat, or they could under-report more socially undesirable behaviors, like smoking.

Leading questions

Leading questions occur when a survey question prompts, encourages or guides the respondent toward a previously determined or desired answer. For example, the wording "Most people think this is a great restaurant. Do you agree?" may elicit more positive responses than the more neutral alternative, "How would you rate this restaurant?"

Faulty measurement tools

Bias can occur when tools or measures used to collect data are faulty, malfunction or are used incorrectly, leading to systematically different measurements. For example, a measurement tool such as a scale in a doctor's office that is improperly calibrated will consistently report incorrect weights.

Analytics

So far we have covered how errors can lead to bias in the data collection and measurement stages, but in this third and final section of the video, we will discuss analytics bias, which occurs when data analysis is conducted using non-representative data or when a model or researcher skews the results of a study towards a specific outcome.

Some examples of analytics bias include:

confirmation bias and
modelling bias.

Confirmation bias

If analysis is conducted to support a specific point of view or narrative, it may be biased, meaning it could ignore or exclude important elements that do not fit that point of view or narrative. Confirmation bias occurs when data analysts only choose data and results that agree with their hypothesis or beliefs.

Modelling bias

Bias can occur in data modelling when the data used are not representative or when the model, or algorithm, is itself biased and does not accurately represent the phenomenon it seeks to represent.

One example of training data not being representative is the use of a company's historical data to staff a new position. If the algorithm is trained on data showing that successful hires and promotions at the company were mostly men, then it will learn to seek out and continue to suggest that men be placed in future roles.

An example of a biased algorithm, however, would be if the algorithm was programmed to pre-filter any results by excluding candidates with last names that include characters not present in the English alphabet.

Recap of Key Points

To recap what we learned in this video:

There are two main types of error: random error and systematic error.
Statistical bias refers to differences between an estimate and the true value.
And the three particular areas susceptible to errors that can lead to bias are the population covered by the data, the measurement of the concepts of interest, and the analysis or the methods used for analysis.

(The Canada Wordmark appears.)
