Section 3 – Discharge Abstract Database (DAD) Analytic File Access

DAD Licence

Licence Agreement for the Discharge Abstract Database (DAD) Research Analytic Files from the Canadian Institute for Health Information (CIHI)

Description of Product

1. The Discharge Abstract Database (DAD) Research Analytic Files referred to in this Agreement relate to research analytic files ('clinical and geographic') that are de-identified samples from CIHI's Discharge Abstract Database (sampled from fiscal years 2009-2010, 2010-2011, 2011-2012 and 2012-2013) contained in the DLI Collection (the "CIHI Files"). The Database (DAD) will now include samples from CIHI's Discharge Abstract Database sampled from fiscal year 2013-2014 and any future fiscal years that could be applicable.

Contact and Custodian

2.1 The Licensee hereby nominates the DLI contact as the contact person to whom all further communication shall be addressed by Statistics Canada or CIHI on any matter concerning this Agreement.

2.2 The Licensee hereby nominates the DLI contact as the designated custodian of the CIHI Files with responsibility for ensuring their proper use and custody pursuant to the terms of this Agreement.

Delivery of Product

3. Upon signature of this Agreement, Statistics Canada shall provide to the Licensee access to the CIHI Files and one copy of the related documentation.

Ownership

4. The CIHI Files and related documentation shall at all times be and remain the sole and exclusive property of CIHI, it being mutually agreed that this Agreement involves a limited licence for the use of the CIHI Files and related documentation and that nothing contained herein shall be deemed to convey any title or ownership interest in the CIHI Files or the related documentation to the Licensee.

Use of CIHI Files

4.1 Statistics Canada hereby grants to the Licensee a non-exclusive, non-assignable and non-transferable limited licence to use the CIHI Files and related documentation for statistical and research purposes. The CIHI Files and related documentation shall not be used for any other purposes without the prior written consent of CIHI.

4.2 Use of the CIHI Files and related documentation is limited to the Licensee. The CIHI Files and related documentation cannot be reproduced or transmitted to any person or organization outside of the Licensee's organization.

4.3 The Licensee shall not merge or link the records on the CIHI Files with any other databases for the purpose of attempting to identify an individual person, business or organization.

4.4 The Licensee shall not present information from the CIHI Files in such a manner that gives the appearance that the Licensee may have received, or had access to, information held by CIHI about any identifiable person, business or organization.

4.5 The Licensee shall not disassemble, decompile or in any way attempt to reverse engineer any software provided as part of the CIHI Files.

No Warranty and No Liability

5. The CIHI Files are licensed 'as-is,' and CIHI makes no representation or warranties whatsoever with respect to the CIHI Files, whether express or implied, in relation to the CIHI Files and expressly disclaims any implied warranty of merchantability or fitness for a particular purpose of the CIHI Files.

The Canadian Institute for Health Information or any of its officials, employees, agents, successors and assigns shall not be liable for any errors or omissions in the CIHI Files and shall not, under any circumstances, be liable for any direct, indirect, special, incidental, consequential or other loss, injury or damage, however caused, that you may suffer at any time by reason of your possession of, access to or use of the CIHI Files arising out of the exercise of your rights or the fulfilment of your obligations under this agreement.

Publication by the Licensee

6. In any publication of any information based on the CIHI Files, the Licensee shall use the following form of accreditation:

"Parts of this material are based on the Canadian Institute for Health Information Discharge Abstract Database Research Analytic Files (sampled from fiscal years 2009–2010 and 2010–2011). However the analysis, conclusions, opinions and statements expressed herein are those of the author(s) and not those of the Canadian Institute for Health Information."

Condition of Use

7. Statistics Canada may modify this agreement at any time with respect to the Licensee's right to use the CIHI Files and related documentation and such modifications shall be effective immediately upon posting of the modified agreement on the Statistics Canada Website.

Term

8. This Agreement comes into force when signed by both Parties and shall continue in force until terminated in accordance herewith.

Termination

9.1 Statistics Canada may, by providing ten days written notice to the Licensee, terminate this Agreement if the Licensee fails to comply with any of the terms of this Agreement and to remedy such breach within the notice period.

9.2 In the event of termination, the Licensee must immediately return the CIHI Files and related documentation to Statistics Canada, or destroy them and certify this destruction in writing to Statistics Canada.

10. Any notice to be given to Statistics Canada or the Licensee shall be in writing and sent by registered mail or electronic mail.

11. Sections 4, 6 and 8 hereof survive the termination of this Agreement pursuant to Section 11.

Amendment

12. No amendment to this Agreement shall be valid unless it is reduced to writing and signed by the Parties hereto.

Entire Agreement

13. This Agreement constitutes the entire agreement between Statistics Canada and the Licensee with respect to the Licensee's right to use the CIHI Files and related documentation.

Appropriate Law

14. This Agreement shall be governed and construed in accordance with the laws of the province of Ontario and the laws of Canada applicable therein. The parties hereby attorn to the exclusive jurisdiction of the Federal Court of Canada.

Affirmation

I acknowledge that I have read and understand the terms and conditions under which the data product(s) can be used and that the organization will abide by them.

  • Licence Administrator (please print)
  • Academic Institution
  • Date

Section 2 – Social Policy Simulation Database and Model (SPSD/M) Access

SPSD/M Licence

Licence Agreement for the Social Policy Simulation Database and Model

This Agreement ("Agreement") is made

Between: His Majesty the King in Right of Canada, as represented by the Minister of Innovation, Science and Economic Development, having been designated as the Minister for the purposes of the Statistics Act (referred to herein as "Statistics Canada"),

And: (Name of the Other Party), (Referred to herein as the "Licensee").

In consideration of the mutual obligations, hereinafter set forth, and for good and valuable consideration, Statistics Canada and the Licensee agree as follows:

1. Definitions

1.1 "Software Product" means the computer program(s), and any related documentation, as described in Part 1 of Appendix A attached hereto.

1.2 "Use" means the execution of the Package on a computer, and includes the reading of the related documentation by automated and/or human means.

1.3 "Database" means the non-identifiable microdata and related documentation as described in Part 2 of Appendix A. Data in the Database is synthetic and contains information that has been created using data from a variety of sources.

1.4 "Package" means the Software Product and the Database, collectively.

2. Grant of Licence

2.1 Statistics Canada grants to the Licensee, a non-exclusive, non-assignable and non-transferable licence to Use the Package for statistical and research purposes, subject to the terms and conditions contained in this Agreement.

2.2 Statistics Canada grants to Licensee, the licence to make copies of the Package provided that the use of these copies conforms to the terms and conditions of the Agreement.

3. Restrictions on Use

3.1 The Licensee shall not Use the Package or any part thereof to develop or derive any other software product for distribution or commercial sale. No part of the Package nor any right granted under this Agreement shall be sold, rented, leased, lent, sub-licensed or transferred to any other person or organization without a separate licence.

3.2 The Licensee shall not merge or link the records in the Database with any other databases for the purpose of attempting to identify an individual person, business or organization.

3.3 The Licensee shall not present information from the Package in such a manner that gives the appearance that the Licensee may have received, or had access to, information held by Statistics Canada about any identifiable person, business or organization.

4. Publication

4.1 The Licensee may publish written reports analyzing the results of any use by the Licensee of the Package pursuant to this Agreement, provided that each such report contains the following notice: "This analysis is based on Statistics Canada's Social Policy Simulation Database and Model. The assumptions and calculations underlying the simulation results were prepared by [_?_] and the responsibility for the use and interpretation of these data is entirely that of the author(s)."

4.2 The Licensee may make oral statements, to the media or otherwise, analyzing the results of any use by the Licensee of the Package pursuant to this Agreement provided that the Licensee ensures that each statement includes the notice set out in Paragraph 4.1.

5. Delivery of Products and Services

5.1 Upon execution of this Agreement by the Licensee, Statistics Canada shall deliver to the Licensee:

5.1.1 One (1) copy of the Package described in Appendix A attached.

5.2 Statistics Canada may from time to time deliver to the Licensee enhancements to the Software Product developed by Statistics Canada, and all such enhancements so delivered shall be deemed to form part of the Package for purposes of this Agreement.

6. Installation

6.1 Installation of the Package on the computer system of the Licensee shall be the responsibility of the Licensee in accordance with the conditions set out in Paragraph 2.1.

7. Term and Effective Date

7.1 This Agreement is effective from the date of execution by the parties and shall continue until terminated in accordance with this Agreement.

8. Termination

8.1 Either Party may terminate this Agreement, without cause, upon thirty (30) days written notice. The termination shall become effective at the date mutually agreed upon by both Parties.

8.2 Statistics Canada may terminate this Agreement by written notice to the Licensee if the Licensee breaches any condition of this Agreement. Such termination by Statistics Canada shall be in addition to and without prejudice to such rights and remedies as may be available to Statistics Canada including injunction and other equitable remedies.

8.3 Upon termination by either Statistics Canada or the Licensee under 8.1 or 8.2 above, the Licensee shall immediately:

8.3.1 Cease using the Package, and;

8.3.2 Return to Statistics Canada all copies of the Package or destroy all copies thereof in the Licensee's possession, as Statistics Canada may request, and;

8.4 Within ten (10) days thereafter, Licensee must provide to Statistics Canada, with written notice, a sworn statement confirming that the Licensee has complied with the foregoing.

9. Notice

9.1 Any written notice provided for in this Agreement shall be deemed to be effectively given if hand-delivered or sent by pre-paid registered mail, addressed as follows:

For Statistics Canada:

Data Access Division
Data Liberation Initiative
Statistics Canada
100 Tunney's Pasture Driveway
10th Floor, Section L
Ottawa, ON, K1A 0T6

Any notice hand delivered shall be deemed delivered, in the case of the Licensee, on the day it is left with the official set out above at the address above and in the case of Statistics Canada on the day it is left with the official set out above at the address set out above. Any notice given by registered mail shall be deemed delivered on the day the postal receipt is acknowledged by the other Party.

10. Ownership

10.1 The Licensee acknowledges that the Package and all intellectual property rights relating to the Package are owned by Statistics Canada subject to the rights of third parties therein. Nothing contained in this Agreement shall be deemed to convey to the Licensee any title or ownership in the Package.

10.2 The Licensee agrees that any additional Package components including but not limited to training and procedural materials, shall remain the exclusive property of Statistics Canada.

11. Assignment

11.1 This Agreement shall not be assigned in whole or in part by the Licensee without the prior written consent of Statistics Canada, and any assignment made without such consent shall be void and of no effect.

12. Warranties and Disclaimers

12.1 The Package is provided "as is". Statistics Canada makes no other warranties, guarantees or representations, express or implied, including but not limited to warranties of merchantability, fitness for intended use, and fitness for any particular purpose, with respect to the Package.

13. Waiver

13.1 The waiver or failure of Statistics Canada to exercise in any respect any rights provided for in this Agreement shall not be deemed a waiver of such right, nor shall it preclude the subsequent exercise of such right or the exercise of any other right.

14. Liability

14.1 Statistics Canada shall not be liable to the Licensee for any design, performance, other fault or inadequacy or unauthorized use of the Package pursuant hereto or for damages of any kind arising out of or in any way related to or connected with such fault, inadequacy or unauthorized use of the Package.

15. Indemnification

15.1 The Licensee shall at all times indemnify and save harmless Statistics Canada from and against all claims, losses, damages, costs, expenses, actions and other proceedings made, sustained, brought, prosecuted, in any manner based upon, occasioned by or attributable to the Use of the Package provided to the Licensee pursuant to this Agreement.

16. Survival of Rights

16.1 The sections of this Agreement regarding warranties and disclaimers, liability, indemnification, and any other provisions which by their nature survive the termination or expiry of this Agreement shall survive expiration or termination of this Agreement and shall bind the Parties hereto.

17. Invalidity

17.1 The invalidity of any particular provision of this Agreement shall not affect any other provision thereof, and the Agreement shall be construed as if such invalid provision were omitted.

18. Amendment

18.1 No amendment of this Agreement nor waiver of any of the terms and conditions contained therein shall be valid unless it is written and signed by each Party.

19. Conflict of Interest

19.1 It is a term of this Agreement that no former public office holder in Canada, who is not in compliance with the post-employment provisions of the Conflict of Interest and Post Employment Code for Public Office Holders, shall derive a direct benefit from this Agreement.

20. Entire Agreement and Appropriate Law

20.1 This Agreement, including all Appendices, constitutes the entire Agreement between the Parties with respect to the subject matter hereof and supersedes all previous negotiations, communications and other Agreements between them.

20.2 The headings preceding the paragraphs of this agreement are for convenience only, are not a part of this Agreement, and do not in any way limit or amplify the terms and conditions of this Agreement.

20.3 This Agreement shall be governed and construed in accordance with the laws in force in the Province of Ontario, Canada.

21. Use of Licensee's Name

21.1 The Licensee authorizes Statistics Canada to use, for the duration of this licence, its name in any promotional material which may be developed for the Package, provided that Statistics Canada has furnished the Licensee a copy of the material thirty (30) days prior to such use and has secured the Licensee's written approval.

Appendix A

Part 1: Software Product

Product Name: Social Policy Simulation Model

Product Description: The Social Policy Simulation Model (SPSM) is a tool designed to assist those interested in analysing the financial interactions of governments and individuals in Canada. It can help one to assess the cost implications or income redistributive effects of changes in the personal taxation and cash transfer system. The model reads the Social Policy Simulation Database (SPSD). The SPSM is a static accounting model which processes each individual and family on the SPSD, calculates taxes and transfers using legislated or proposed programs and algorithms, and reports on the results.

Part 2: Database

Product Name: Social Policy Simulation Database

Product Description: The Social Policy Simulation Database (SPSD) is a non-confidential, statistically representative database of individuals in their family context. It is used in conjunction with the Social Policy Simulation Model (SPSM).

Affirmation

I acknowledge that I have read and understand the terms and conditions under which the data products are supplied. I agree to abide by these conditions and to take all reasonable measures required to enforce and administer them within my Academic Institution.

  • Licence Administrator (please print)
  • Academic Institution
  • Signature
  • Date
  • DLI contact (please print)
  • Signature
  • Date

Section 1 – Postal codeOM Conversion File (PCCF) Access

PCCF Licence

End-use Licence Agreement for Postal codeOM Conversion File, Postal Codes OM by Federal Ridings File and Postal codeOM Conversion File Plus ("data product")

The Government of Canada (Statistics Canada) is the owner or a licensee of all intellectual property rights in this data product. With your payment of the requisite fee, you (hereinafter referred to as 'the Licensee') are granted a non-exclusive, non-assignable and non-transferable licence to use this data product subject to the terms below. This licence is not a sale of any or all of the rights of the owner(s). The data product includes information taken with permission from © Canada Post Corporation. All rights reserved. Information taken with permission from Canada Post Corporation does not form part of the Government of Canada open data portal.

Terms of use

  1. The Licensee shall not lend, rent, lease, sublicense, distribute, make public, transfer or sell any part of the data product nor any right granted under this agreement to any person outside the licensed organization or to any other organization.
  2. The Licensee shall not disassemble, decompile or in any way attempt to reverse engineer any part of the data product.
  3. The Licensee shall not use any part of the data product to develop or derive any other data product or data service for external distribution or commercial sale.
  4. The Licensee shall not use the data product other than for the purpose of matching Postal codeOM data to geography in accordance with Appendix 'A'.
  5. The Licensee shall not use the data product for the following mail preparation purposes:
    • addressing mail;
    • presorting addressed mail;
    • preparing unaddressed mail by householder count for delivery
  6. The Licensee agrees not to merge or link the data product with any other databases in such a fashion that gives the appearance that the Licensee may have received, or had access to, information held by Statistics Canada about any identifiable individual, family, household, organization or business.
  7. The Licensee is granted rights of use of the content of this data product for the purpose expressly described in Appendix 'A', attached, and for no other purpose. In such cases, the source of the data must be acknowledged in all documents and communications by providing the following source citation at the bottom of each table and graph:
    Source: (or Adapted from) Statistics Canada (Select either Postal codeOM Conversion File or Postal CodesOM by Federal Ridings File and Postal CodeOM Conversion File Plus (August 2018) which is based on data licensed from Canada Post Corporation.
  8. The Licensee may publish an extract of up to 1% of the data product pursuant to Appendix 'A'. This permission includes the use of the extracted data in support of analyses and in reporting of results and conclusions. The Licensee shall obtain approval from Statistics Canada before publishing an extract of the data product in excess of 1%.
  9. The Licensee is authorized to provide the data product to contractors/consultants only for the purpose of "providing data manipulation and consulting services exclusively to Licensee. Upon completion of work, the contractor/consultant must i) return all data products to Licensee, and ii) delete the data product from their systems and premises." Contractors or consultants may not use the data product or derived products for their own purposes or to offer services to third parties.
  10. The Licensee must display the following disclaimer on each search access point if the data product is used in a search engine application pursuant to Appendix 'A': "this tool does not validate Postal codesOM".
  11. The Licensee agrees to permit Statistics Canada to provide to Canada Post upon request, a copy of this signed agreement for audit purposes only.

Term

This licence is effective as of the date of signature and shall automatically terminate if any of the terms and conditions are violated.

Termination

Any violation of this licence renders it void and of no effect. This agreement will terminate automatically without notice if the Licensee fails to comply with any of the terms of this agreement.

Statistics Canada or the Licensee may terminate this agreement without cause upon thirty (30) days written notice or at a time otherwise agreed upon by both parties. In the event of termination, the Licensee must immediately return the data product to Statistics Canada or destroy it and certify this destruction in writing to Statistics Canada.

Warranties and disclaimers

This data product is provided 'as-is,' and Statistics Canada and its licensors make no warranty, either express or implied, including but not limited to, warranties of merchantability and fitness for a particular purpose. In no event will Statistics Canada or its licensors be liable for any direct, special, indirect, consequential or other damages, however caused.

Indemnification

The Licensee shall at all times indemnify and save harmless Statistics Canada, its officers, servants and agents from and against all claims, demands, losses, damages, costs, actions, or other proceedings made, sustained, brought or prosecuted by any person, in any manner, based upon, occasioned by or attributed to any injury, infringement, or damage arising out of the use of the data product, or arising out of a breach, by the Licensee, of any of its obligations under this agreement.

Acceptance of terms

It is your responsibility to ensure that your use of this data product complies with these terms. Any infringement of Statistics Canada's rights may result in legal action. Any use whatsoever of this data product shall constitute your acceptance of the terms of this agreement.

For further information please contact:

Statistics Canada
Statistical Registers and Geography Division
PCCF-CCP
170 Tunney's Pasture Driveway
Jean Talon Building, 3rd floor, Section D3
Ottawa, Ontario
Canada
K1A 0T6
Email: statcan.pccf-fccp.statcan@statcan.gc.ca

Appendix A

Approved Postal Code Data Matching Uses for DLI Accredited Canadian Post-Secondary Institution

Matching Postal codeOM data to geography for:

  • Teaching and learning purposes.
    E.g.: Students can download PCCF on their laptop to do their assignments. This includes projects, maps, analytical papers, etc. Faculty can download and use the PCCF in teaching exercises.
  • Research purposes
    E.g.: Can be used in analysis to write articles that are published in journals. The data is not shared but the results are published. This also includes a thesis for a Master's or Doctorate where results are required to be public.
  • Planning purposes – where the institution can use the information in planning student recruitment activities or find out where these students are coming from.

Affirmation

I acknowledge that I have read and understand the terms and conditions under which the data product(s) can be used and that the organization will abide by them.

  • Licence Administrator (please print)
  • Academic Institution
  • Signature
  • Date
  • DLI contact (please print)
  • Signature
  • Date

Wholesale Trade Survey (monthly): CVs for total sales by geography - February 2021

Unit: percentage. Rows are geographies; columns are months.

| Geography | 202002 | 202003 | 202004 | 202005 | 202006 | 202007 | 202008 | 202009 | 202010 | 202011 | 202012 | 202101 | 202102 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Canada | 0.7 | 0.6 | 0.8 | 0.8 | 0.7 | 0.7 | 0.7 | 0.7 | 0.5 | 0.6 | 0.6 | 0.7 | 0.7 |
| Newfoundland and Labrador | 0.3 | 1.2 | 0.7 | 0.5 | 0.1 | 0.2 | 0.4 | 0.3 | 0.3 | 0.4 | 0.4 | 0.7 | 0.5 |
| Prince Edward Island | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| Nova Scotia | 2.0 | 2.8 | 3.3 | 4.0 | 2.3 | 1.5 | 1.8 | 1.7 | 2.4 | 3.4 | 7.5 | 1.5 | 1.1 |
| New Brunswick | 1.2 | 1.3 | 2.1 | 3.3 | 1.9 | 2.1 | 4.2 | 3.4 | 2.6 | 2.8 | 2.9 | 2.8 | 2.5 |
| Quebec | 2.1 | 1.6 | 2.4 | 2.0 | 1.9 | 1.8 | 2.1 | 2.0 | 1.5 | 1.5 | 1.7 | 1.7 | 1.8 |
| Ontario | 0.9 | 1.0 | 1.2 | 1.1 | 1.1 | 1.1 | 0.9 | 0.9 | 0.8 | 0.9 | 0.8 | 1.2 | 1.1 |
| Manitoba | 0.8 | 1.0 | 2.9 | 2.8 | 1.2 | 1.2 | 1.8 | 2.3 | 1.4 | 1.4 | 1.8 | 1.6 | 2.4 |
| Saskatchewan | 0.6 | 0.5 | 1.2 | 0.7 | 0.7 | 1.1 | 1.6 | 0.6 | 0.8 | 0.7 | 0.9 | 0.8 | 0.8 |
| Alberta | 0.9 | 1.2 | 2.9 | 2.9 | 2.3 | 2.3 | 1.8 | 3.3 | 1.3 | 1.1 | 1.7 | 1.0 | 1.1 |
| British Columbia | 1.6 | 1.5 | 1.3 | 1.7 | 1.6 | 1.3 | 1.9 | 1.8 | 1.4 | 1.4 | 1.4 | 1.3 | 1.3 |
| Yukon Territory | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| Northwest Territories | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| Nunavut | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
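
As a reading aid for the CVs above, the following is a minimal Python sketch of how a coefficient of variation can be turned back into a standard error and an approximate interval around an estimate. It assumes the usual definition of a CV as the standard error divided by the estimate, expressed in percent, and it assumes a normal approximation for the interval. The sales figure is hypothetical and is not taken from the survey; only the two CV values come from the table. The function names are illustrative.

```python
def standard_error(estimate: float, cv_percent: float) -> float:
    """Standard error implied by an estimate and its CV expressed in percent."""
    return estimate * cv_percent / 100.0


def approx_confidence_interval(estimate: float, cv_percent: float, z: float = 1.96):
    """Approximate interval estimate +/- z * SE (normal approximation, an assumption)."""
    se = standard_error(estimate, cv_percent)
    return (estimate - z * se, estimate + z * se)


# Hypothetical sales estimate (illustrative only, not a published figure).
hypothetical_sales = 1_000.0

# CV values as they appear in the table above.
for label, cv in [("Canada, 202102", 0.7), ("Nova Scotia, 202012", 7.5)]:
    low, high = approx_confidence_interval(hypothetical_sales, cv)
    print(f"{label}: CV {cv}% -> approximate 95% interval {low:.1f} to {high:.1f}")
```

For the same hypothetical estimate, the larger CV yields a much wider interval, which is why lower CVs indicate more reliable estimates.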

Retail Commodity Survey: CVs for Total Sales (fourth quarter 2020)

| NAPCS-CANADA | 2019Q4 | 2020Q1 | 2020Q2 | 2020Q3 | 2020Q4 |
| --- | --- | --- | --- | --- | --- |
| Total commodities, retail trade commissions and miscellaneous services | 0.50 | 0.49 | 0.53 | 0.55 | 0.66 |
| Retail Services (except commissions) [561] | 0.50 | 0.49 | 0.53 | 0.54 | 0.65 |
| Food at retail [56111] | 0.67 | 0.52 | 0.63 | 0.59 | 0.72 |
| Soft drinks and alcoholic beverages, at retail [56112] | 0.45 | 0.43 | 0.49 | 0.48 | 0.58 |
| Cannabis products, at retail [56113] | 0.05 | 0.00 | 0.00 | 0.00 | 0.00 |
| Clothing at retail [56121] | 0.65 | 0.70 | 1.16 | 0.82 | 1.46 |
| Footwear at retail [56122] | 0.97 | 1.19 | 2.94 | 1.92 | 1.77 |
| Jewellery and watches, luggage and briefcases, at retail [56123] | 1.69 | 5.93 | 14.50 | 9.52 | 1.89 |
| Home furniture, furnishings, housewares, appliances and electronics, at retail [56131] | 0.64 | 0.63 | 0.61 | 0.60 | 0.60 |
| Sporting and leisure products (except publications, audio and video recordings, and game software), at retail [56141] | 1.79 | 2.61 | 1.94 | 3.04 | 2.06 |
| Publications at retail [56142] | 6.47 | 7.22 | 9.41 | 7.54 | 6.81 |
| Audio and video recordings, and game software, at retail [56143] | 3.09 | 3.65 | 2.66 | 5.76 | 6.11 |
| Motor vehicles at retail [56151] | 1.80 | 1.65 | 1.98 | 1.88 | 2.71 |
| Recreational vehicles at retail [56152] | 3.48 | 2.83 | 4.43 | 2.78 | 5.45 |
| Motor vehicle parts, accessories and supplies, at retail [56153] | 1.28 | 1.41 | 1.46 | 1.53 | 1.62 |
| Automotive and household fuels, at retail [56161] | 2.07 | 1.96 | 3.49 | 2.20 | 2.21 |
| Home health products at retail [56171] | 2.72 | 2.53 | 2.59 | 2.49 | 3.37 |
| Infant care, personal and beauty products, at retail [56172] | 2.61 | 2.71 | 3.30 | 2.71 | 2.86 |
| Hardware, tools, renovation and lawn and garden products, at retail [56181] | 1.89 | 1.38 | 1.93 | 1.23 | 1.43 |
| Miscellaneous products at retail [56191] | 2.17 | 2.04 | 2.69 | 2.31 | 2.23 |
| Total retail trade commissions and miscellaneous services (Footnote 1) | 1.42 | 1.41 | 1.55 | 1.63 | 1.63 |

Footnotes

Footnote 1

Comprises the following North American Product Classification System (NAPCS): 51411, 51412, 53112, 56211, 57111, 58111, 58121, 58122, 58131, 58141, 72332, 833111, 841, 85131 and 851511.

Retail Commodity Survey: CVs for Total Sales (January 2021)

| NAPCS-CANADA | 202010 | 202011 | 202012 | 202101 |
| --- | --- | --- | --- | --- |
| Total commodities, retail trade commissions and miscellaneous services | 1.23 | 0.58 | 1.17 | 0.69 |
| Retail Services (except commissions) [561] | 1.21 | 0.58 | 1.15 | 0.69 |
| Food at retail [56111] | 1.25 | 0.68 | 0.90 | 1.22 |
| Soft drinks and alcoholic beverages, at retail [56112] | 0.76 | 0.57 | 0.59 | 0.77 |
| Cannabis products, at retail [56113] | 0.05 | 0.00 | 0.00 | 0.00 |
| Clothing at retail [56121] | 1.61 | 1.73 | 1.52 | 1.78 |
| Footwear at retail [56122] | 1.73 | 1.86 | 1.94 | 3.53 |
| Jewellery and watches, luggage and briefcases, at retail [56123] | 6.60 | 2.11 | 2.94 | 5.53 |
| Home furniture, furnishings, housewares, appliances and electronics, at retail [56131] | 0.70 | 0.64 | 0.70 | 0.93 |
| Sporting and leisure products (except publications, audio and video recordings, and game software), at retail [56141] | 2.74 | 1.72 | 1.67 | 2.64 |
| Publications at retail [56142] | 6.44 | 5.91 | 7.64 | 9.86 |
| Audio and video recordings, and game software, at retail [56143] | 6.87 | 5.72 | 6.88 | 10.74 |
| Motor vehicles at retail [56151] | 4.73 | 2.04 | 5.14 | 2.12 |
| Recreational vehicles at retail [56152] | 4.42 | 5.75 | 6.21 | 5.23 |
| Motor vehicle parts, accessories and supplies, at retail [56153] | 2.47 | 1.37 | 2.99 | 1.72 |
| Automotive and household fuels, at retail [56161] | 2.40 | 2.27 | 2.26 | 2.41 |
| Home health products at retail [56171] | 3.32 | 3.70 | 3.44 | 3.65 |
| Infant care, personal and beauty products, at retail [56172] | 3.35 | 3.00 | 3.14 | 2.18 |
| Hardware, tools, renovation and lawn and garden products, at retail [56181] | 1.36 | 1.51 | 1.69 | 1.74 |
| Miscellaneous products at retail [56191] | 2.77 | 2.36 | 2.12 | 2.58 |
| Total retail trade commissions and miscellaneous services (Footnote 1) | 2.38 | 1.47 | 2.43 | 1.42 |

Footnotes

Footnote 1

Comprises the following North American Product Classification System (NAPCS): 51411, 51412, 53112, 56211, 57111, 58111, 58121, 58122, 58131, 58141, 72332, 833111, 841, 85131 and 851511.

Statistics 101: Exploring measures of dispersion

Catalogue number: 892000062020003

Release date: May 3, 2021 Updated: February 7, 2023

How do we describe data in just a few simple terms? Two really important features of a dataset are the location of the centre―or balance point―and the size of the spread.

Try thinking of it this way: if we were to hold the data in our hands, would they be densely concentrated in one spot like a golf ball, or all over the place like cotton candy? The balance point of data is called the central tendency. The size of the region the data cover, and how spread out they are, is called dispersion. In this video, we will explore the concept of dispersion. As a prerequisite, we highly recommend first watching our video called "Statistics 101: Exploring measures of central tendency", as some concepts such as the mean will be discussed in this video.

Data journey step: Explore, clean, describe
Data competency:
  • Data exploration
  • Data interpretation
Audience: Basic
Suggested prerequisites: Statistics 101: Exploring measures of central tendency
Length: 12:07
Cost: Free

Watch the video

Statistics 101: Exploring measures of dispersion - Transcript

(The Statistics Canada symbol and Canada wordmark appear on screen with the title: "Statistics 101 Exploring measures of dispersion")

Statistics 101: Exploring measures of dispersion

How do we describe data in just a few simple terms? Two really important features of a dataset are the location of the centre, or balance point, and the size of the spread. Try thinking of it this way: if we were to hold data in our hands, would they be densely concentrated in one spot like a golf ball, or all over the place like cotton candy? The balance point of data is called the central tendency. The size of the region the data cover, and how spread out they are, is called dispersion. In this video, we will explore the concept of dispersion. However, as a prerequisite to this video, we highly recommend first watching our video called "Exploring Measures of Central Tendency", as some concepts such as the mean will be discussed in this video.

Learning goals

By the end of this video, you should have a basic understanding of such measures of dispersion as range, interquartile range and standard deviation. This video is intended for learners looking to gain a basic understanding of the concept of dispersion, also called variability, what it means and some key related concepts that are used to explore data.

Measures of dispersion

In statistics, dispersion is the extent to which a distribution is stretched or squeezed. Imagine you are expecting a package in the mail. Usually, the mail arrives anytime between 8 a.m. and 4 p.m., which means, if you want to be there when it arrives, your whole day may be spent at home waiting. But, if you know that the mail usually arrives between 8 and 10 a.m., you have a better indication when to expect it. Measures of dispersion also give an indication of how well the measures of central tendency, such as the mean, describe the distribution of values in the dataset. This is useful when using sample data to draw conclusions about behaviors and characteristics of the entire population. Measures of dispersion are also important because they help us make informed decisions about how to analyze the data and how much uncertainty it contains.

Steps of a data journey

(Text on screen: Supported by a foundation of stewardship, metadata, standards and quality)

(Diagram of the Steps of the data journey: Step 1 - define, find, gather; Step 2 - explore, clean, describe; Step 3 - analyze, model; Step 4 - tell the story. The data journey is supported by a foundation of stewardship, metadata, standards and quality.)

This diagram is a visual representation of the data journey from collecting the data to cleaning, exploring, describing and understanding the data, to analyzing the data and lastly, to communicating with others the story the data tell.

Step 2: Explore, clean and describe

(Diagram of the Steps of the data journey with an emphasis on Step 2 - explore, clean and describe)

Exploring measures of dispersion is part of the explore, clean and describe step in the data journey.

What does the spread of data look like?

(Graph representing the number of pizza deliveries as a function of delivery times in a bell shape Normal distribution)

Before we begin, let's take a quick look at some common ways that data are spread, that is, clustered together or spread out. The distribution of data is often represented using scatter plots or histograms, whose shapes show the spread of the dataset. Data can be represented graphically in a symmetrical bell shape, as we see here in a graph of pizza delivery times: most of the values are clustered in the middle, between 20 and 40 minutes, while some pizza deliveries take less time and others take longer. This is what is called a normal distribution, and we will talk more about that later.

(2 separate graphs on the left and right representing a Normal distribution that is positively and negatively skewed, respectively)

If the dataset is not symmetrical, but instead has more values located to the left or right of the graph, the symmetrical shape becomes skewed, which creates a longer tail on one side or the other. A dataset is considered to be skewed in the direction of the longer tail. When data are positively skewed, there is a large number of values located on the left side or "low end" of the graph, causing a tail stretched out to the right. When data are negatively skewed, we see a larger number of values located at the high end of the graph and the tail stretched out toward the left-hand or low section of the graph.

Measures of dispersion

(Flowchart presenting the three common measures of dispersion: Range, Interquartile range and Standard deviation)

Now back to our measures of dispersion. In order to determine the dispersion, three commonly used measures are the range, the interquartile range and the standard deviation. The next few slides look at each individually.

Range

The range is the difference between the largest and the smallest values in a dataset. It provides a quick and easy measure of the spread of these values. The range is best used with data that do not have extreme values. Take our package delivery: if we know the package will be delivered sometime between 10 a.m. and noon, we feel safe making plans to do other things in the day. This kind of range is very useful information. However, if we are told the package will arrive between 8 a.m. and 8 p.m., well, how useful is this information really? How confident would you feel stepping out to run a quick errand at any point in the day and not missing your delivery? Probably not very.

Knowing that the range is the distance between the largest value and the smallest value, we will now put that into the form of an equation. The range is simply the highest value minus the lowest value. In this example, the lowest value is 1, while the highest value is 7. Therefore, the range is 7 minus 1, which is 6. Here, the range is an appropriate measure because the data points are clustered together.

Example

(Table presenting the exam scores of students. The columns, from left to right, are titled: # | Student | Exam score. The first line to the last contains the following: 1 | John | 80%; 2 | Amy | 85%; 3 | Tony | 85%; 4 | Moe | 86%; 5 | Ali | 87%; 6 | Sofia | 88%; 7 | Jose | 90%; 8 | Maria | 90%; 9 | Hugo | 92%; 10 | Louise | 94%; 11 | Sylvain | 95%; 12 | Jade | 95%)

Let's look at an example. Here we have exam scores from a group of 12 students. The highest exam score is 95%. To determine the range, we subtract the lowest exam score, which is 80%. This makes the range 15% which is quite narrow. An advantage of using the range as a measure of dispersion is that it is easy to calculate.

(Table presenting the exam scores of students. The columns, from left to right, are titled: # | Student | Exam score. The first line to the last contains the following: 1 | John | 10%; 2 | Amy | 85%; 3 | Tony | 85%; 4 | Moe | 86%; 5 | Ali | 87%; 6 | Sofia | 88%; 7 | Jose | 90%; 8 | Maria | 90%; 9 | Hugo | 92%; 10 | Louise | 94%; 11 | Sylvain | 95%; 12 | Jade | 95%)

Now, let's look at a similar example, but with one major difference. Here, we have exam scores from the same group of 12 students. The highest exam score is again 95%. To determine the range, we subtract the lowest exam score, which is now 10%. This makes the range 85%. This is a very wide spread. Upon closer inspection, we see one student, John, did quite poorly on the exam, while everyone else did very well. This makes John's score an outlier because 11 out of the 12 students scored between 85% and 95%. His single score is the main cause of this wide spread. And, because the range is the comparison of the smallest to the largest values, we see here, how the range can be a misleading measure of dispersion when there are outliers in the data.
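To make the effect of a single outlier concrete, here is a minimal Python sketch that computes the range for both versions of the exam scores. The function name `value_range` is only an illustrative choice and is not part of the video.

```python
def value_range(values):
    """Return the range: the largest value minus the smallest value."""
    return max(values) - min(values)

# Exam scores (in %) from the two tables in this section.
scores = [80, 85, 85, 86, 87, 88, 90, 90, 92, 94, 95, 95]
scores_with_outlier = [10] + scores[1:]  # John's score changes from 80 to 10

print(value_range(scores))               # 95 - 80 = 15
print(value_range(scores_with_outlier))  # 95 - 10 = 85
```

Only one of the twelve values changed, yet the range grew from 15 to 85, which is why the range can mislead when outliers are present.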

Interquartile range

Similar to the range is the interquartile range. The interquartile range is also the distance between the largest and the smallest value, but only amongst the middle 50 percent of the whole distribution. This makes it slightly more stable than the full range because it does not consider the bottom and top 25% of the data, which helps insulate it against the impact of most outliers.

While it's true that the interquartile range is slightly more stable than the full range, it is important to know that when using the interquartile range as a measure of dispersion, you will lose detail about what is happening at the ends of your distribution.

How to find the interquartile range?

(Text: Dataset= 3, 1, 8, 5, 3, 6, 4, 8, 6, 7)

To find the interquartile range, first, you need to order the data from least to greatest. After placing the 10 numbers that make up the dataset on this slide in a list from smallest to largest, which gives 1, 3, 3, 4, 5, 6, 6, 7, 8, 8, and using the knowledge you obtained in our video on measures of central tendency, you would find the median of the entire dataset. The median is the midpoint when you order all observations from smallest to largest. In this case, because there is an even number of observations, we add the two middle numbers, 5 and 6, and divide by two, which gives 5.5. By calculating the median, we are able to break the data into two halves. This allows us to move on to our next step.

Next, you would again calculate the median, but this time for both the lower and upper halves of the data, which would be 3 for the lower half (1, 3, 3, 4, 5) and 7 for the upper half (6, 6, 7, 8, 8). These two medians are the first and third quartiles, Q1 and Q3. Then, you subtract the lower median from the upper. The interquartile range is the difference between those two numbers, which in this case equals 4. It is important to note that this method works well for simple, short lists of values. But for more complicated datasets, Q1 and Q3 can easily be obtained using software such as Excel.
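The median-of-halves steps described above can be sketched in Python as follows. This is only an illustration: the function name `interquartile_range` is hypothetical, and leaving out the middle value when the list has an odd length is an assumption, since the video only covers an even-length example. The function needs at least two values.

```python
from statistics import median

def interquartile_range(values):
    """IQR using the median-of-halves method described above."""
    ordered = sorted(values)
    half = len(ordered) // 2
    lower = ordered[:half]    # lower half of the ordered data
    upper = ordered[-half:]   # upper half of the ordered data
    return median(upper) - median(lower)

dataset = [3, 1, 8, 5, 3, 6, 4, 8, 6, 7]
print(median(dataset))               # 5.5
print(interquartile_range(dataset))  # 7 - 3 = 4
```

Spreadsheet and statistical software may use other conventions for locating Q1 and Q3, so their results can differ slightly from this hand method.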

Knowledge check

(Table presenting the time it takes for pizza to be delivered for each household. The columns, from left to right, are titled: Household | Minutes taken for pizza to be delivered. The first line to the last contains the following: 1 | 15; 2 | 20; 3 | 25; 4 | 30; 5 | 30; 6 | 35; 7 | 35; 8 | 40; 9 | 45; 10 | 50)

Your turn. Imagine you have ordered a pizza and they tell you it should take around 30 minutes to be delivered. Then imagine 9 other households have done the same thing. What, in this case, does "around 30 minutes" really mean? Here we have a table showing exactly how long each of the ten households had to wait to receive their pizza. To test your knowledge so far, try to calculate the range of the delivery times, in minutes. Then, calculate the interquartile range. Pause the video now and restart once you are ready to check your answers. Did you get 35 for the range and 15 for the interquartile range? If so, good for you! Now we can move on to our next measure of dispersion: standard deviation.

Standard deviation

(Table presenting the exam scores of students. The columns, from left to right, are titled: # | Student | Exam score. The first line to the last contains the following: 1 | John | 80%; 2 | Amy | 85%; 3 | Tony | 85%; 4 | Moe | 86%; 5 | Ali | 87%; 6 | Sofia | 88%; 7 | Jose | 90%; 8 | Maria | 90%; 9 | Hugo | 92%; 10 | Louise | 94%; 11 | Sylvain | 95%; 12 | Jade | 95%)

So far, this video has explained how both the range and interquartile range can give you a good idea of how widely the values in a dataset are spread. But they do not tell you how close the individual values are to the average. This can be very important information to know. For example, let's go back to a class of students. When the teacher adds up everyone's score, she gets a total of 907. And when she divides that number by the number of scores, which is 12, she gets a mean score of about 76%. 76% could be a good score, but is everyone performing at that level? In a class of 12, it is not that difficult to see that a few are struggling. But what about in a class of 200?

(2 separate graphs on the left and right representing a bell shaped Normal distribution with a low and high standard deviation, respectively)

The standard deviation tells you how spread out measurements for a group of values are from the average or mean. It is a number which can be quickly and easily calculated using software such as Microsoft Excel and is considered the most robust of the three different measures of dispersion. Therefore, it is the measure used most often when doing statistical analysis. A low standard deviation means that most of the numbers are close to the mean. So when it comes to a teacher determining how well each of her students is performing, a low standard deviation would tell her that most of her students are performing at around the same level. A high standard deviation would tell her that not everyone is performing at around the same level. So, if the class average were high, a high standard deviation would mean that some students are still struggling.
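As a rough illustration of low versus high standard deviation, the sketch below compares two classes with the same mean. The scores are hypothetical and invented for this example; they are not the scores shown in the video. It uses `pstdev` from Python's standard `statistics` module, which treats the list as a whole population; `stdev` would be the choice for a sample.

```python
from statistics import mean, pstdev

# Hypothetical exam scores (in %), invented for illustration only.
class_a = [74, 75, 76, 77, 78]   # scores close together
class_b = [56, 66, 76, 86, 96]   # scores widely spread

for name, scores in (("A", class_a), ("B", class_b)):
    # pstdev treats the list as the whole population (divides by n).
    print(name, mean(scores), round(pstdev(scores), 2))
```

Both classes have a mean of 76, but class A has a standard deviation of about 1.41 while class B has about 14.14, so the mean describes class A far better than class B.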

(2 separate graphs on the left and right representing a bell shaped Normal distribution with a low and high standard deviation with their means remaining at the center of the distribution, respectively)

But in situations where you just observe and record data, a high standard deviation isn't necessarily a bad thing. It just reflects a large amount of variability in the group that is being studied. For example, if you look at all salaries within a large company, including everyone from the co-op students to the CEO, the standard deviation may be very high. On the other hand, if you narrow the group down by looking only at the co-op students, the standard deviation is lower, because the individuals within this group have salaries that are more similar. The second dataset isn't better, it simply has less variability.

Standard deviation and the Normal distribution

The Normal distribution is one example of a distribution that could help you better understand the concept of standard deviation. In the context of data, a distribution is a mathematical model that mimics how the data points are distributed or dispersed. We often visualize the Normal distribution as a curve shaped like a hilltop or bell. It represents the presence of small and big data points on the left- and right-hand sides of the curve, respectively, while most of the data points are somewhere in the center, where the summit is found. In the Normal distribution, the data points fall in a symmetrical pattern that looks like the curve you see on this slide, which is called a bell curve.

Normal distribution

The Normal distribution is symmetrical, which causes the mean, median and mode to be the same number. These are represented by the line down the center of the bell curve.

(Graph representing a Normal distribution with the mean = median = mode at the summit of the distribution)

For the Normal distribution, the dispersion measure we call standard deviation, or SD on the slide, has some pretty neat properties. It tells us where to expect the data points to be in the distribution. Sampling theory and the Normal distribution tell us that approximately 68% of the data values in the whole population will fall within the mean +/- 1 standard deviation. Similarly, approximately 95% of the data values will fall within the mean +/- 2 times the standard deviation, and approximately 99.7% of the data values will fall within the mean +/- 3 times the standard deviation.
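These percentages can be checked with a short Python sketch using `NormalDist` from the standard `statistics` module (available from Python 3.8). The helper name `share_within` and the default mean and standard deviation are illustrative assumptions.

```python
from statistics import NormalDist

def share_within(k, mu=0.0, sigma=1.0):
    """Share of a Normal distribution lying within mean +/- k standard deviations."""
    dist = NormalDist(mu, sigma)
    return dist.cdf(mu + k * sigma) - dist.cdf(mu - k * sigma)

for k in (1, 2, 3):
    print(k, round(share_within(k), 4))
```

The shares come out at roughly 0.683, 0.954 and 0.997, and they stay the same whatever mean and standard deviation are passed in, which is what makes the rule useful for any Normal distribution.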

Recap of key points

Measures of dispersion provide a quantitative indication of the degree to which data values are spread out or clustered together. In this video, we looked at three common measures of dispersion: range, interquartile range and standard deviation. And we learned that sometimes data can be bell shaped, with most values clustered in the middle, which is often called a Normal distribution.

(The Canada Wordmark appears.)

Statistics 101: Correlation and causality

Catalogue number: 892000062021002

Release date: May 3, 2021 Updated: December 1, 2021

In this video, you will learn how to prove the existence of a relationship, or lack thereof, between two variables. This is a very important part of data analysis.

By the end of this video, you will learn the answers to the following questions:

  • What is correlation?
  • How can you measure, quantify, or interpret correlation when analyzing your data?
  • What is causality?
  • And finally, what are the differences between the two?
Data journey step
Analyze, model
Data competency
  • Data analysis
  • Data driven decision making
  • Data interpretation
  • Data visualisation
Audience
Basic
Suggested prerequisites
N/A
Length
17:27
Cost
Free

Statistics 101: Correlation and causality - Transcript

(The Statistics Canada symbol and Canada wordmark appear on screen with the title: "Statistics 101: Correlation and causality")

Statistics 101: Correlation and causality

This video is intended for viewers who wish to gain a basic understanding of correlation and causality. As a prerequisite, before beginning this video, we highly recommend having already completed our videos titled "What is Data" and "Types of Data".

Learning goals

By the end of this video, you will learn the answers to the following questions: What is correlation? How can you measure, quantify or interpret correlation when analyzing your data? What is causality? And finally, what are the differences between the two?

Steps of a data journey

(Diagram of the Steps of the data journey: Step 1 - define, find, gather; Step 2 - explore, clean, describe; Step 3 - analyze, model; Step 4 - tell the story. The data journey is supported by a foundation of stewardship, metadata, standards and quality.)

This diagram is a visual representation of the data journey, from collecting the data to cleaning, exploring, describing and understanding the data to analyzing the data, and lastly, to communicating with others the story the data tell.

Steps 3 and 4: Analyze, model and tell the story

(Diagram of the Steps of the data journey with an emphasis on Step 3 - analyze, model and Step 4 - tell the story)

Correlation and causality fall under the final two steps of the data journey: analysis and modeling, and telling the story.

Patterns and relationships

(Image combining a hockey stick and a toilet that equals a Stanley Cup with a question mark)

Have you ever noticed the way the human mind really likes patterns? So much so, in fact, that the mind will often create patterns. When two variables appear to be so closely associated, it can seem that one is dependent on the other. For example, Ottawa Senators hockey player Bruce Gardiner was so superstitious that he was convinced the only way he could break the occasional slump in his performance was to dunk his hockey stick in a toilet bowl. Superstitions like this are a great example of how the brain likes to perceive relationships between two things, even when in reality, no such relationships exist. In this video, you will learn how to prove the existence of a relationship, or lack thereof, between two variables. This is a very important part of data analysis.

Correlation in data analysis

In the world of data, correlation refers to the existence of a relationship between two variables. Correlation plays a big part in data analysis. When studying a potential relationship between two variables, it is important to ask yourself the following questions: Does a relationship exist between the two variables? If so, is the relationship positive or negative? What is the strength of this relationship? Is it a strong correlation, a weak correlation, or somewhere in the middle? Correlation can exist between all types of variables, but in statistics, correlation can only be calculated for numeric variables.

What is correlation?

(Table containing data on the change in water temperature in a kettle over time)

Let's start by talking about correlation in everyday life. When we say two or more things are correlated, this means there is a mutual relationship between them. This relationship can be either positive or negative. In a positive correlation, the values of the two related items move in the same direction. Take a kettle full of water, for example: the longer the kettle is on, the hotter the temperature of the water inside the kettle will get. In a negative correlation, the values move in opposite directions. Meaning as one variable increases, the other decreases, and vice versa. For example, imagine you've taken a freshly brewed cup of tea outside on a winter day. The more time you spend outside, the colder your tea will become. In this case, as the time variable increases, the temperature decreases.

Visualizing our data

(Scatter plot visualizing data from the previous slide on water temperature in a kettle over time)

Using a scatterplot is an effective way to show the relationship between two different variables. Here, we used Microsoft Excel to plot the seven points in the table from the previous slide. You can do the same in many other spreadsheet applications. The number of seconds the water is in the kettle is plotted along the horizontal x-axis, and the water temperature is plotted along the vertical y-axis. We can clearly see here that, as the x values increase, so do the y values. This verifies that we have a strong positive correlation.

(scatter plot visualizing the water temperature slide data in a kettle over time with a trend line intercepting the data)

This positive correlation is more clearly seen with the addition of a linear trend line. A trend line is a straight line we draw over the data which gets as close as possible to all of the data points. This can be automatically generated using your choice of software. As shown in the scatter plot, it provides an even clearer visualization, which allows us to see how strongly our variables are correlated. In this example, the line is very obviously trending upwards, which represents a positive correlation. If the line was trending downwards, it would represent a negative correlation.
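For readers curious about what the software does when it draws a trend line, here is a minimal Python sketch. It assumes the trend line is an ordinary least-squares fit, which is a common choice but not something the video specifies, and it uses the seven kettle readings listed in the Example 1 table of this transcript. The function name `trend_line` is illustrative.

```python
def trend_line(xs, ys):
    """Least-squares slope and intercept of the line y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

seconds = [30, 60, 90, 120, 150, 180, 210]     # time in the kettle
temperature = [20, 35, 50, 65, 80, 90, 100]    # water temperature (Celsius)

slope, intercept = trend_line(seconds, temperature)
print(round(slope, 3), round(intercept, 2))
```

The slope is positive (about 0.45 degrees per second), which matches the upward-trending line described above.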

Measuring correlation

For numeric variables, correlation is measured by a correlation coefficient. Where a scatter plot or trend line can help you visualize your data, a correlation coefficient is a measure of the strength of the linear relationship between two variables and is represented by "r". The value of r is always between a minimum of -1 and a maximum of 1. The correlation coefficient, or r, can be calculated easily in Excel by using the PEARSON function. Similar functions are available in many spreadsheet and statistical applications. Use the one you know and trust!

When r is equal to 1, we are saying that two variables have a perfectly positive relationship, meaning, the two variables always increase or decrease together. When r is equal to -1, the variables have a perfectly negative relationship. This would mean that one variable always increases while the other one decreases. Finally, when r is equal to zero, there is no linear relationship between the two variables.

Interpreting the correlation coefficient

(Table containing information on the interpretation of the value of the correlation coefficient. The columns, from left to right, are named as follows: Value of r | Correlation | Direction | Strength. From the first to the last line: 1 | Yes | Positive | Perfect; 0.99 to 0.60 | Yes | Positive | Strong or very strong; 0.59 to 0.20 | Yes | Positive | Low to moderate; 0.19 to -0.19 | No | - | -; -0.20 to -0.59 | Yes | Negative | Low to moderate; -0.60 to -0.99 | Yes | Negative | Strong or very strong; -1 | Yes | Negative | Perfect)

The correlation coefficient, or r, provides information about the existence, direction and strength of a relationship between two variables. In reality, an r value is rarely equal to exactly -1 or 1. This table provides general guidelines about the strength of a relationship between two variables. If an r value is -0.6 or lower, we have a strong negative relationship. Likewise, if its value is 0.6 or higher, we have a strong positive relationship. If an r value is between -0.59 and -0.2, we have a weak negative relationship. Likewise, if its value is between 0.2 and 0.59, we have a weak positive relationship. Finally, if the correlation coefficient is between -0.19 and 0.19, we do not have enough evidence to say that the two variables are correlated.
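The guideline table can be expressed as a small Python function. This is only a sketch: the function name `describe_correlation` is hypothetical, and treating 0.2 and 0.6 as the exact cut-offs (so that a value such as 0.195 counts as "no correlation") is an assumption, since the table lists values to two decimals only.

```python
def describe_correlation(r):
    """Label r using the guideline thresholds from the table above."""
    if not -1 <= r <= 1:
        raise ValueError("r must be between -1 and 1")
    size = abs(r)
    if size < 0.2:
        return "no correlation"
    direction = "positive" if r > 0 else "negative"
    if size == 1:
        strength = "perfect"
    elif size >= 0.6:
        strength = "strong or very strong"
    else:
        strength = "low to moderate"
    return f"{strength} {direction} correlation"

for r in (1, 0.75, 0.3, 0.05, -0.4, -0.9):
    print(r, "->", describe_correlation(r))
```

These labels are general guidelines rather than fixed rules, so the cut-offs can reasonably vary between fields.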

Example 1

(Table containing data on the change in water temperature in a kettle over time. The columns, from left to right, are named as follows: Time in the kettle (seconds) | Water temperature (Celsius). From the first line to the last: 30 sec | 20 C; 60 sec | 35 C; 90 sec | 50 C; 120 sec | 65 C; 150 sec | 80 C; 180 sec | 90 C; 210 sec | 100 C)

Let's go back to our example of water boiling in a kettle. This data table provides the temperature of water in a kettle at seven equally spaced moments in time. After the first 30 seconds, the water is at a temperature of 20 degrees Celsius. At the final moment, the water has reached its boiling point of 100 degrees Celsius. We can show that there is a positive correlation between time and temperature through both the correlation coefficient, r, and data visualization.

Calculating the correlation coefficient

(Table containing the same data as the previous slide)

(Scatter plot with a trend line viewing data from the same table)

(Text: Use Pearson function --> r = 0.997)

As we mentioned earlier, the correlation coefficient, or r, can be calculated easily by using the Pearson function. The values in the first column represent the first variable: number of seconds spent in the kettle. The values in the second column represent the water temperature at each point in time. Here, we see that the r value turns out to be greater than 0.99. Remember, an r value of one would have indicated a perfect positive correlation. This means that our r value indicates a positive correlation that is close to perfect. In other words, for these two variables, there is a strong positive correlation between time and temperature, which is visible on the scatter plot and trend line.
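For readers without a spreadsheet at hand, the same coefficient can be reproduced with a few lines of Python. The sketch below writes out the usual Pearson formula by hand; the function name `pearson_r` is illustrative, and the data are the seven kettle readings from the table.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

seconds = [30, 60, 90, 120, 150, 180, 210]     # time in the kettle
temperature = [20, 35, 50, 65, 80, 90, 100]    # water temperature (Celsius)

print(round(pearson_r(seconds, temperature), 3))  # 0.997
```

The result agrees with the value shown on the slide.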

Example 2

(Scatter plot representing the rate of Cybercrime per 100,000 population as a function of the Growth Rate (%) in 2017-2018. The trend line rises slightly)

In reality, the relationship between two values is unlikely to be as obvious as the link between the amount of time in a kettle and water temperature. Let's look at a real-life example that compares population growth with cybercrime in 2018. What does the scatterplot tell us? First, we see that, as the population growth rate values on the x-axis increase, so do the cybercrime rate values on the y-axis. This implies that we should have a positive correlation. At the same time, we notice that the data points are well spread out. It's hard to draw a straight line through these data points while keeping each data point close to the line. This would lead us to believe that there is not a strong correlation. To be sure, we decide to use software to calculate our correlation coefficient, and we see that r equals 0.3. This signifies a weak positive correlation. Therefore, after visualizing the data and determining the correlation coefficient, we can conclude that in 2018, there was a weak positive correlation between population growth and cybercrime.

Knowledge check

(Scatter plot representing where the data points appear to decrease in value depending on the X-axis)

Let's take a break to test your knowledge about correlation. Take a look at the scatter plot on the right-hand side of the slide. What is it telling us? Is there A) a positive correlation between these two variables? B) A negative correlation? Or C) no correlation at all? The answer is B! This scatterplot is visualizing a strong negative correlation between these two variables.

Next, imagine that you are analyzing three pairs of variables. The correlation coefficients for these three pairs are a) -0.8, b) 0.03 and c) 0.42. Which r value indicates the strongest relationship? The answer is a), r equals -0.8. This indicates a strong negative relationship, because strength is judged by how far r is from zero, regardless of its sign. The weakest of these three options is b), r equals 0.03, which indicates no relationship between the variables.
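The comparison in this knowledge check comes down to comparing absolute values, as the short Python sketch below shows. The labels a, b and c follow the order in which the three coefficients are listed.

```python
# The three coefficients from the knowledge check above.
coefficients = {"a": -0.8, "b": 0.03, "c": 0.42}

# Strength depends on the distance from zero, so compare absolute values.
strongest = max(coefficients, key=lambda label: abs(coefficients[label]))
weakest = min(coefficients, key=lambda label: abs(coefficients[label]))

print(strongest, weakest)  # a b
```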

Correlation =/= Causality

Now, let's move on to causality. In fact, if there is one key message you take away from this video, let it be this: Correlation and causality, though sometimes used incorrectly as interchangeable concepts, are anything but. So far, we've learned that the correlation coefficient tells us how strongly a pair of variables are linearly related and change together. However, it does NOT tell us the reason why or how. Causality does. Causality is when there is a real world explanation for WHY this is logically happening. You may have also heard this referred to as "cause and effect".

Causality

Causality is a relationship between two events, or variables, in which one event or process causes an effect on the other event or process. For example, research tells us that there is a positive correlation between ice cream sales and sunburns. Meaning, as ice cream sales increase, so do instances of sunburns. But this doesn't mean that buying an ice cream cone causes a sunburn now does it? Of course not. Causality adds real world context and meaning to the correlation.

(Series of images showing that ice cream sales and the number of sunburns are correlated but that each is caused by the sun)

Causality refers to a relationship between two events, or variables, which has a valid explanation. Unlike correlation, with causality, this valid explanation turns possibility into actuality. To say something causes an effect on another variable means the result of one event is directly influenced by the other. Either the cause precedes the effect, or the effect changes when the cause changes. For example, dry, hot and sunny weather will cause people to buy more ice cream than in cold weather. Dry, hot and sunny weather will also cause an increase in sunburns when compared to colder, rainy weather. This can make it appear that buying ice cream causes sunburns, but this is just not true. When it comes to hot, sunny weather, ice cream sales and sunburns, all three are correlated, but the only causal relationships in this scenario are between the weather and ice cream sales and the weather and sunburnt people.

Beware the confirmation bias!

Similar to how the human mind loves to see patterns, it also tends to more easily accept evidence that agrees with existing beliefs, rather than that which refutes them. This is called confirmation bias. So, when analyzing your data, it is very important to scrutinize conclusions you like just as rigorously as ones you don't, in order to avoid claiming a causal relationship exists between two things when in fact, it does not.

How to determine a causal relationship

There isn't an easy statistical test for a causal relationship; statistical confirmation of causality typically requires advanced modeling techniques. However, when trying to establish whether causality is present, the chance of causality between your two variables is greater if the following four criteria are met. First, just as with correlation, the two variables must vary together, meaning, a positive or negative correlation coefficient has been shown to exist. Next, that relationship must be plausible. And really, what this is saying is that the relationship needs to make sense. Third, the cause must precede the effect in time. Meaning, the cause must take place first, in order for the effect to occur. And finally, the relationship must not be due to a third variable. A relationship that appears to be between two variables but could also be explained by a third is also referred to as a spurious relationship. We previously saw this in our example referring to increased ice cream sales being correlated with increased instances of sunburns, but really, both increases were the effect of a third variable, the sun.
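The idea of a spurious relationship can be illustrated with a small simulation. Everything in the Python sketch below is invented for illustration: the numbers are randomly generated, not real data, and the variable names and effect sizes are assumptions. Ice cream sales and sunburns are each made to depend only on sunshine, yet they end up strongly correlated with each other. Removing the part of each variable explained by sunshine (a simple straight-line adjustment) makes that correlation largely disappear.

```python
import random
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

def residuals(ys, xs):
    """What is left of ys after removing a least-squares straight-line fit on xs."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return [(y - mean_y) - slope * (x - mean_x) for x, y in zip(xs, ys)]

rng = random.Random(0)  # fixed seed so the run is repeatable

# Hypothetical data: sunshine drives both of the other variables.
sun = [rng.uniform(0, 10) for _ in range(500)]
ice_cream = [3 * s + rng.gauss(0, 2) for s in sun]
sunburns = [2 * s + rng.gauss(0, 2) for s in sun]

raw_r = pearson_r(ice_cream, sunburns)
adjusted_r = pearson_r(residuals(ice_cream, sun), residuals(sunburns, sun))
print(round(raw_r, 2), round(adjusted_r, 2))
```

A check like this can help flag a possible third variable, but it does not by itself establish or rule out causality, which is why the other criteria still matter.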

Knowledge check: Is this relationship causal?

(Scatter plot representing the hours before the person eats again based on the weight of the cake consumed (kg). The trend line is rising)

Now let's take a look at the scatter plot and try to determine whether or not there is a causal relationship between the amount of cake a person eats and how full they feel, which we measure by the amount of time that passes before the person eats again. In this example, we will assume that all respondents are similar except for the amount of cake they have consumed. Think about the four criteria we just went through: do the two variables vary together? Is the relationship plausible? Does the cause precede the effect in time? And is the relationship due to a third variable?

(Text: Yes - r = 0.918; Yes - digestion processes; Yes - cake is eaten first; Not likely - if controlled for other food eaten)

After addressing the four criteria we established to help determine whether the relationship is causal, we have determined that first, the variables do indeed vary together. Yes, there is a plausible relationship. Yes, the cake is eaten first and that's what causes the effect of fullness. And, in this instance, it is unlikely that the feeling of fullness has been caused by a third variable, since we have controlled for all non-cake-based foods.
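Of the four criteria, only the first one (do the variables vary together?) can be checked by a direct calculation. As a minimal sketch, the code below computes the correlation coefficient r for a handful of invented cake measurements. These values are hypothetical and are not the data behind the scatter plot, so the result is not expected to match the r = 0.918 shown on screen.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Hypothetical respondents: weight of cake eaten (kg) and hours until next meal.
cake_kg = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]
hours_until_next_meal = [1.0, 1.5, 2.5, 2.0, 3.5, 4.0]

r = pearson_r(cake_kg, hours_until_next_meal)

# The sign of r gives the direction; its distance from 0 gives the strength.
direction = "positive" if r > 0 else "negative" if r < 0 else "none"
print(f"r = {r:.3f} ({direction} correlation)")
```

The remaining three criteria (plausibility, time order and ruling out a third variable) depend on subject knowledge and on how the data were collected, so they cannot be read off from r alone.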

The importance of knowing the difference

(Scatter plot representing the Grade Point Average (GPA) as a function of the years of music lessons. The trend line appears to be rising)

A common problem occurs when two correlated trends are presented as one phenomenon causing the other. For example, this scatter plot shows a relationship between taking music lessons and achieving a high grade point average, or GPA. The graph seems to indicate that there is a correlation between the years of music lessons and average GPA. But do music lessons directly impact or cause an increase in GPA? Social research shows these high-performing students are also more likely to have grown up in an environment with a large emphasis on education and the resources needed to succeed academically. It is therefore possible that these students would have higher academic achievements with or without music lessons, and that their socio-economic status would actually explain the relationship. So while music lessons and academic achievements are correlated, there are other factors that should prevent us from establishing causality.
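One way to see how socio-economic status could explain the relationship is to compare the correlation across all students with the correlation inside each socio-economic group. The sketch below uses entirely invented data in which socio-economic status influences both years of music lessons and GPA, while lessons have no effect on GPA at all. The two-group split, the group averages and the name r_for are assumptions made for illustration only.

```python
import random
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Hypothetical students: status shifts both variables; lessons never affect GPA.
rng = random.Random(1)
students = []
for _ in range(400):
    high_ses = rng.random() < 0.5
    lessons = max(0.0, rng.gauss(6.0 if high_ses else 2.0, 1.5))  # years
    gpa = min(4.0, rng.gauss(3.4 if high_ses else 2.6, 0.3))      # 4.0 scale
    students.append((high_ses, lessons, gpa))

def r_for(rows):
    """Correlation between years of lessons and GPA for the given students."""
    return pearson_r([row[1] for row in rows], [row[2] for row in rows])

overall_r = r_for(students)
high_r = r_for([s for s in students if s[0]])
low_r = r_for([s for s in students if not s[0]])

print(f"All students:      r = {overall_r:.2f}")
print(f"Higher SES only:   r = {high_r:.2f}")
print(f"Lower SES only:    r = {low_r:.2f}")
```

In this invented data, the correlation is clearly positive across all students but close to zero within each group, which is the pattern we would expect if socio-economic status, rather than music lessons, were driving GPA.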

Recap of key points

Here is a review of the key points we've covered in this video. First, correlation refers to the relationship between two variables. It is important to look for the existence, direction, and strength of the relationship. Then, we learned how to assess the strength and direction of a correlation by calculating the correlation coefficient, r. Data visualization also provides us with a quick way to identify correlations. Next, we explained how causality refers to a relationship between two events or variables, which has a valid explanation. And finally, it is important to remember that correlation does not always imply causation. Even if two variables are strongly correlated, it could just be a coincidence.

(The Canada Wordmark appears.)
