The Open Database of Recreational and Sport Facilities (ODRSF)
Metadata document: concepts, methodology and data quality

Version 1.0

Data Exploration and Integration Lab (DEIL)
Centre for Special Business Projects (CSBP)

Release date: September 28, 2021

Table of Contents

  1. Overview
  2. Data Sources
  3. Reference Period
  4. Target Population
  5. Compilation Methodology
  6. Database coverage
  7. Data Dictionary
  8. Data Accuracy
  9. Contact us

1. Overview

The Open Database of Recreational and Sport Facilities (ODRSF) is a database of recreational and sport facilities released as open data. Data sources include various levels of government within CanadaFootnote 1 and professional organizations. This document details the process of collecting, compiling, and standardizing the individual datasets used to create the ODRSF.

This dataset is one of a number of datasets created as part of the Linkable Open Data Environment (LODE). The LODE is an exploratory initiative that aims at enhancing the use and harmonization of open data from authoritative sources by providing a collection of datasets released under a single licence, as well as open-source code to link these datasets together. Access to the LODE datasets and code are available through the Statistics Canada website and can be found at the Linkable Open Data Environment.

The ODRSF is made available under the Open Government Licence – Canada. In its current version (Version 1.0), the ODRSF contains approximately 182,000 individual records. The database is expected to be updated periodically as new open datasets become available. The ODRSF is provided as a compressed comma separated values (CSV) file.

2. Data Sources

A total of 452 data sources were used to create the ODRSF. The sources used are detailed in a 'Data Sources' CSV file provided together with the data file on the ODRSF webpage. The links to the original datasets, licenses or terms of use, attribution statements and additional notes are also included in the 'Data Sources' CSV file. All recreational and sport facility data in the ODRSF were collected from government data sources, either from open data portals or from publicly-available web pages.

The distinction between open and other publicly available data is based on the licensing terms attached to each source dataset used. Open data licenses permit, in varying degrees, usability for any lawful purpose, redistribution (re-sharing) and modification and re-packaging of the data. However, open data licenses can impose some restrictions, such as attribution of original source, share-alike (re-sharing only with like conditions), and no commercial use. In general, no warranty is expressed and there are minor conditions stipulated by the provider.

Publicly available data, that are not associated with an open data license, are generally provided with terms of use that may restrict some of the aspects that would otherwise be permitted under open data licensing.

For further information on the individual licences or terms of use, users should consult the information provided on the open data portals or web sites of the various data providers, as reported in the Data Sources CSV file.

3. Reference Period

The Data Sources CSV reports, when this is known, either the update frequency or the date each underlying dataset was last updated by the provider (this information is collected at the time the dataset was accessed for this project). Additionally, the Data Sources CSV provides the date that each dataset used in the ODRSF was downloaded or provided by the organization that is the source of the data. Data were gathered in 2020 and 2021. Users are cautioned that the download date should not be used as an indication of the reference date of the data.

4. Target Population

For the purposes of the ODRSF database, recreational and sport facilities are facilities wherein the primary activity is concerned with either recreation or sport. The target population includes brick and mortar recreational and sport facilities that offer programs or services to the general public as well as those such as trails for hiking or skiing, sports fields, and other types of facilities that may be located outside of brick and mortar structures.

In terms of the North American Industry Classification System (NAICS), the facilities in the ODRSF are primarily in the following sub-sectors:

  • 7112 – Spectator sports
  • 7131 – Amusement parks and arcades
  • 7139 – Other amusement and recreation industries

Facilities are included when their primary activities are related to recreation or sports, regardless of the source of funding, private or public status, operator type, location or other attributes. However, facilities that are not open to the general public are not included. It should be noted that the focus of the ODRSF is on the facility (point of service). This may or may not correspond to a business entity, as some facilities such as trails or beaches may not be associated with any business entity while others, for example a multi-sport complex, may be related to a number of discrete entities.

5. Compilation Methodology

Data Standardization and Cleaning

The first processing component for compiling the ODRSF database was comprised of reformatting the source data to CSV format and mapping the original dataset attributes to standard variable (field) names. This was done using a version of the custom OpenTabulate software developed by the LODE team. A data dictionary of the variables used is provided in section 7. The methodology and limitations of the techniques used in each step used in the data cleaning process are described below.

Address Parsing

Natural language processing methods were used for parsing and separation of address strings into address variables, such as street number and postal code (which is removed from the final released database). The methods are reputable in the field for performance and accuracy, but as with all statistical learning methods, they have limitations as well. Poor or unconventional formatting of addresses may result in incorrect parsing. At this stage, no further integration with other address sources was attempted; hence, although address records are generally expected to be correct, residual errors may be present in the current version of the database.

When address information was available, addresses were parsed using the same methodology applied to other LODE databases such as the Open Database of Education Facilities and the Open Database of Cultural and Arts Facilities. The libpostal address parser, an open source natural language processing solution to parsing addresses, was used to split concatenated address strings into strings corresponding to address variables, such as street name and street number. Occasionally, addresses were split incorrectly due to unconventional formatting of the original address.Footnote 2

For instance, a limited number of entries were manually edited when it was clear that the parsing had not been done correctly. An example is addresses with hyphenated numbers such as "1035-55 street nw", which may have been interpreted as having a civic number of "1035-55" and a street name of "street nw", rather than a civic number of 1035, and a street name of "55 street nw". While effort was made to ensure that the results are correct, it is possible that the scripts used to process and parse the addresses may unintentionally cause other, undetected, errors.

Removal of Duplicates

As data were sourced from entities with geographically overlapping jurisdictions (e.g., a province, municipality and a private sector organization), the same record can appear in more than one source dataset. The removal of duplicates was done using both literal and fuzzy string matching on the facility name and street name, conditioned on the street number and province; by "conditioned," it is meant that a fuzzy comparison between two facilities is made provided that the street numbers and provinces agree. The fuzzy comparison is done using Levenshtein distances calculated through the Python package FuzzyWuzzy,Footnote 3 which returns a similarity score between 0 and 100 for two strings where a score of 100 indicates that the shorter string is a sub-string of the larger string. An entry is marked as a duplicate when that score meets a given threshold of similarity.

If two entries contained identical street number and province information, then their street names and facility names were compared. When these were nearly identical (defined as having the sum of the similarity scores for the facility names and street names to be at least 195 out of a possible 200), then the entries were marked as duplicates. Recognized duplicates were deleted without manual intervention. The chosen threshold was selected close to the maximum score, which minimized any removal of false positives. When duplicates were found, whichever record contained more non-empty fields was retained. In total, 5,937 duplicates were removed.

Although deduplication techniques are used, not all duplicates might have been removed. Modifying the deduplication methods to seek out the remaining duplicates would generate numerous false positives, which would require additional manual intervention.

Identification of Invalid Entries and Other Data Cleaning Steps

Identifying erroneous entries was done both programmatically and manually. Data entries that could not be correctly processed by automated techniques were filtered and stored in a separate file and manually corrected later. Data entries were formatted through the removal of excess whitespace and punctuation, standardization of fields such as postal code, and province/territory names.

Classification and Assignment of Recreational and Sport Facility Type

The original data sources use a variety of standards, classifications and nomenclature to describe the various types of recreational and sport facility. With no classification for recreational and sport facilities that is broadly adopted and recognized in Canada, one of the main challenges in the implementation of the ODRSF was the harmonization of records into comparable groups. Assignment of facility type was largely based on facility types provided by source datasets. In instances where facility type was either unclear or not defined by the source, facility type was classified based on further research or using meta-information, such as name of dataset.

The following classification of recreational and sport facilities is used for Version 1.0 of the ODRSF. While most of the class names are self-explanatory, further clarifications are provided below. In addition, and where available, the facility types as provided in the data sources (e.g., outdoor pool, tennis court, sports field, etc.) are also included in the ODRSF without any modification, reassignment, or mapping to a uniform classification.

  • trails: urban and rural trails or pathways for walking, hiking, or biking.
  • sports fields: fields on which sports can be played.
  • arenas: facilities where sports and/or recreational activities take place.
  • athletic parks: recreation areas focused on athletic activity.
  • beaches: waterfront beach areas.
  • casinos: casino or gambling facilities.
  • community centres: community centres and leisure facilities.
  • gyms: both public and private gym facilities.
  • marinas: marina facilities.
  • parks: parks and greenspaces, including both city and national parks.
  • playgrounds: play spaces which are distinct from parks in that they have specifically been classified as such by the publisher of the data. Often includes playground equipment.
  • pools: indoor and outdoor swimming pools.
  • race tracks: tracks for racing.
  • rinks: most commonly ice rinks.
  • skate parks: parks used for skateboarding.
  • splash pads: urban areas for water play.
  • stadiums: facilities where sports and/or recreational activities take place.
  • miscellaneous: facilities that do not fall into any of the above categories.

The classification is intended to have broad categories that are helpful in distinguishing major types of facilities and yet enable accuracy in mapping source-specific facility types. Facility types are determined from source-specific facility types and source coverage metadata information. Assignments are made using keywords and validated afterwards, with changes made manually whenever needed. When classifying facilities based on source metadata information, this was done analytically on a case by case basis.

The sports field classification category combines multiple types of sports fields such as baseball fields, soccer fields, and others. Where available, the detailed information on the type of sports field is preserved in the Source Facility Type variable.

Geocoding and Determination of Census Subdivision

In general, the data included in the ODRSF are what is available from the original sources without imputation. The exception to this is the geocoding and the imputation of CSD names and categories, discussed below.

Census subdivision (CSD)Footnote 4 names were derived from two different attributes in the data. The first attribute comprises the geographic coordinates, namely latitude and longitude. These are placed into the corresponding CSDs by linking the coordinate points to the CSD polygons through a spatial join operation using the Python package GeoPandas.Footnote 5

The second attribute is the city name, where literal string matching was done with each recreational and sport facility municipality name and a list of CSD names.Footnote 6

Geocoding was carried out for some sources that provide address data but no geo-coordinates. Latitude and longitude were determined and validated using tools on the internet. A subset of the source-provided geo-coordinates were also validated using the internet.

6. Database Coverage

The current version of the ODRSF (Version 1.0) database as provided contains approximately 182,000 recreational and sport facilities.

As the total number of all recreational and sport facilities in the country is not known with a reasonable degree of certainty, the coverage obtained with the sources used was not able to be thoroughly quantitatively assessed. Looking at the individual category of golf courses, however, shows that the ODRSF contains 592 golf courses, approximately 25% of all the 2,182 golf courses estimated to be in Canada.Footnote 7 Likewise, there were 1,303 rinks and arenas located in the ODRSF, approximately 60% of the 2,183 rinksFootnote 8 and arenas estimated to be in Canada. The distribution of the latter category showed similar coverage trends across geographies, with 82% to 87% of arenas and rinks respectively being located in Ontario and the Prairie provinces compared to an estimated two-thirds for these types of facilities overall.

From the above results, it is clear that the ODSRF is not a comprehensive listing of facilities within Canada. This is to be expected since not all jurisdictions publish data on recreational and sport facilities or categorize them in the same ways. The exception to this is when sources do purport to list all facilities of a certain type within a jurisdiction so for those particular facility type categories and jurisdictions; for those sources, coverage would be expected to be fairly complete. However, if facilities of a certain category were omitted by a source, then those facilities might be missing from the database unless they were obtained from a different source.

7. Data Dictionary

Recreation and sport facilities varables

Variable – Index

Name
Index
Format
String
Source
Internally generated during data processing
Description
Unique number automatically generated during data processing

Variable – Facility Name

Name
Facility_Name
Format
String
Source
Provided as is from original data
Description
Recreational or sport facility name

Variable – Source Facility Type

Name
Source_Facility_Type
Format
String
Source
Provided as is from original data
Description
Facility type chosen by data provider

Variable – ODRSF Facility Type

Name
ODRSF_Facility_Type
Format
String
Source
Generated from source data or metadata
Description
Facility type assigned from ODRSF categories

Location Variables

Variable – Unit Number

Name
Unit
Format
String
Source
Parsed from a full address string or provided as is
Description
Civic unit or suite number

Variable – Street Number

Name
Street_No
Format
String
Source
Parsed from a full address string or provided as is
Description
Civic street number

Variable – Street Name

Name
Street_Name
Format
String
Source
Parsed from a full address string or provided as is
Description
Civic street name

Variable – Street Type

Name
Street_Type
Format
String
Source
Parsed from a full address string or provided as is
Description
Civic street type

Variable – Street Direction

Name
Street_Direction
Format
String
Source
Parsed from a full address string or provided as is
Description
Civic street direction

Variable – Postal Code  

Name
Postal_Code
Format
String
Source
Parsed from a full address string or provided as is
Description
Postal code

Variable – City  

Name
City
Format
String
Source
Parsed from a full address string or provided as is
Description
City or municipality name (certain records may list the neighbourhood name)

Variable – Province/Territory

Name
Prov_Terr
Format
String
Source
Converted to two letter codes (internationally approved) after parsing from a full address string, or provided as is, or indicated by providers
Description
Province or territory name

Variable – Province Unique Identifier

Name
PRUID
Format
Integer
Source
Converted from province code
Description
Province unique identifier

Variable – CSD Name

Name
CSD_Name
Format
String
Source
Imputed from geographic coordinates and city names using GeoSuite 2016
Description
Census subdivision name

Variable – CSD Unique Identifier

Name
CSDUID
Format
Integer
Source
Imputed from either geographic coordinates or CSD name using GeoSuite 2016
Description
Census subdivision unique identifier

Variable – Longitude

Name
Longitude
Format
Float
Source
Provided as is from original data or added by geolocation
Description
Longitude

Variable – Latitude

Name
Latitude
Format
Float
Source
Provided as is from original data or added by geolocation
Description
Latitude

Variable – Data Provider

Name
Provider
Format
String
Source
Created based on origins of input dataset
Description
Name of the entity that provided the dataset

8. Data Accuracy

All addresses were collected from authoritative government sources, made available to the public as open data. In general, other than the processing required to harmonize the different sources into one database, the underlying datasets obtained from the various open data portals were taken "as-is".

During the processing stage to create the ODRSF, several steps were taken to standardize the output, including the standardization of street types and a deduplication of entries. It is possible that the process used to standardize the addresses may have introduced some errors, but these are expected to be minimal. Likewise, it is possible that duplicate entries remain in the database. To control for possible processing inaccuracies, the full address column is also provided without standardization applied.

9. Contact Us

The LODE open databases are modelled on ongoing improvement. To provide information on additions, updates, corrections or omissions, or for more information, please contact us at statcan.lode-ecdo.statcan@statcan.gc.ca. Please include the title of the open database in the subject line of the email.

Date modified: