The Open Database of Healthcare Facilities (ODHF)
Metadata document: concepts, methodology and data quality

Version 1.1

Data Exploration and Integration Lab (DEIL)
Centre for Special Business Projects (CSBP)

August 7, 2020

1. Overview

The Open Database of Healthcare Facilities (ODHF) is a Canada-wide database of healthcare facilities compiled by the Centre for Special Business Projects (CSBP) at Statistics Canada. This document describes the methodology used to create the ODHF and pertains to its first update (version 1.1), released in August 2020. The first version of the ODHF was published in April 2020. The main changes in version 1.1 are the addition of 5 new data sources, updates to entries made in collaboration with the data providers, and enhanced deduplication.

The database uses both open data as well as publicly available data (a dataset being designated open depending on whether or not the data are distributed under an open data license). Most of the data are sourced from municipal, regional and provincial/territorial governments, federal agencies, or independent not-for-profit organizations specializing in the health information field. The data have been either web-scraped, downloaded, or obtained directly from the data sources.

The main objective of the ODHF is to disseminate this information by harmonizing and integrating the data assembled from the various sources and, to a limited extent, by adding geolocation information to them.

Version 1.1 of the ODHF contains 7,033 individual records, a reduction of approximately 2,000 records relative to version 1.0. This difference is primarily due to the enhanced deduplication applied in version 1.1 (over 1,600 entries removed). It also reflects the removal of some records at the request of data providers and the replacement of the data source for the province of Québec, which was web-scraped in version 1.0 and replaced with an open data source in version 1.1. The ODHF is provided as a compressed comma-separated values (CSV) file. The database is expected to be updated periodically as new datasets become available or as other improvements are made.

The ODHF is one of several datasets created as part of the Linkable Open Data Environment (LODE), an initiative at CSBP. The LODE is an exploratory initiative that aims to enhance the use and harmonization of open and publicly available data from authoritative sources by providing a collection of datasets released under a single licence. The LODE also provides open-source code to link these datasets together. The LODE datasets and code are available through the Statistics Canada Linkable Open Data Environment website.

2. Target Population

A healthcare facility is a physical site at which the primary activity is the provision of healthcare. Healthcare facilities in Canada that provide healthcare services are in scope for this dataset. Specifically, in terms of the North American Industry Classification System (NAICS), the following industries are in scope:

  • 621 - Ambulatory health care services
  • 622 - Hospitals
  • 623 - Nursing and residential care facilities

Facilities are included when their primary activities relate to healthcare, regardless of the source of funding, private or public status, operator type, location or other attributes not listed here. Furthermore, only one type is assigned to each facility, so a facility that offers multiple types of service is listed under a single type. Alternative medicine (e.g., herbalists) and specialist areas (e.g., chiropractors, dentists, mental health specialists) are not included in the current ODHF version (version 1.1). However, where the sources used contained such out-of-scope facilities, some of them might still be present in the ODHF.

Facilities that are in areas indirectly related to overall healthcare delivery, e.g., pharmacies, social assistance, etc., are also not in scope of the current version of the ODHF.

3. Data Sources

The sources used are detailed in Appendix A for open data sources and in Appendix B for publicly available data sources. The links to the original datasets, licenses or terms of use, attribution statements and additional notes are also included in Appendix A and Appendix B. An additional 5 data sources have been added in the 1.1 update. At the request of some of the data providers, some entries have been updated or removed.

Nearly all data sources used to create this database are publicly available sources, such as municipal governments, provincial/territorial governments and health authorities and agencies, and independent not-for-profit organizations specializing in the health information field. The data were obtained either from open data portals located on websites, through web-scraping, or were provided directly by the source. In most cases, sources were discovered using major search engines or through professional contacts. Sources were sought in all Canadian provinces and territories.

The distinction between open and other publicly available data is based on the licensing terms (explicit or implicit) attached to each source dataset used. Open data licenses permit, in varying degrees, usability for any lawful purpose, redistribution (re-sharing) and modification and re-packaging of the data. However, open data licenses can impose some restrictions, such as attribution of original source, share-alike (re-sharing only with like conditions), and no commercial use. Examples of open data licenses are Creative Commons, MIT, GPLv3, and Canada's Open Government License. In general, no warranty is expressed and there are very minor conditions stipulated by the provider.

Publicly available data that are not open data might be associated with proprietary licensing or terms of use that may restrict some of the aspects that would otherwise be permitted under open data licensing.

For further information on the individual licenses, users should consult the information provided on the data providers' portals.

4. Reference Period and Last Update Dates

In principle, the reference date of the database would represent the date for which all healthcare facilities in existence at that time would be included in the dataset. Ideally, this would be the same date for all datasets used. However, this is not the case and the reference dates vary by provider. In some cases, such detail was not present in the information made available by data providers.

Appendix A and Appendix B provide the date when each source dataset was last updated by the provider (this information was collected at the time the dataset was accessed for this project). As each data source had only one version available, that version was used and is taken to be the most current.

Users are cautioned that the last update date should not be interpreted as the reference date of the data. If specific information concerning the reference period of data is required, users should contact the appropriate data providers shown in Appendix A: Open Data Sources and Appendix B: Other Publicly Available Data Sources.

5. Compilation Methodology

This section provides an overview of the processing done to compile the ODHF.

Data Cleaning

The primary processing component for the database comprised reformatting the source data to CSV format and mapping the original dataset attributes to the variable (column) names defined for this database. A data dictionary of the variables used for this database is provided in section 8 Data Dictionary. To clean the data, the following was done:

  • Address parsing and normalization
    • Concatenated address data were parsed and separated into the respective location variables using libpostal, a state-of-the-art natural language processing solution for address parsing. A small number of addresses were parsed incorrectly and were manually corrected.
    • Data entry formatting (removal of excess whitespace and punctuation), normalization of postal codes and addresses, province/territory names.
    • Some data entries that were filtered out by automated cleaning methods were manually corrected. See section 8 for more details.
  • Removal of duplicates
    • The removal of duplicates is done using fuzzy string matching based on criteria involving the facility name, street name, street number and geo-coordinates. The criteria were derived empirically and with the intent of avoiding false positives.
  • Identification of erroneous entries
    • Identifying erroneous entries was done both programmatically and manually. Data entries that could not be correctly processed by automated techniques were filtered and stored in a separate file and manually corrected later.
  • Selection of record to retain in case of duplicates
    • In some instances, a facility was present in more than one source. In such cases, the record with the most information available was retained. Where information between sources did not match, validation tools were used to decide which to retain.
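
The formatting steps listed above can be sketched in a few lines of Python. This is a minimal illustration, not the code used to build the ODHF: the function names, the partial province lookup and the "A1A 1A1" output format for postal codes are all assumptions made for the example.

```python
import re

# Hypothetical lookup; only a few provinces are shown for brevity.
PROVINCE_CODES = {
    "ontario": "ON",
    "quebec": "QC",
    "québec": "QC",
    "british columbia": "BC",
    "new brunswick": "NB",
}

# Canadian postal codes alternate letter-digit-letter digit-letter-digit.
POSTAL_CODE_RE = re.compile(r"^[A-Z]\d[A-Z]\d[A-Z]\d$")


def clean_text(value):
    """Collapse runs of whitespace and strip stray edge spaces, commas and semicolons."""
    collapsed = re.sub(r"\s+", " ", value)
    return collapsed.strip(" ,;")


def normalize_postal_code(value):
    """Return the postal code as 'A1A 1A1', or None if it does not fit the pattern."""
    compact = re.sub(r"[\s-]", "", value).upper()
    if not POSTAL_CODE_RE.match(compact):
        return None
    return f"{compact[:3]} {compact[3:]}"


def normalize_province(value):
    """Map a province name to a two-letter code; existing codes are upper-cased."""
    cleaned = clean_text(value)
    if len(cleaned) == 2:
        return cleaned.upper()
    return PROVINCE_CODES.get(cleaned.lower())
```

Values that fail the postal code pattern come back as None rather than being guessed, which mirrors the idea of setting problem entries aside for manual correction.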

For the 1.1 update, a more rigorous deduplication process was carried out to remove duplicates that existed in the first release. This process used the Python Record Linkage Toolkit package to perform string comparisons on the various columns of the database, and the scikit-learn package to perform a machine learning classification that identifies potential duplicate records. Entries without enough information to be classified in this way were processed by treating as potential duplicates all record pairs in the same province with facility name comparison scores above a certain threshold. All potential duplicates identified using this approach were then manually verified before removal. For the purpose of this database, the unit of analysis is a healthcare facility rather than any particular service. Therefore, where one facility (such as a hospital complex) contains multiple individual services, these are reduced to a single entry. As a result of this process, over 1,600 duplicates were removed.
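
The fallback rule described above (same province, facility name score above a threshold) can be illustrated with a short sketch. It uses the standard library's difflib in place of the Python Record Linkage Toolkit and the scikit-learn classifier, and the 0.85 threshold, the function names and the record layout are assumptions for illustration only. The second helper mirrors the rule of retaining the record with the most information.

```python
from difflib import SequenceMatcher
from itertools import combinations

NAME_THRESHOLD = 0.85  # hypothetical cut-off, not the value used for the ODHF


def name_similarity(a, b):
    """Case-insensitive similarity score between two names, from 0.0 to 1.0."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def candidate_duplicates(records, threshold=NAME_THRESHOLD):
    """Return (index, index, score) for same-province pairs with similar names."""
    pairs = []
    for left, right in combinations(records, 2):
        if left["province"] != right["province"]:
            continue  # only compare records within the same province
        score = name_similarity(left["facility_name"], right["facility_name"])
        if score >= threshold:
            pairs.append((left["index"], right["index"], round(score, 2)))
    return pairs


def most_complete(left, right):
    """Of two duplicate records, keep the one with more non-empty fields."""
    def filled(record):
        return sum(1 for value in record.values() if value not in (None, ""))
    return left if filled(left) >= filled(right) else right
```

The pairs returned are only candidates. As in the process described above, each would still need manual verification before any record is dropped.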

During validation, changes may have been applied to facility names and addresses when deemed appropriate. This may cause occasional discrepancies between the street number and street name columns and the original source address column.

For more details on the software used to process the data, please refer to CSBP's GitHub page.

Determination of Healthcare Facility Types

The original data sources use a variety of standards, classifications and nomenclatures to describe the type of healthcare facility. Unfortunately, there is no classification for healthcare facilities in Canada that is used universally. Health authorities classify their facilities independently using different classification systems. The following classification of healthcare facilities is currently used for the database:

  • Ambulatory health care services: Establishments primarily engaged in providing health care services, directly or indirectly, to ambulatory patients. (Example: medical clinic, mental health center.)
  • Hospitals: Establishments, licensed as hospitals, primarily engaged in providing diagnostic and medical treatment services, and specialized accommodation services to in-patients. (Example: emergency department, general hospital.)
  • Nursing and residential care facilities: Establishments primarily engaged in providing residential care combined with either nursing, supervisory or other types of care as required by the residents. (Example: nursing home.)

The classification is intended to have broad categories that are helpful in distinguishing major types of facilities and yet enable accuracy in mapping source-specific facility types. Facility types are determined from source-specific facility types (e.g., cancer treatment centers are classified as 'Hospitals') and source coverage metadata information. Assignments are done using keywords and validated afterwards, with changes made manually whenever needed. When classifying facilities based on source metadata information, this was done analytically on a case by case basis.

Table 1 illustrates the use of keywords to assign type categories to the healthcare facilities based on the classification used for the ODHF.

Table 1 Healthcare facility type assignment criteria examples (based on keywords)
Variable | Condition | Value | Classification
Facility type | contains the keywords | 'community health center', 'clinic' | Ambulatory health care services
Facility type | contains the keywords | 'hospital', 'cancer treatment', 'emergency', 'cancer centre', 'health centre' | Hospitals
Facility type | contains the keywords | 'senior active living', 'nursing home', 'long-term care' | Nursing and residential care facilities
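
The keyword rules in Table 1 can be expressed as a small lookup, sketched below. The keywords are transcribed from the table, but the function name, the case-insensitive substring matching and the rule order (first match wins) are assumptions. The actual assignments were also validated and corrected manually.

```python
# Keyword rules transcribed from Table 1. The order of the rules is an assumption.
TYPE_RULES = [
    ("Ambulatory health care services",
     ("community health center", "clinic")),
    ("Hospitals",
     ("hospital", "cancer treatment", "emergency", "cancer centre", "health centre")),
    ("Nursing and residential care facilities",
     ("senior active living", "nursing home", "long-term care")),
]


def classify_facility_type(source_facility_type):
    """Return the ODHF facility type for a source type, or None if no keyword matches."""
    text = source_facility_type.lower()
    for odhf_type, keywords in TYPE_RULES:
        if any(keyword in text for keyword in keywords):
            return odhf_type
    return None  # no keyword matched: left for manual classification
```

Because the first matching rule wins, a source type such as "Hospital Outpatient Clinic" is assigned to ambulatory services in this sketch. Ambiguous cases like this are where manual validation matters.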

Geocoding and Determination of Census Subdivision (CSD or Municipality)

Geocoding was carried out for some sources that provide address data but no geo-coordinates. Latitude and longitude were determined and validated using online tools. A subset of the source-provided geo-coordinates was also validated online. Some source-provided coordinates were removed when it was determined that they were derived from postal codes or other aggregate geographic areas rather than from street addresses.

Note: While efforts have been made to ensure the accuracy of geo-coordinates, no guarantees are implied, and errors and inaccuracies are possible.

The census subdivision (CSD), or municipality, was derived either from the geographic coordinates, by linking to the CSD polygons through a spatial join operation using the Python package GeoPandas, or from the city name available in the record's address field, using GeoSuite.
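
At its core, a spatial join of points to polygons is a point-in-polygon test. The sketch below shows that idea with a dependency-free ray-casting check rather than GeoPandas. The function names and the polygons passed to them are hypothetical, and real CSD boundaries are far more complex (multi-part shapes, holes, projected coordinates).

```python
def point_in_polygon(lon, lat, polygon):
    """Ray-casting test; polygon is a list of (lon, lat) vertices."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does this edge straddle the horizontal line through the point?
        if (y1 > lat) != (y2 > lat):
            # Longitude at which the edge crosses that horizontal line.
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < x_cross:
                inside = not inside  # each crossing to the east flips the state
    return inside


def assign_csd(lon, lat, csd_polygons):
    """Return the name of the first polygon containing the point, or None."""
    for csd_name, polygon in csd_polygons.items():
        if point_in_polygon(lon, lat, polygon):
            return csd_name
    return None
```

A point that falls in no polygon returns None, which in practice would be a prompt to check the coordinates or fall back to the city name.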

6. Database Coverage

The current version of the ODHF (version 1.1) contains 7,033 healthcare facilities.

As the total number of all healthcare facilities in the country is not known with a reasonable degree of certainty, the coverage obtained with the sources used was not quantitatively assessed. However, many of the sources purport to list all institutions of a certain type (e.g. acute care hospital, residential care) within a jurisdiction. Thus, within these institution type categories and jurisdictions, coverage would be expected to be fairly complete. However, if facilities of a certain category were omitted in a source, e.g., outpatient medical clinics, then these might be missing from the database, unless they were obtained from a different source.

7. Data Quality

The accuracy and completeness of the information is in general a function of the source datasets used. Except as noted, the underlying datasets are taken "as is".

Classifying facilities
Assignment of facility type was largely based on facility types provided by source datasets. In instances where facility type was either unclear or not defined by the source, facility type was classified based on further research.
Duplicates
Some datasets provide data where the rows do not represent unique facilities. Although deduplication techniques are used, it is expected that there are some duplicates remaining.
Address parsing
Natural language processing methods were used to parse address strings and separate them into address variables, such as postal code and street number. The methods are known for state-of-the-art performance and accuracy but, as with all statistical learning methods, they have limitations. Poor or unconventional formatting of addresses might result in incorrect parsing. Upon manual review of the database, no incorrect parses were identified. At this stage, address records in the database are expected to be correctly parsed.
Geo-coordinates
Some facilities that did not have geo-coordinates were geocoded using OpenStreetMap's Nominatim API. The accuracy of the geocoding was manually validated by using proprietary mapping services available on the internet. In some cases, facility coordinates were also manually determined from online map services.
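
As a rough illustration of the geocoding step, the sketch below builds a query URL for the public Nominatim search endpoint and applies a coarse sanity check to the coordinates returned. It does not send any request. The function names, the way the address is assembled and the bounding box values are assumptions, and anyone querying the public service should follow its usage policy.

```python
from urllib.parse import urlencode

NOMINATIM_SEARCH_URL = "https://nominatim.openstreetmap.org/search"


def build_geocode_url(street_no, street_name, city, province):
    """Build a Nominatim free-text search URL for a Canadian street address."""
    query = f"{street_no} {street_name}, {city}, {province}, Canada"
    params = {"q": query, "format": "jsonv2", "countrycodes": "ca", "limit": 1}
    return f"{NOMINATIM_SEARCH_URL}?{urlencode(params)}"


def plausible_canadian_coordinates(latitude, longitude):
    """Coarse check against an approximate bounding box for Canada (an assumption)."""
    return 41.0 <= latitude <= 84.0 and -142.0 <= longitude <= -52.0
```

The bounding box only catches gross errors, such as swapped latitude and longitude. It does not replace the manual validation described above.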

8. Data Dictionary

This data dictionary describes the variables contained within the ODHF. The database is provided in a CSV format. Each facility is listed per row and its attributes provided in columns. The corresponding column variables are described in the data dictionary below.

Healthcare Facility Variables

Variable – Index

Name
index
Format
Alphanumeric
Source
Assigned serially
Description
Unique serial number for each facility. Supplemental entries to version 1.1 are identified by the prefix "S" followed by an assigned serial number.

Variable – Facility Name

Name
facility_name
Format
String
Source
Provided as is from original data
Description
Healthcare facility name

Variable – Source Facility Type

Name
source_facility_type
Format
String
Source
Provided as is from original data
Description
Healthcare facility type as assigned by the regional health authority

Variable – ODHF Facility Type

Name
odhf_facility_type
Format
String
Source
Imputed from source data or metadata
Description
Value determined using the ODHF classification criteria (see section 5)

Variable – Provider

Name
provider
Format
String
Source
Assigned based on the provider's identity
Description
The identity or name of the data provider

Location Variables

Variable – Unit Number

Name
unit
Format
String
Source
Parsed from a full address string or provided as is
Description
Civic unit or suite number

Variable – Street Number

Name
street_no
Format
String
Source
Parsed from a full address string or provided as is
Description
Civic street number

Variable – Street Name

Name
street_name
Format
String
Source
Parsed from a full address string or provided as is
Description
Civic street name (type and direction)

Variable – Postal Code

Name
postal_code
Format
String
Source
Parsed from a full address string or provided as is
Description
Civic postal code

Variable – City

Name
city
Format
String
Source
Parsed from a full address string or provided as is
Description
City name

Variable – Province/Territory

Name
province
Format
String
Source
Converted to two letter codes after parsing from a full address string, or provided as is, or indicated by the provider
Description
Province or territory name

Variable – Source-Format Street Address

Name
source_format_str_address
Format
String
Source
Street address from the data source provided as is
Description
Street address in the source data

Variable – CSD Name

Name
CSDname
Format
String
Source
Imputed from geographic coordinates and city names
Description
Census subdivision name

Variable – CSD Unique Identifier

Name
CSDuid
Format
Integer
Source
Imputed from CSD name using GeoSuite 2016
Description
Census subdivision unique identifier

Variable – Province or Territory Unique Identifier

Name
PRuid
Format
Integer
Source
Imputed from CSD unique identifier by taking the first two digits
Description
Province or territory unique identifier

Variable – Latitude

Name
latitude
Format
Float
Source
Provided as is from original data or corrected value if source value found inaccurate during validation
Description
Latitude

Variable – Longitude

Name
longitude
Format
Float
Source
Provided as is from original data or corrected value if source value found inaccurate during validation
Description
Longitude
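
The sketch below shows one way to read a file with these variables using Python's standard csv module. The column order, the two sample rows and the use of empty strings for missing values are assumptions made for illustration, and the helper names are hypothetical. The last function mirrors the stated derivation of PRuid from the first two digits of CSDuid.

```python
import csv
import io

# Made-up rows for illustration only. Column order follows the data dictionary.
SAMPLE_CSV = """index,facility_name,source_facility_type,odhf_facility_type,provider,unit,street_no,street_name,postal_code,city,province,source_format_str_address,CSDname,CSDuid,PRuid,latitude,longitude
1,Example General Hospital,Hospital,Hospitals,Example Provider,,100,Example Street,K1A 0B1,Ottawa,ON,100 Example Street,Ottawa,3506008,35,45.42,-75.70
S2,Example Walk-in Clinic,Clinic,Ambulatory health care services,Example Provider,,,,,,ON,,,,,,
"""


def to_int(value):
    """Convert a CSV field to int, treating empty or missing values as None."""
    value = (value or "").strip()
    return int(value) if value else None


def to_float(value):
    """Convert a CSV field to float, treating empty or missing values as None."""
    value = (value or "").strip()
    return float(value) if value else None


def read_facilities(handle):
    """Read facility rows into dictionaries with typed numeric columns."""
    facilities = []
    for row in csv.DictReader(handle):
        row["CSDuid"] = to_int(row["CSDuid"])
        row["PRuid"] = to_int(row["PRuid"])
        row["latitude"] = to_float(row["latitude"])
        row["longitude"] = to_float(row["longitude"])
        facilities.append(row)
    return facilities


def pruid_from_csduid(csduid):
    """Derive the province or territory identifier from the first two CSD digits."""
    return int(str(csduid)[:2])


facilities = read_facilities(io.StringIO(SAMPLE_CSV))
```

The index column is kept as a string because supplemental entries carry an "S" prefix.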

9. Contact Us

Statistics Canada's open data projects follow a model of ongoing improvement. To provide information on additions, updates, corrections or omissions, or for more information, please contact us at statcan.lode-ecdo.statcan@statcan.gc.ca. Please include the title of the open database in the subject line of the email.

Appendix A: Open Data Sources

Open Data Sources
Data provider | Province / territory | Link | License / Terms of Use | Last updated by provider | Description | New source for ODHF v1.1
British Columbia (Province) | British Columbia | British Columbia - Data Catalogue - Emergency Rooms in BC | Open Government Licence - British Columbia | 12/24/2019 | Emergency services in British Columbia | No
British Columbia (Province) | British Columbia | British Columbia - Data Catalogue - Hospitals in BC | Open Government Licence - British Columbia | 12/25/2019 | Hospitals in British Columbia | No
British Columbia (Province) | British Columbia | British Columbia - Data Catalogue - Residential Care Facilities | Open Government Licence - British Columbia | 12/26/2019 | Residential care in British Columbia | No
British Columbia (Province) | British Columbia | British Columbia - Data Catalogue - Walk-in Clinics in BC | Open Government Licence - British Columbia | 12/27/2019 | Walk-ins in British Columbia | No
Moncton (Municipality) | New Brunswick | City of Moncton - Senior Care Facilities | City of Moncton - Open Data Terms of Use | 3/19/2010 | Senior care facilities within the Greater Moncton area | Yes
Moncton (Municipality) | New Brunswick | City of Moncton - Medical Clinics | City of Moncton - Open Data Terms of Use | 3/19/2010 | Medical clinics in the Greater Moncton area | Yes
New Brunswick (Province) | New Brunswick | Digital New Brunswick - Map of Licensed Nursing Homes | Open Government Licence - New Brunswick | 07/16/2019 | Licensed nursing homes in New Brunswick | Yes
Nova Scotia (Province) | Nova Scotia | Open Data Nova Scotia - Hospitals | Nova Scotia Open Government Licence | 2/15/2019 | Hospitals in Nova Scotia | No
Prince Edward Island (Province) | Prince Edward Island | PEI Health Facilities | PEI Health Facilities | 4/17/2020 | Healthcare facilities in Prince Edward Island | Yes
Prince Edward Island (Province) | Prince Edward Island | Open Data Prince Edward - Health PEI Facility Locations | Open Government Licence - Prince Edward Island | 8/8/2019 | Healthcare facilities in Prince Edward Island | No
Québec City, Québec (Municipality) | Québec | Données Québec - Ville de Québec - Lieux publics | Creative Commons - Attribution 4.0 International (CC BY 4.0) | 2/24/2020 | Hospitals in Québec City, Québec | No
Québec (Province) | Québec | Santé et des Services sociaux Québec - Fichier cartographique des installations - M02 | Données Québec - Licence Creative Commons (CC BY) | 5/20/2020 | Healthcare and social services facilities in the province of Québec | Yes
Gatineau, Québec (Municipality) | Québec | Données Québec - Ville de Gatineau - Lieux publics | Creative Commons - Attribution 4.0 International (CC BY 4.0) | 2/25/2019 | Hospitals in Gatineau, Québec | No
Nova Scotia (Province) | Nova Scotia | Open Data Nova Scotia - Long Term Care and Residential Care Facilities | Nova Scotia Open Government Licence | 2/15/2019 | Residential care in Nova Scotia | No
Ontario (Province) | Ontario | Ontario GeoHub - Ministry of Health Service Provider Locations (via: Ontario Data catalogue - Hospital locations) | Open Government Licence - Ontario | 10/15/2019 | Healthcare facilities in Ontario | No
Horizon Regional Health Authority (New Brunswick) | New Brunswick | Digital New Brunswick - Hospitals in New Brunswick Operated by Horizon Health Network | Open Government Licence - New Brunswick | 3/18/2020 | Hospitals in New Brunswick operated by Horizon | No
Vitalité Regional Health Authority (New Brunswick) | New Brunswick | Digital New Brunswick - Hospitals in New Brunswick Operated by Vitalité Health Network | Open Government Licence - New Brunswick | 3/18/2020 | Hospitals in New Brunswick operated by Vitalité | No
Alberta (Province) | Alberta | Alberta Open Government - Hospital services in Alberta | Open Government Licence - Alberta | 7/1/2018 | Hospitals and healthcare facilities in Alberta | No
Manitoba (Province) | Manitoba | Manitoba Government - Rural Health Care Facilities in Manitoba | (Waived) | 6/30/2017 | Healthcare facilities in Manitoba | No

Appendix B: Other Publicly Available Data Sources or Sources of Directly-Provided Data

Other Publicly Available Data Sources or Sources of Directly-Provided Data
Data provider | Province / territory | Link | License / Terms of Use | Last updated by provider | Description
Canadian Institute for Health Information | Canada | Provided directly via email | (Waived) | not available | Healthcare facilities in Canada
Manitoba (Province) | Manitoba | Manitoba Government - Health Services Wait Time Information - Map of Facilities | Manitoba Government - Copyright (Waived) | not available | Hospitals in Manitoba
Manitoba - Winnipeg Regional Health Authority | Manitoba | Winnipeg Regional Health Authority - Location and Services | Winnipeg Regional Health Authority - Terms of Use and Privacy Statement | not available | Locations of facilities managed by the Winnipeg Regional Health Authority
Manitoba - Interlake-Eastern Regional Health Authority | Manitoba | Interlake-Eastern Regional Health Authority - Hospital Locations | N/A | not available | Locations of facilities managed by the Interlake-Eastern Regional Health Authority
Manitoba - Northern Health Region | Manitoba | Northern Health Region | N/A | not available | Locations of facilities managed by the Northern Health Region
Manitoba - Prairie Mountain Health | Manitoba | Prairie Mountain Health - Locations Map | Prairie Mountain Health - Legal Notice and Disclaimer | not available | Locations of facilities managed by the Prairie Mountain Health Authority
Manitoba - Southern Health Region | Manitoba | Southern Health - Finding Care | Southern Health - Disclaimers - Terms and Conditions | not available | Locations of facilities managed by the Southern Health Authority
Nunavut (Territory) | Nunavut | The Government of Nunavut - Qikiqtani General Hospital | N/A | not available | Single hospital in Nunavut
Public Health Agency of Canada | Canada | Provided directly via email | (Waived) | not available | Hospitals in Canada
Newfoundland and Labrador (Province) | Newfoundland and Labrador | Government of Newfoundland and Labrador - Services in Your Region | Government of Newfoundland and Labrador - Disclaimer / Copyright / Privacy Statement | not available | Healthcare facilities in Newfoundland and Labrador
Northwest Territories (Territory) | Northwest Territories | Government of Northwest Territories - Hospitals and Health Centres | Government of Northwest Territories - Terms of use (Waived) | not available | Healthcare facilities in Northwest Territories
Manitoba (Province) | Manitoba | Interlake-Eastern Regional Health Authority | N/A | not available | Healthcare facilities in Manitoba
Yukon (Territory) | Yukon | Provided directly to CSBP via email | (Waived) | not available | Healthcare facilities in Yukon
Saskatchewan (Province) | Saskatchewan | Saskatchewan Health Authority - Locating Facility and Service Information | N/A | not available | Healthcare facilities in Saskatchewan