Health Reports
Positional accuracy of geocoding from residential postal codes versus full street addresses

by Saeeda Khan, Lauren Pinault, Michael Tjepkema and Russell Wilkins

Release date: February 21, 2018

Geocoding based on a full street address is a highly accurate way of assigning geographic coordinates to an individual’s residential location. However, address information is usually not available. As well, for many types of population health research, such precision is not necessary.

The positional accuracy required for a study depends on its context. To assign characteristics such as neighbourhood socioeconomic variables, or to obtain census denominator data for a geographically defined at-risk population, reasonable accuracy is usually sufficient. By contrast, for some environmental health research, such as assigning walkability scores to urban residents, highly accurate spatial resolution may be required, since error of even a few hundred metres can result in substantial exposure misclassification.^{Note 1}^{Note 2}^{Note 3}^{Note 4}^{Note 5}

In Canada, postal codes are often the only geographic identifier available for assigning contextual or environmental information to survey or administrative data. The Postal Code Conversion File (PCCF)^{Note 6} and the Postal Code Conversion File Plus (PCCF+)^{Note 7} provide the means to geocode datasets using only postal codes. To date, hundreds of studies have been published using these tools.^{Note 8}

The primary function of PCCF tools is to assign the full hierarchy of census geography (dissemination block, dissemination area, census tract, census metropolitan area or census agglomeration, census subdivision, census division, and province).^{Note 9} A secondary function is assignment of latitude and longitude coordinates based on blockface, dissemination block, or dissemination area centroids.

This study compares the positional accuracy of geocoding using postal codes versus roof-top centroids derived from full street addresses (“reference locations”). The analysis is based on self-reported address data from the 2011 Census of Population. All reported postal codes were processed through PCCF+ Version 6C to obtain three variables that influence the accuracy of geocoding from postal codes: delivery mode type (DMT), representative point type, and community size. Positional accuracy is measured by the distance by which the geocoded position differs from the latitude/longitude of the reference location.

Delivery mode type (DMT)

DMT is a feature of postal codes that indicates the mode of mail delivery. DMTs can be grouped into three major categories: urban (A, B, E, G, M), rural (W), and mixed (H, J, K, T, X). Urban DMTs may be residential (A or B) or commercial and institutional (E, G, and M). Retired postal codes (Z) were excluded from the study, as were postal codes with a DMT of J (general delivery at an urban post office) or X (mobile routes in industrial areas), as they were rare in the census.

Urban DMTs. DMT A is for ordinary urban households with letter carrier delivery to the household or to a community mailbox. DMT B is for large urban apartment buildings; DMT E, for urban businesses; DMT G, for large urban commercial establishments and institutions with letter carrier delivery; and DMT M, for large urban commercial establishments and institutions that retrieve their mail at a lock box or bag at an urban post office. Most urban postal codes are linked to blockface or dissemination block representative points (centroids), usually with a single link, but sometimes with up to four.

Rural DMTs. Postal codes with “0” in the second position (DMT W) designate rural post offices, which may serve a large area encompassing multiple dissemination areas and census subdivisions (townships or villages), and all types of users, including apartment buildings, businesses, and institutions. DMT W also covers service to remote areas, including service via air stage offices,^{Note 10} which were excluded from the sample. Most rural postal codes are linked to a large number of representative points (centroids of multiple dissemination areas).

Mixed DMTs. DMT H, K, and T are mixed (partly urban, partly rural). DMT H and T are for rural and suburban routes beginning at urban post offices but extending into rural areas. DMT K (small post office boxes) may serve urban or rural clients, but at an urban post office. Most mixed mode postal codes are linked to multiple dissemination area centroids.
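The grouping of delivery mode types described in this section can be expressed as a simple lookup. The following is a minimal Python sketch, not part of PCCF+: the names `DMT_CATEGORY`, `EXCLUDED_DMTS` and `dmt_category` are hypothetical, and the letter-to-category mapping and the exclusions (Z, J and X) are taken from the text above.

```python
from typing import Optional

# Hypothetical lookup (not part of PCCF+): DMT letter -> broad category,
# following the urban / rural / mixed grouping described in the text.
DMT_CATEGORY = {
    "A": "urban", "B": "urban", "E": "urban", "G": "urban", "M": "urban",
    "W": "rural",
    "H": "mixed", "J": "mixed", "K": "mixed", "T": "mixed", "X": "mixed",
}

# DMTs excluded from the study: retired codes (Z), general delivery (J)
# and mobile routes (X).
EXCLUDED_DMTS = {"Z", "J", "X"}


def dmt_category(dmt: str) -> Optional[str]:
    """Return 'urban', 'rural' or 'mixed'; None for excluded or unknown DMTs."""
    code = dmt.strip().upper()
    if code in EXCLUDED_DMTS:
        return None
    return DMT_CATEGORY.get(code)
```

A record with DMT B would be classed as urban, one with DMT W as rural, and one with DMT J would be dropped.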

Representative point type

Representative point type refers to the source of the latitude and longitude coordinates assigned―in order of decreasing precision: blockface, dissemination block, dissemination area, or census subdivision. Census subdivision representative points are assigned as an interim measure pending further postal code address information, and occur rarely; postal codes with that representative point type were excluded.

Community size

Community size indicates the 2011 Census population: 1,500,000 or more (Toronto, Montreal and Vancouver census metropolitan areas—CMAs); 500,000 to 1,499,999 (Ottawa-Gatineau, Edmonton, Calgary, Quebec, Winnipeg and Hamilton CMAs); 100,000 to 499,999 (remaining 18 CMAs and 7 largest census agglomerations—CAs); 10,000 to 99,999 (remaining CAs); and less than 10,000 (rural and small-town Canada—all areas except CMAs or CAs).

Sample selection

The analytical sample (n = 1,004) was randomly selected from the census dataset based on retaining a target of 100 observations for each DMT, representative point type, and community size category (Table 1). An effort was made to sample equally from each region (Atlantic, Ontario, Quebec, Prairies, British Columbia, and northern territories). When possible, the same observations were re-used across DMT, representative point type, and community size categories.

Table 1
Sample size and population represented, by delivery mode type, representative point type, and community size, Canada, 2011

The study intentionally oversampled from rural and mixed mode postal codes, from less accurate dissemination area representative point type, and from the smallest census agglomerations and rural and small-town Canada (fewer than 10,000 population).

Geocoding from full street addresses

To determine reference locations, all street addresses in the sample were manually searched using Google Maps with satellite imagery. The latitude and longitude for the centroid of the building location were obtained. When possible, the building address was visually confirmed with Google Street View.

Addresses not located during manual geocoding because of poor satellite imagery or a combination of lack of street address and no Street View were excluded, and new addresses were resampled (n = 121). Records that required resampling tended to be in rural and remote areas or on Indian reserves.

If possible, reported postal codes were verified on the Canada Post website.^{Note 11} Except for mixed mode postal codes (DMT H or T), if the first three characters did not match the expected postal code, it was presumed to be erroneous and was excluded from the analytical dataset.
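The exclusion rule in the preceding paragraph can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the function names `fsa` and `keep_record` are hypothetical, `expected` stands for whatever postal code the lookup returned for the street address, and the exemption is limited to DMT H and T as stated in that paragraph.

```python
def fsa(postal_code: str) -> str:
    """Return the first three characters of a postal code, ignoring spaces and case."""
    return postal_code.replace(" ", "").upper()[:3]


def keep_record(reported: str, expected: str, dmt: str) -> bool:
    """Hypothetical filter: keep a record unless its first three characters mismatch.

    reported -- postal code as reported on the census
    expected -- postal code looked up for the full street address
    dmt      -- delivery mode type of the reported postal code
    """
    if dmt.strip().upper() in {"H", "T"}:
        # Mixed mode routes are exempt from the check.
        return True
    return fsa(reported) == fsa(expected)
```

For example, a reported code of "K2A 0B1" against an expected "K1A 0B1" would be excluded for DMT A but retained for DMT H.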

All addresses in the sample were matched with all possible postal code representative points from PCCF+. Most addresses had multiple possible representative points. Of 105 records with very long distances (more than 50 km) between the PCCF+ representative point and the address location from Google Maps, 51 were excluded because the postal code or street address appeared to contain typographical errors.

Distance calculations

For each record in the sample, all possible representative points were extracted from PCCF+ 6C, and each point retained its weight. Population-based weights applied to rural postal codes (DMT W), mixed postal codes (DMT H, K, or T), urban postal codes (DMT A, B, E, G, or M) with matches to four or more dissemination areas, and incompletely enumerated Indian reserves. Postal codes with a match to a single representative point were assigned to that point with a weight of 1. Urban postal codes with matches to multiple representative points were assigned equal weights. This weighting strategy replicates the reference location assignment process in PCCF+.
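The weighting rules above can be sketched in a few lines of Python. This is an illustration only, not the PCCF+ implementation: the function `assign_weights` and its arguments are hypothetical, and it assumes that population-based weights, where they exist, are supplied by the caller as non-negative numbers.

```python
from typing import List, Optional, Sequence


def assign_weights(
    n_points: int,
    population_weights: Optional[Sequence[float]] = None,
) -> List[float]:
    """Return one probability weight per candidate representative point.

    Without population weights, every point gets an equal share, so a
    single match receives a weight of 1. With population weights, the
    values are rescaled so that they sum to 1.
    """
    if n_points < 1:
        raise ValueError("at least one representative point is required")
    if population_weights is None:
        return [1.0 / n_points] * n_points
    if len(population_weights) != n_points:
        raise ValueError("one population weight per point is required")
    total = sum(population_weights)
    if total <= 0:
        raise ValueError("population weights must sum to a positive value")
    return [w / total for w in population_weights]
```

A unique match yields `[1.0]`, an urban postal code with four candidate points yields four weights of 0.25, and population counts of 70, 20 and 10 yield weights of roughly 0.7, 0.2 and 0.1.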

The manually assigned reference locations based on full street addresses and the representative points generated by PCCF+ were mapped in Geographic Information Systems (ArcGIS v.10, ESRI 2010). The Euclidean (straight-line) distance between each reference location and each possible representative point was measured.
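For readers without access to GIS software, the distance between two latitude/longitude pairs can be approximated directly. The sketch below uses the haversine great-circle formula, which is a substitute for, not a reproduction of, the Euclidean measurement made in ArcGIS; over the distances reported in this study the two should differ only slightly. The function name `haversine_km` and the Earth radius constant are assumptions of this example.

```python
import math

# Mean Earth radius in kilometres; an assumption of this sketch.
EARTH_RADIUS_KM = 6371.0088


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in km between two points given in decimal degrees."""
    phi1 = math.radians(lat1)
    phi2 = math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = (
        math.sin(d_phi / 2) ** 2
        + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2
    )
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))
```

Calling `haversine_km` once per candidate representative point gives the set of distances for a record.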

A weighted mean distance was calculated between the reference location and the PCCF+ representative points, using all the possible representative points for each postal code. For unique matches, only one representative point was possible (Figure 1A), but for urban records with multiple possible matches, all representative points had an equal probability of being assigned (Figure 1B). Figure 1C illustrates a rural case with three possible PCCF+ representative points for the postal code, with probabilities of assigning one of the three of 70%, 20%, and 10%. Those probability weights were used to calculate a weighted mean distance between the reference location and the three PCCF+ representative points. Using this process, one mean distance was calculated for each record.
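The weighted mean described above can be written compactly. In this minimal sketch the function name `weighted_mean_distance` is hypothetical; the 70%, 20% and 10% weights come from the rural example in the text, whereas the distances in the usage note are invented purely for illustration.

```python
from typing import Sequence


def weighted_mean_distance(
    distances_km: Sequence[float], weights: Sequence[float]
) -> float:
    """Weighted mean of the distances from one reference location to each
    candidate representative point, using the points' probability weights."""
    if not distances_km or len(distances_km) != len(weights):
        raise ValueError("need one weight per distance, and at least one pair")
    total_weight = sum(weights)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    weighted_sum = sum(d * w for d, w in zip(distances_km, weights))
    return weighted_sum / total_weight
```

With illustrative distances of 2, 5 and 10 km and weights of 0.7, 0.2 and 0.1, the weighted mean works out to 0.7 × 2 + 0.2 × 5 + 0.1 × 10 = 3.4 km. For a unique match, the function simply returns the single distance.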

Figure 1
Sample calculations for mean distance between respondent house location (from satellite imagery) and postal code representative points (from PCCF+)

Based on the calculated distance (or mean distance) between the reference location and the location geocoded from the postal code, measures of error, such as median distance and interquartile range (IQR), were calculated by DMT, representative point type, and community size. In addition, the percentages of the sample geocoded to within various distances (0.5, 1, 5, and 10 km) from the reference location were calculated.
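These summary measures are straightforward to compute from a list of per-record distances. The sketch below is illustrative: the function `summarise` is hypothetical, "within" is interpreted as less than or equal to the threshold, and the quartiles use Python's `statistics.quantiles` with the "inclusive" method, which may differ slightly from the method used by the authors' software.

```python
import statistics
from typing import Dict, Sequence


def summarise(
    distances_km: Sequence[float],
    thresholds_km: Sequence[float] = (0.5, 1, 5, 10),
) -> Dict[str, object]:
    """Median, quartiles and percentage of records within each threshold.

    Requires at least two distances (a limit of statistics.quantiles).
    """
    q1, median, q3 = statistics.quantiles(distances_km, n=4, method="inclusive")
    n = len(distances_km)
    pct_within = {
        t: 100.0 * sum(1 for d in distances_km if d <= t) / n
        for t in thresholds_km
    }
    return {"median": median, "q1": q1, "q3": q3, "pct_within": pct_within}
```

Applying `summarise` separately to the records in each DMT, representative point type, or community size group would give one row of summary measures per category.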

Of the 4,401 matches generated after merging with the postal code representative points, outliers with improbably long distances (more than 50 km) were reviewed manually (178 matches, representing 105 records). A large portion were deemed plausible, as they were in suburban or rural areas, and were geocoded using population-weighted allocation (79 matches representing 54 sample records). They were retained because residents of some remote areas may travel long distances to retrieve their mail. Furthermore, the low population weight prevented outliers from skewing the mean distance calculation. Three geographic matches in the north (two in Dawson City; one in Yellowknife) that represented extremely long distances (up to 1,068 km) were deemed improbable and removed. The remaining long distances were typographical errors in census-reported postal codes or errors in geocoding the street address and were removed (99 matches, representing 51 records). The final study sample was n = 1,004 addresses.

Results

Table 2 shows the median (and IQR) distance between the reference locations and the population-weighted mean PCCF+ representative points, by DMT, representative point type, and community size. Urban postal codes had greater accuracy and lower variability than did rural and mixed postal codes. The best accuracy was for large apartment buildings (DMT B) with a median distance of 110 m (IQR 60 m to 200 m), followed by ordinary households (DMT A) with a median distance of 160 m (IQR 80 m to 320 m). Commercial and institutional services (DMT E, G, or M) had relatively good accuracy, with a median distance of 210 m, but more variability (IQR 80 m to 1.25 km). The greatest median distances were for mixed mode services (rural services from urban post offices, DMT H, K, or T): 5.87 km (IQR 3.81 km to 8.75 km) and rural services (DMT W): 5.20 km (IQR 2.93 km to 9.24 km).

Table 2
Median distance and interquartile range between location geocoded from full street address and location geocoded from postal code, by delivery mode type, representative point type, and community size, Canada, 2011

The finer the scale of the data source for representative point types, the better the accuracy. The median distance for blockfaces was 130 m (IQR 60 m to 230 m), compared with 330 m (IQR 160 m to 750 m) for dissemination blocks and 5.53 km (IQR 2.89 km to 8.66 km) for dissemination areas, 95% of which had mixed and rural postal codes (data not shown).

Accuracy generally improved with increasing community size. Rural and small-town areas had the longest median distance: 5.60 km (IQR 2.42 km to 9.48 km). Median distances for other community size groups varied from 160 m (IQR 80 m to 830 m) in the largest urban size group to 330 m (IQR 130 m to 3.93 km) in the smallest.

Table 3 shows the percentage of the study sample geocoded by PCCF+ to within various distances from the full street address. Of the sample with urban residential postal codes, 87% of ordinary households (DMT A), 92% of large apartment buildings (DMT B), and 66% of businesses or institutions (DMT E, G, M) were geocoded to within 500 m of their full street address. By contrast, only 47% with rural postal codes (DMT W) and 39% with mixed urban and rural postal codes (DMT H, K, T) were geocoded to within 5 km of their residence.

Table 3
Percentage geocoded from postal code to within various distances of location geocoded from full street address, by delivery mode type, representative point type, and community size, Canada, 2011

By representative point type, 91% of records that geocoded to a blockface and 65% that geocoded to a dissemination block were within 500 m of their full street address. Only 46% of records with dissemination area representative points were geocoded to within 5 km of their residence.

The percentages of the sample geocoded to within 500 m of their full street address were 56% for communities of 10,000 to 99,999 population and more than 70% for communities of at least 500,000. Just 46% of the sample in rural areas and small towns were geocoded to within 5 km of their residence.

Discussion

This study examines the positional accuracy of geocoding based on PCCF+ compared with geocoding from full street addresses.

The results support the literature on postal code geocoding—urban areas have much less error than rural areas.^{Note 12}^{Note 13} The analysis further demonstrates the importance of DMT to geocoding accuracy. In addition to previously documented problems geocoding rural regions,^{Note 12}^{Note 14} this study reveals high levels of error associated with mixed DMTs. Such DMTs include rural and suburban routes from urban post offices and general delivery and post office box services at urban post offices, which serve many rural residents.

Even in urban areas it is important to distinguish between ordinary residential and commercial or institutional delivery. Large commercial and institutional establishments with more than one physical location (campuses, branches, outlets) may have all their mail addressed to a single postal code (DMT M, with lock box or bag pick up at a post office) and make deliveries from that location to the final destinations. It is also possible that some of the reported census postal codes with a DMT of E, G, or M pertained to place of work rather than residence. Either problem may have contributed to the greater distances observed for urban business and institutional postal codes (DMT E, G, M), compared with urban residential postal codes (DMT A, B).

Community size and representative point type influenced positional accuracy, largely because of their relationship to the mode of mail delivery (DMT). Most residents of rural and small-town Canada use rural or mixed mode postal codes, and in PCCF+, such postal codes are almost exclusively linked to dissemination area centroids.

Despite the risk of positional inaccuracy and population misclassification associated with various DMT categories, removal of a specific DMT or community size group might introduce selection bias.

The positional accuracy required for a study depends on the spatial resolution of the environmental or contextual measures of interest.^{Note 2}^{Note 3} A mismatch between the measure and the positional accuracy of geocoding from postal codes alone can result in misclassification. A comparison of 1996-Census-assigned enumeration area income quintiles with those assigned by an older version of PCCF+ showed that misclassification ranged from 3% to 10% in urban areas, but from 39% to 50% in rural areas.^{Note 14} In urban areas, misclassification tended to result in assignment to the next-higher or -lower income quintile; in rural areas, 15% to 20% of misclassification was off by two or more quintiles.

The trend in air pollution studies is toward finer-scale data. For instance, satellite-derived ambient fine particulate matter (PM_2.5) estimates were developed at 10-km²-resolution,^{Note 15} but have been refined to 1-km²-resolution.^{Note 16} Finer spatial resolution is possible in urban population centres, but may be problematic in rural areas, where the accuracy of geocoding from postal codes is much reduced.

The demand for urban form data (for example, walkability, density, urban sprawl, and greenness) has increased—for instance, in studies of the impact of the built environment on physical activity, overweight, and obesity.^{Note 17}^{Note 18}^{Note 19}^{Note 20} Some of these impacts may be examined at coarser scales, such as census metropolitan area or dissemination area,^{Note 18}^{Note 21} but others rely on environmental data specific to a residential block.^{Note 20} The degree of spatial resolution in urban form studies typically varies between 500 m (a 5-minute walk) and 1 km. Based on the present analysis, such studies should exclude rural (DMT W) and mixed (DMT H, J, K or T) delivery modes, but could include urban residential delivery modes (DMT A and B) in smaller communities.

Most Canadians have a single postal code that corresponds to their residence. Others, particularly in rural areas, may have an additional postal code corresponding to a post office box or rural route from an urban post office. Postal codes that correspond to post office boxes and rural routes from urban post offices are common in population-based files; however, geocoding from such “mixed” mode postal codes has the same level of accuracy as geocoding from rural postal codes.

Some postal codes with a DMT of E, G, or M may correspond to legitimate places of residence (for instance, nursing homes, university residences, prisons, apartments in a mainly business building), but the majority are commercial or institutional establishments. Their inclusion in administrative records probably reflects reporting place of work rather than residence. PCCF+ flags such records to draw attention to the possible non-residential nature of those postal codes.

Strengths and limitations

A strength of this study was the ability to randomly sample address data from the Census of Population, the most complete survey of residential postal codes in Canada.^{Note 22} This made it possible to compile an analytical file that included different DMTs, representative point types, and community sizes across the country. As well, the study benefitted from the availability of Google mapping software to help manually geocode addresses using satellite photography and confirm the addresses using Google Street View. Ortho-rectified aerial photography and satellite imagery are among the most accurate methods for geocoding residences.^{Note 1}^{Note 3}^{Note 23}

Two types of error arise when using self-reported address data, particularly postal codes, for spatial analysis: reporting errors and geocoding or positional/geographic assignment errors. Although this study focused on the latter, attempts were made to mitigate census reporting errors, particularly in the postal code. Errors in the first three characters (forward sortation area) generate larger spatial errors than do those in the last three characters (local delivery unit). An evaluation of the impact of address error on public health surveillance in Montreal found that of the 10% of records in the dataset with errors in the address, almost 80% were the result of inaccurate postal codes; 20% contained errors in the forward sortation area, 60% contained errors in the local delivery unit, and 20% contained errors in both.^{Note 24} For the present study, when possible, postal codes were verified on the Canada Post website^{Note 11}; census postal codes (other than for DMT H, K or T) whose forward sortation area did not match the forward sortation area generated from the full address were excluded.

This study’s reference location used Google Maps satellite imagery to assign latitude and longitude to roof-top centroids based on full street addresses. However, a degree of error is associated with this method.^{Note 25}^{Note 26} Furthermore, the quality of the satellite images varied and may have contributed to error in assigning the home centroid. This was mitigated by excluding all addresses for which a centroid could not be selected for the house or a small cluster of adjacent buildings (for example, farmhouse and adjacent barns). The resampling that resulted from poor-quality satellite imagery and lack of Google Street View may have introduced bias, because these areas are more remote and disproportionately impact certain groups, such as residents of the far North or Indian Reserves.

As indicated by the wide IQR for DMT A (80 m to 320 m), urban DMTs can, themselves, be heterogeneous, probably reflecting the density of dwellings and period of construction (post-World War II suburban development versus pre-war downtowns). Further analysis could explore the implications of such characteristics on the accuracy of geocoding from postal codes.

Conclusion

Based on address information from a random sample of census respondents, this study assesses positional accuracy of geocoding from postal codes versus full street addresses. Positional accuracy was related to mode of delivery (urban, rural, or mixed), source of latitude and longitude information (blockface, dissemination block, or dissemination area centroids), and community size. Differences across community size groups and representative point types were related to mode of mail delivery.

The results highlight the impact of delivery mode type on positional accuracy. Both rural and mixed (partly urban, partly rural) postal codes had much higher geocoding error than did urban postal codes. These findings demonstrate the importance of understanding how the positional accuracy of geocoding from postal codes differs depending on the nature of the postal codes. Because such differences can substantially alter the effect of environmental and contextual variables, first stratifying by delivery mode type or community size is recommended. The spatial resolution required by the environmental or contextual measures of interest can help analysts to identify subpopulations that should be excluded from a study.

References

Footnote 1.

Mazumdar S, Rushton G, Smith BJ, et al. Geocoding accuracy and the recovery of relationships between environmental exposures and health. International Journal of Health Geographics 2008; 7: 13.

Footnote 2.

DeLuca PF, Kanaroglou PS. Effects of alternative point pattern geocoding procedures on first and second order statistical measures. Spatial Science 2008; 53(1): 131–41.

Footnote 3.

Bonner MR, Han D, Nie J, et al. Positional accuracy of geocoded addresses in epidemiologic research. Epidemiology 2003; 14(4): 408–12.

Footnote 4.

Burra T, Jerrett M, Burnett RT, Anderson M. Conceptual and practical issues in the detection of local disease clusters: a study of mortality in Hamilton, Ontario. The Canadian Geographer 2002; 46(2): 160–71.

Footnote 5.

Guernsey JR, Dewar R, Weerasinghe S, et al. Incidence of cancer in Sydney and Cape Breton County, Nova Scotia 1979–1997. Canadian Journal of Public Health 2000; 91(4): 285–92.

Footnote 6.

Statistics Canada. Postal Code^{OM} Conversion File (PCCF), Reference Guide, 2016 (Catalogue 92-154-G) Ottawa: Statistics Canada, 2016.

Footnote 7.

Statistics Canada. Postal Code^{OM} Conversion File Plus (PCCF+) Version 6C, Reference Guide (Catalogue 82-E0086-XDB) Ottawa: Minister of Industry, 2016.

Footnote 8.

Peller P. An Analysis of the Postal Code Conversion File’s Use in Research. Calgary, Alberta: University of Calgary, 2011.

Footnote 9.

Statistics Canada. Census Dictionary―Census Year, 2011 (Catalogue 98-301-X2011001). Available at: http://www12.statcan.gc.ca/census-recensement/2011/ref/dict/index-eng.cfm

Footnote 10.

Canada Post Corporation. Air stage offices list. Available at: https://www.canadapost.ca/tools/pg/prices/RCRZ-e-AIR.pdf. Accessed July 11, 2016.

Footnote 11.

Canada Post Corporation. Postal Code Lookup Tool. Available at: www.canadapost.ca/cpo/mc/personal/postalcode/fpc.jsf

Footnote 12.

Healy MA, Gilliland JA. Quantifying the magnitude of environmental exposure misclassification when using imprecise address proxies in public health research. Spatial and Spatio-temporal Epidemiology 2012; 3: 55–67.

Footnote 13.

Ng E, Wilkins R, Perras A. How far is it to the nearest hospital? Calculating distances using the Statistics Canada Postal Code Conversion File. Health Reports 1993; 5(2): 179–88.

Footnote 14.

Wilkins R. Neighbourhood income quintiles derived from Canadian postal codes are apt to be misclassified in rural but not urban areas. Ottawa: Statistics Canada, 2004. Available at: file=r3jc96paper.pdf [available online at www.ResearchGate.net]

Footnote 15.

Crouse DL, Peters PA, Hystad P, et al. Ambient PM2.5, O3, and NO2 exposures and associations with mortality over 16 years of follow-up in the Canadian Census Health and Environment Cohort (CanCHEC). Environmental Health Perspectives 2015; 123(11): 1180–6.

Footnote 16.

Pinault L, Tjepkema M, Crouse DL, et al. Risk estimates of mortality attributed to low concentrations of ambient fine particulate matter in the Canadian Community Health Survey cohort. Environmental Health 2016; 15(18): 1–15.

Footnote 17.

Pouliou T, Elliott SJ. Individual and socio-environmental determinants of overweight and obesity in urban Canada. Health & Place 2010; 16: 389–98.

Footnote 18.

Seliske L, Pickett W, Janssen I. Urban sprawl and its relationship with active transportation, physical activity and obesity in Canadian youth. Health Reports 2012; 23(2): 17–25.

Footnote 19.

de Sa E, Ardern CI. Associations between the built environment, total, recreation, and transit-related physical activity. BMC Public Health 2014; 14(693): 1–8.

Footnote 20.

Polsky JY, Moineddin R, Dunn JR, et al. Absolute and relative densities of fast-food versus other restaurants in relation to weight status: Does restaurant mix matter? Preventive Medicine 2016; 82: 28–34.

Footnote 21.

Winter M, Barnes R, Venners S, et al. Older adults’ outdoor walking and the built environment: Does income matter? BMC Public Health 2015; 15(876): 1–8.

Footnote 22.

Mechanda K, Puderer H. How Postal Codes Map to Geographic Areas. Geography Working Paper Series (Statistics Canada Catalogue 92F0138MIE – no. 001) Ottawa: Minister of Industry, 2007.

Footnote 23.

Bow CJD, Waters NM, Faris PD, et al. Accuracy of city postal code coordinates as a proxy for location of residence. International Journal of Health Geographics 2004; 3: 1–9.

Footnote 24.

Zinszer K, Jauvin C, Verma A, et al. Residential address error in public health surveillance data: A description and analysis of the impact on geocoding. Spatial and Spatio-temporal Epidemiology 2010; 1: 163–8.

Footnote 25.

Ubukawa T. An Evaluation of the Horizontal Positional Accuracy of Google and Bing Satellite Imagery and Three Road Data Sets Based on High Resolution Satellite Imagery. New York, New York: Centre for International Earth Science Information Network, The Earth Institute at Columbia University, 2013: 1–16.

Footnote 26.

Potere D. Horizontal positional accuracy of Google Earth’s high resolution imagery archive. Sensors 2009; 8: 7973–81.

Date modified: 2018-02-21
