Health Reports
Positional accuracy of geocoding from residential postal codes versus full street addresses

by Saeeda Khan, Lauren Pinault, Michael Tjepkema and Russell Wilkins

Release date: February 21, 2018

Geocoding based on a full street address is a highly accurate way of assigning geographic coordinates to an individual’s residential location. However, address information is usually not available. As well, for many types of population health research, such precision is not necessary.

The positional accuracy required for a study depends on its context. To assign characteristics such as neighbourhood socioeconomic variables, or to obtain census denominator data for a geographically defined at-risk population, reasonable accuracy is usually sufficient. By contrast, for some environmental health research, such as assigning walkability scores to urban residents, highly accurate spatial resolution may be required, since error of even a few hundred metres can result in substantial exposure misclassification.Note 1Note 2Note 3Note 4Note 5

In Canada, postal codes are often the only geographic identifier available for assigning contextual or environmental information to survey or administrative data. The Postal Code Conversion File (PCCF)Note 6 and the Postal Code Conversion File Plus (PCCF+)Note 7 provide the means to geocode datasets using only postal codes. To date, hundreds of studies have been published using these tools.Note 8

The primary function of PCCF tools is to assign the full hierarchy of census geography (dissemination block, dissemination area, census tract, census metropolitan area or census agglomeration, census subdivision, census division, and province).Note 9 A secondary function is assignment of latitude and longitude coordinates based on blockface, dissemination block, or dissemination area centroids.

This study compares the positional accuracy of geocoding using postal codes versus roof-top centroids derived from full street addresses (“reference locations”). The analysis is based on self-reported address data from the 2011 Census of Population. All reported postal codes were processed through PCCF+ Version 6C to obtain three variables that influence the accuracy of geocoding from postal codes: delivery mode type (DMT), representative point type, and community size. Positional accuracy is measured by the distance by which the geocoded position differs from the latitude/longitude of the reference location.

Delivery mode type (DMT)

DMT is a feature of postal codes that indicates the mode of mail delivery. DMTs can be grouped into three major categories: urban (A, B, E, G, M), rural (W), and mixed (H, J, K, T, X). Urban DMTs may be residential (A or B) or commercial and institutional (E, G, and M). Retired postal codes (Z) were excluded from the study, as were postal codes with a DMT of J (general delivery at an urban post office) or X (mobile routes in industrial areas), as they were rare in the census.

Urban DMTs. DMT A is for ordinary urban households with letter carrier delivery to the household or to a community mailbox. DMT B is for large urban apartment buildings; DMT E, for urban businesses; DMT G, for large urban commercial establishments and institutions with letter carrier delivery; and DMT M, for large urban commercial establishments and institutions that retrieve their mail at a lock box or bag at an urban post office. Most urban postal codes are linked to blockface or dissemination block representative points (centroids), usually with a single link, but sometimes with up to four.

Rural DMTs. Postal codes with “0” in the second position (DMT W) designate rural post offices, which may serve a large area encompassing multiple dissemination areas and census subdivisions (townships or villages), and all types of users, including apartment buildings, businesses, and institutions. DMT W also covers service to remote areas, including service via air stage offices,Note 10 which were excluded from the sample. Most rural postal codes are linked to a large number of representative points (centroids of multiple dissemination areas).

Mixed DMTs. DMT H, K, and T are mixed (partly urban, partly rural). DMT H and T are for rural and suburban routes beginning at urban post offices but extending into rural areas. DMT of K (small post office boxes) may serve urban or rural clients, but at an urban post office. Most mixed mode postal codes are linked to multiple dissemination area centroids.
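
The grouping of DMT codes described above can be sketched as a small lookup. This is only an illustration in Python, not part of PCCF+; the function names (`dmt_category`, `urban_subtype`, `has_rural_second_character`) are hypothetical, and the postal codes in the usage note are illustrative strings, not records from the study.

```python
from typing import Optional

# DMT groupings as listed in the text: urban (A, B, E, G, M),
# rural (W), and mixed (H, J, K, T, X).
DMT_CATEGORY = {
    **dict.fromkeys("ABEGM", "urban"),
    "W": "rural",
    **dict.fromkeys("HJKTX", "mixed"),
}

# Codes the study excluded: retired postal codes (Z) and the rare J and X.
EXCLUDED_DMT = {"Z", "J", "X"}


def dmt_category(dmt: str) -> Optional[str]:
    """Return 'urban', 'rural' or 'mixed'; None if excluded or unknown."""
    code = dmt.strip().upper()
    if code in EXCLUDED_DMT:
        return None
    return DMT_CATEGORY.get(code)


def urban_subtype(dmt: str) -> Optional[str]:
    """Split urban DMTs into residential (A, B) and commercial/institutional (E, G, M)."""
    code = dmt.strip().upper()
    if code in {"A", "B"}:
        return "residential"
    if code in {"E", "G", "M"}:
        return "commercial or institutional"
    return None


def has_rural_second_character(postal_code: str) -> bool:
    """True when the second character of the postal code is '0' (DMT W)."""
    pc = postal_code.replace(" ", "").upper()
    return len(pc) == 6 and pc[1] == "0"
```

For example, `dmt_category("H")` returns `"mixed"`, and `has_rural_second_character("K0A 1A0")` returns `True` because the second character is a zero.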

Representative point type

Representative point type refers to the source of the latitude and longitude coordinates assigned. In order of decreasing precision, the sources are blockface, dissemination block, dissemination area, or census subdivision. Census subdivision representative points are assigned as an interim measure pending further postal code address information, and occur rarely; postal codes with that representative point type were excluded.

Community size

Community size indicates the 2011 Census population: 1,500,000 or more (Toronto, Montreal and Vancouver census metropolitan areas—CMAs); 500,000 to 1,499,999 (Ottawa-Gatineau, Edmonton, Calgary, Quebec, Winnipeg and Hamilton CMAs); 100,000 to 499,999 (remaining 18 CMAs and 7 largest census agglomerations—CAs); 10,000 to 99,999 (remaining CAs); and less than 10,000 (rural and small-town Canada—all areas except CMAs or CAs).
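
The five community size groups can be expressed as a threshold function. The sketch below assumes the input is the 2011 Census population of the CMA or CA containing the record, with `None` standing for areas outside any CMA or CA; the function name and the label strings are hypothetical.

```python
from typing import Optional


def community_size_group(cma_ca_population: Optional[int]) -> str:
    """Map a CMA/CA population to the community size groups in the text.

    None means the record lies outside any CMA or CA
    (rural and small-town Canada).
    """
    if cma_ca_population is None or cma_ca_population < 10_000:
        return "less than 10,000"
    if cma_ca_population < 100_000:
        return "10,000 to 99,999"
    if cma_ca_population < 500_000:
        return "100,000 to 499,999"
    if cma_ca_population < 1_500_000:
        return "500,000 to 1,499,999"
    return "1,500,000 or more"
```

The boundaries are inclusive at the lower end of each group, so a population of exactly 500,000 falls in the "500,000 to 1,499,999" group.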

Sample selection

The analytical sample (n = 1,004) was randomly selected from the census dataset based on retaining a target of 100 observations for each DMT, representative point type, and community size category (Table 1). An effort was made to sample equally from each region (Atlantic, Ontario, Quebec, Prairies, British Columbia, and northern territories). When possible, the same observations were re-used across DMT, representative point type, and community size categories.

The study intentionally oversampled from rural and mixed mode postal codes, from less accurate dissemination area representative point type, and from the smallest census agglomerations and rural and small-town Canada (fewer than 10,000 population).

Geocoding from full street addresses

To determine reference locations, all street addresses in the sample were manually searched using Google Maps with satellite imagery. The latitude and longitude for the centroid of the building location were obtained. When possible, the building address was visually confirmed with Google Street View.

Addresses not located during manual geocoding because of poor satellite imagery or a combination of lack of street address and no Street View were excluded, and new addresses were resampled (n = 121). Records that required resampling tended to be in rural and remote areas or on Indian reserves.

If possible, reported postal codes were verified on the Canada Post website.Note 11 Except for mixed mode postal codes (DMT H or T), if the first three characters did not match the expected postal code, it was presumed to be erroneous and was excluded from the analytical dataset.
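
The exclusion rule in this paragraph amounts to comparing the first three characters of the reported and expected postal codes. The sketch below is a hypothetical illustration: the function names are invented, the exempt DMTs default to H and T as stated in this paragraph, and the postal codes in the usage note are illustrative strings only.

```python
from typing import FrozenSet


def first_three(postal_code: str) -> str:
    """Return the first three characters of a postal code, ignoring spaces and case."""
    return postal_code.replace(" ", "").upper()[:3]


def keep_record(
    reported_pc: str,
    expected_pc: str,
    dmt: str,
    exempt_dmts: FrozenSet[str] = frozenset({"H", "T"}),
) -> bool:
    """Keep a record unless its first three characters differ from the expected code.

    Mixed mode DMTs in `exempt_dmts` are kept regardless of the comparison.
    """
    if dmt.strip().upper() in exempt_dmts:
        return True
    return first_three(reported_pc) == first_three(expected_pc)
```

With these definitions, `keep_record("K1A 0B1", "K2A 0B1", "A")` returns `False`, while the same pair with a DMT of `"H"` is kept.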

All addresses in the sample were matched with all possible postal code representative points from PCCF+. Most addresses had multiple possible representative points. Of 105 records with very long distances (more than 50 km) between the PCCF+ representative point and the address location from Google Maps, 51 were excluded because the postal code or street address appeared to contain typographical errors.

Distance calculations

For each record in the sample, all possible representative points were extracted from PCCF+ 6C. All of them retained their weights, starting with those that were assigned population-based weights—rural postal codes (DMT W), mixed postal codes (DMT H, K, or T), urban postal codes (DMT A, B, E, G, or M) with matches to four or more dissemination areas, and incompletely enumerated Indian reserves. Postal codes with a match to a single representative point were assigned to that point with a weight of 1. Urban postal codes with matches to multiple representative points were assigned equal weights. This weighting strategy replicates the reference location assignment process in PCCF+.
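
A simplified sketch of this weighting rule follows. It is not the PCCF+ implementation: it assumes that population-based weights, when they exist for a postal code, are passed in by the caller, and that every other multi-point postal code receives equal weights. The function name `assign_weights` is hypothetical.

```python
from typing import List, Optional, Sequence


def assign_weights(
    n_points: int,
    population_weights: Optional[Sequence[float]] = None,
) -> List[float]:
    """Return one weight per representative point, summing to 1.

    - a single representative point gets a weight of 1;
    - population-based weights, if supplied, are kept (normalised to sum to 1);
    - otherwise all points share equal weights.
    """
    if n_points < 1:
        raise ValueError("at least one representative point is required")
    if n_points == 1:
        return [1.0]
    if population_weights is not None:
        if len(population_weights) != n_points:
            raise ValueError("one population weight per point is required")
        total = sum(population_weights)
        if total <= 0:
            raise ValueError("population weights must sum to a positive value")
        return [w / total for w in population_weights]
    return [1.0 / n_points] * n_points
```

For instance, an urban postal code linked to four blockfaces would receive `[0.25, 0.25, 0.25, 0.25]`, and population counts of 70, 20, and 10 would be normalised to weights of roughly 0.7, 0.2, and 0.1.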

The manually assigned reference locations based on full street addresses and the representative points generated by PCCF+ were mapped in Geographic Information Systems (ArcGIS v.10, ESRI 2010). The Euclidean (straight-line) distance between each reference location and each possible representative point was measured.

A weighted mean distance was calculated between the reference location and the PCCF+ representative points, using all the possible representative points for each postal code. For unique matches, only one representative point was possible (Figure 1A), but for urban records with multiple possible matches, all representative points had an equal probability of being assigned (Figure 1B). Figure 1C illustrates a rural case with three possible PCCF+ representative points for the postal code, with probabilities of assigning one of the three of 70%, 20%, and 10%. Those probability weights were used to calculate a weighted mean distance between the reference location and the three PCCF+ representative points. Using this process, one mean distance was calculated for each record.
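
The weighted mean distance can be illustrated with a short calculation. The study measured distances in ArcGIS; the sketch below instead assumes the points have already been projected to planar coordinates in metres and uses plain Euclidean geometry. The coordinates are invented for illustration, and only the 70%, 20%, and 10% weights come from the rural example in the text.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float]  # (x, y) in projected metres (an assumption)


def euclidean_m(a: Point, b: Point) -> float:
    """Straight-line distance between two projected points, in metres."""
    return math.hypot(b[0] - a[0], b[1] - a[1])


def weighted_mean_distance(
    reference: Point,
    points: Sequence[Point],
    weights: Sequence[float],
) -> float:
    """Weighted mean of the distances from the reference location to each point."""
    if not points or len(points) != len(weights):
        raise ValueError("need one weight per representative point")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    weighted = sum(w * euclidean_m(reference, p) for p, w in zip(points, weights))
    return weighted / total


# Hypothetical rural record with three candidate representative points.
reference = (0.0, 0.0)
candidates = [(1000.0, 0.0), (0.0, 2000.0), (3000.0, 4000.0)]
weights = [0.7, 0.2, 0.1]
mean_distance = weighted_mean_distance(reference, candidates, weights)
```

Here the three distances are 1,000 m, 2,000 m, and 5,000 m, so the weighted mean is about 0.7 × 1,000 + 0.2 × 2,000 + 0.1 × 5,000 = 1,600 m. A record with a unique match reduces to the single distance, because its only weight is 1.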

Based on the calculated distance (or mean distance) between the reference location and the location geocoded from the postal code, measures of error, such as median distance and interquartile range (IQR), were calculated by DMT, representative point type, and community size. In addition, the percentages of the sample geocoded to within various distances (0.5, 1, 5, and 10 km) from the reference location were calculated.
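
These summary measures can be sketched with the Python standard library. The quartile method used in the study is not stated, so the `inclusive` method below is an assumption; the function name and the sample distances are hypothetical.

```python
import statistics
from typing import Dict, Sequence

# Distance thresholds from the text (0.5, 1, 5, and 10 km), in metres.
THRESHOLDS_M = (500, 1_000, 5_000, 10_000)


def summarize_distances(distances_m: Sequence[float]) -> Dict[str, object]:
    """Median, quartiles, and percentage of records within each threshold.

    Requires at least two distances, because statistics.quantiles
    raises StatisticsError on shorter input.
    """
    q1, _, q3 = statistics.quantiles(distances_m, n=4, method="inclusive")
    n = len(distances_m)
    pct_within = {
        t: 100.0 * sum(d <= t for d in distances_m) / n for t in THRESHOLDS_M
    }
    return {
        "median": statistics.median(distances_m),
        "q1": q1,
        "q3": q3,
        "pct_within": pct_within,
    }


# Hypothetical per-record mean distances, in metres.
summary = summarize_distances([100, 200, 300, 400, 6000])
```

In practice the function would be applied separately to each DMT, representative point type, and community size group.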

Of the 4,401 matches generated after merging with the postal code representative points, outliers with improbably long distances (more than 50 km) were reviewed manually (178 matches, representing 105 records). A large portion were deemed plausible, as they were in suburban or rural areas, and were geocoded using population-weighted allocation (79 matches representing 54 sample records). They were retained because residents of some remote areas may travel long distances to retrieve their mail. Furthermore, the low population weight prevented outliers from skewing the mean distance calculation. Three geographic matches in the north (two in Dawson City; one in Yellowknife) that represented extremely long distances (up to 1,068 km) were deemed improbable and removed. The remaining long distances were typographical errors in census-reported postal codes or errors in geocoding the street address and were removed (99 matches, representing 51 records). The final study sample was n = 1,004 addresses.

Results

Table 2 shows the median (and IQR) distance between the reference locations and the population-weighted mean PCCF+ representative points, by DMT, representative point type, and community size. Urban postal codes had greater accuracy and lower variability than did rural and mixed postal codes. The best accuracy was for large apartment buildings (DMT B) with a median distance of 110 m (IQR 60 m to 200 m), followed by ordinary households (DMT A) with a median distance of 160 m (IQR 80 m to 320 m). Commercial and institutional services (DMT E, G, or M) had relatively good accuracy, with a median distance of 210 m, but more variability (IQR 80 m to 1.25 km). The greatest median distances were for mixed mode services (rural services from urban post offices, DMT H, K, or T): 5.87 km (IQR 3.81 km to 8.75 km) and rural services (DMT W): 5.20 km (IQR 2.93 km to 9.24 km).

The finer the scale of the data source for representative point types, the better the accuracy. The median distance for blockfaces was 130 m (IQR 60 m to 230 m), compared with 330 m (IQR 160 m to 750 m) for dissemination blocks and 5.53 km (IQR 2.89 km to 8.66 km) for dissemination areas, 95% of which had mixed and rural postal codes (data not shown).

Accuracy generally improved with increasing community size. Rural and small-town areas had the longest median distance: 5.60 km (IQR 2.42 km to 9.48 km). Median distances for other community size groups varied from 160 m (IQR 80 m to 830 m) in the largest urban size group to 330 m (IQR 130 m to 3.93 km) in the smallest.

Table 3 shows the percentage of the study sample geocoded by PCCF+ to within various distances from the full street address. Of the sample with urban postal codes, 87% of ordinary households (DMT A), 92% of large apartment buildings (DMT B), and 66% of businesses or institutions (DMT E, G, M) were geocoded to within 500 m of their full street address. By contrast, only 47% with rural postal codes (DMT W) and 39% with mixed urban and rural postal codes (DMT H, K, T) were geocoded to within 5 km of their residence.

By representative point type, 91% of records that geocoded to a blockface and 65% that geocoded to a dissemination block were within 500 m of their full street address. Only 46% of records with dissemination area representative points were geocoded to within 5 km of their residence.

The percentages of the sample geocoded to within 500 m of their full street address were 56% for communities of 10,000 to 99,999 population and more than 70% for communities of at least 500,000. Just 46% of the sample in rural areas and small towns were geocoded to within 5 km of their residence.

Discussion

This study examines the positional accuracy of geocoding based on PCCF+ compared with geocoding from full street addresses.

The results support the literature on postal code geocoding—urban areas have much less error than rural areas.Note 12Note 13 The analysis further demonstrates the importance of DMT to geocoding accuracy. In addition to previously documented problems geocoding rural regions,Note 12Note 14 this study reveals high levels of error associated with mixed DMTs. Such DMTs include rural and suburban routes from urban post offices and general delivery and post office box services at urban post offices, which serve many rural residents.

Even in urban areas it is important to distinguish between ordinary residential and commercial or institutional delivery. Large commercial and institutional establishments with more than one physical location (campuses, branches, outlets) may have all their mail addressed to a single postal code (DMT M, with lock box or bag pick up at a post office) and make deliveries from that location to the final destinations. It is also possible that some of the reported census postal codes with a DMT of E, G, or M pertained to place of work rather than residence. Either problem may have contributed to the greater distances observed for urban business and institutional postal codes (DMT E, G, M), compared with urban residential postal codes (DMT A, B).

Community size and representative point type influenced positional accuracy, largely because of their relationship to the mode of mail delivery (DMT). Most residents of rural and small-town Canada use rural or mixed mode postal codes, and in PCCF+, such postal codes are almost exclusively linked to dissemination area centroids.

Despite the risk of positional inaccuracy and population misclassification associated with various DMT categories, removal of a specific DMT or community size group might introduce selection bias.

The positional accuracy required for a study depends on the spatial resolution of the environmental or contextual measures of interest.Note 2Note 3 A mismatch between the measure and the positional accuracy of geocoding from postal codes alone can result in misclassification. A comparison of 1996-Census-assigned enumeration area income quintiles with those assigned by an older version of PCCF+ showed that misclassification ranged from 3% to 10% in urban areas, but from 39% to 50% in rural areas.Note 14 In urban areas, misclassification tended to result in assignment to the next-higher or -lower income quintile; in rural areas, 15% to 20% of misclassification was off by two or more quintiles.

The trend in air pollution studies is toward finer-scale data. For instance, satellite-derived ambient fine particulate matter (PM2.5) estimates were developed at 10-km² resolution,Note 15 but have been refined to 1-km² resolution.Note 16 Finer spatial resolution is possible in urban population centres, but may be problematic in rural areas, where the accuracy of geocoding from postal codes is much reduced.

The demand for urban form data (for example, walkability, density, urban sprawl, and greenness) has increased—for instance, in studies of the impact of the built environment on physical activity, overweight, and obesity.Note 17Note 18Note 19Note 20 Some of these impacts may be examined at coarser scales, such as census metropolitan area or dissemination area,Note 18Note 21 but others rely on environmental data specific to a residential block.Note 20 The degree of spatial resolution in urban form studies typically varies between 500 m (a 5-minute walk) and 1 km. Based on the present analysis, such studies should exclude rural (DMT W) and mixed (DMT H, J, K or T) delivery modes, but could include urban residential delivery modes (DMT A and B) in smaller communities.

Most Canadians have a single postal code that corresponds to their residence. Others, particularly in rural areas, may have an additional postal code corresponding to a post office box or rural route from an urban post office. Postal codes that correspond to post office boxes and rural routes from urban post offices are common in population-based files; however, geocoding from such “mixed” mode postal codes has the same level of accuracy as geocoding from rural postal codes.

Some postal codes with a DMT of E, G, or M may correspond to legitimate places of residence (for instance, nursing homes, university residences, prisons, apartments in a mainly business building), but the majority are commercial or institutional establishments. Their inclusion in administrative records probably reflects reporting place of work rather than residence. PCCF+ flags such records to draw attention to the possible non-residential nature of those postal codes.

Strengths and limitations

A strength of this study was the ability to randomly sample address data from the Census of Population, the most complete survey of residential postal codes in Canada.Note 22 This made it possible to compile an analytical file that included different DMTs, representative point types, and community sizes across the country. As well, the study benefitted from the availability of Google mapping software to help manually geocode addresses using satellite photography and confirm the addresses using Google Street View. Ortho-rectified aerial photography and satellite imagery are among the most accurate methods for geocoding residences.Note 1Note 3Note 23

Two types of error arise when using self-reported address data, particularly postal codes, for spatial analysis: reporting errors and geocoding or positional/geographic assignment errors. Although this study focused on the latter, attempts were made to mitigate census reporting errors, particularly in the postal code. Errors in the first three characters (forward sortation area) generate larger spatial errors than do those in the last three characters (local delivery unit). An evaluation of the impact of address error on public health surveillance in Montreal found that of the 10% of records in the dataset with errors in the address, almost 80% were the result of inaccurate postal codes; 20% contained errors in the forward sortation area, 60% contained errors in the local delivery unit, and 20% contained errors in both.Note 24 For the present study, when possible, postal codes were verified on the Canada Post websiteNote 11; census postal codes (other than for DMT H, K or T) whose forward sortation area did not match the forward sortation area generated from the full address were excluded.

Reference locations in this study were assigned from Google Maps satellite imagery, which provided the latitude and longitude of roof-top centroids for full street addresses. However, a degree of error is associated with this method.Note 25Note 26 Furthermore, the quality of the satellite images varied and may have contributed to error in assigning the home centroid. This was mitigated by excluding all addresses for which a centroid could not be selected for the house or a small cluster of adjacent buildings (for example, farmhouse and adjacent barns). The resampling that resulted from poor-quality satellite imagery and lack of Google Street View may have introduced bias, because the affected areas are more remote and the exclusions disproportionately affect certain groups, such as residents of the far North or Indian reserves.

As indicated by the wide IQR for DMT A (80 m to 320 m), urban DMTs can, themselves, be heterogeneous, probably reflecting the density of dwellings and period of construction (post-World War II suburban development versus pre-war downtowns). Further analysis could explore the implications of such characteristics on the accuracy of geocoding from postal codes.

Conclusion

Based on address information from a random sample of census respondents, this study assesses positional accuracy of geocoding from postal codes versus full street addresses. Positional accuracy was related to mode of delivery (urban, rural, or mixed), source of latitude and longitude information (blockface, dissemination block, or dissemination area centroids), and community size. Differences across community size groups and representative point types were related to mode of mail delivery.

The results highlight the impact of delivery mode type on positional accuracy. Both rural and mixed (partly urban, partly rural) postal codes had much higher geocoding error than did urban postal codes. These findings demonstrate the importance of understanding how the positional accuracy of geocoding from postal codes differs depending on the nature of the postal codes. Because such differences can substantially alter the effect of environmental and contextual variables, first stratifying by delivery mode type or community size is recommended. The spatial resolution required by the environmental or contextual measures of interest can help analysts to identify subpopulations that should be excluded from a study.

References