Section 6: Working with Data

  1. Using Microdata Files
  2. Statistical Analysis Software
  3. Using Software
  4. Geospatial Data
  5. Using Geospatial Software

Using Microdata Files

Microdata files must be used together with the metadata that describes them. Using the metadata and command files, system files are created which allow the user to perform extractions from the data file and make sense of the results. The steps necessary, or useful, in combining microdata and metadata are listed below. First, terminology is provided to help you understand the steps used in creating a microdata file.

Terminology

  • Command file: The command file, also known as the syntax file, the data set description, the setup file or the create file, is created to define a microdata file. It is written in a statistical analysis software language (e.g., SPSS, SAS). The command file usually provides the name of the microdata dataset, the variable locations (column locations and decimal declarations), variable names, variable labels, and missing value locations.
  • System file: When the statistical analysis software package runs the command file against the raw data file, a system file is created, which can be saved. Its format is specific to the statistical analysis software package you are using (e.g., SPSS, SAS).

Getting the file ready for use

The following steps should give you an idea of the process used to get microdata files ready for use:

  1. Locate and download the data file (PUMF or synthetic file) and check whether or not a system file is already available for your desired statistical software package.
  2. Download the metadata which accompanies the data file. This may include a command file for the software package you wish to use. If a command file is included, you will have to modify it to point to the raw data file where you have downloaded it (after which you should save the revised command file). Skip to step 4.
  3. If the command file is not part of the DLI Collection, it must be created using the file's record layout, data dictionary, user's guide, etc. Although these resources will provide you with the position of the field and variable labels, you will need to enter the programming text to run the program (for example, where to find the data set, where to save the dataset, and the syntax the program needs to recognize the variable fields and labels). A good tip for the less experienced user is to use an existing command file (either from a previous cycle of the survey or from another survey) and adjust it to meet your needs by replacing the key components. Another suggestion is to post a question on the dlilist to see if someone from the DLI Community has created the file already.
  4. Once you run the command file without encountering any errors, you have created your system file. Save the system file in the same directory as the raw data and command files, being sure to assign it a file name that does not overwrite either the raw data or command file. You can start running frequency counts or cross-tabulations. A very good idea is to check the frequency counts of your system file with those published by Statistics Canada. You can usually find Statistics Canada's frequency counts in Odesi or Nesstar. If you find differences between them, you may have made an error in your programming.
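The steps above can be sketched in Python for readers who want to see what a command file does behind the scenes. This is only an illustration under stated assumptions: the record layout, the variable names, the value labels and the three sample records are all hypothetical and are not taken from any Statistics Canada file. The sketch maps column positions to named variables, attaches value labels, and produces the frequency count recommended in step 4.

```python
from collections import Counter

# Hypothetical record layout: variable name -> (start column, end column),
# 1-based and inclusive, the way a record layout document usually lists them.
LAYOUT = {
    "PROV": (1, 2),
    "SEX": (3, 3),
    "AGE": (4, 6),
}

# Hypothetical value labels, as a data dictionary would provide them.
VALUE_LABELS = {"SEX": {"1": "Male", "2": "Female"}}


def parse_record(line, layout=LAYOUT):
    """Slice one fixed-width record into a dict of named variables."""
    return {
        name: line[start - 1:end].strip()
        for name, (start, end) in layout.items()
    }


def frequencies(records, variable, labels=VALUE_LABELS):
    """Count each value of `variable`, using value labels where defined."""
    lookup = labels.get(variable, {})
    return Counter(lookup.get(r[variable], r[variable]) for r in records)


# Three made-up raw records standing in for the downloaded data file.
raw_lines = ["351 34", "242 61", "351 19"]
records = [parse_record(line) for line in raw_lines]
sex_counts = frequencies(records, "SEX")
```

Comparing a result such as `sex_counts` with the published frequencies is the same check described in step 4: a mismatch usually points to a wrong column position in the layout.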

Check out Chuck Humphrey's presentation, "Coping with SPSS Syntax Files on the DLI FTP Site", to answer many of the questions DLI Contacts commonly pose.


Statistical Analysis Software

Statistical analysis software is necessary to make microdata files useable. It is used to combine the microdata and metadata (in the form of a command file) to create a system file which can be used for analysis.

About statistical analysis software

Statistical analysis software packages are comprehensive systems for analyzing data. They can take data from almost any type of file and use them to generate tabulated reports, charts, plots of distributions and trends, descriptive statistics and complex statistical analyses.

Different types of statistical analysis software

There are many different statistical software packages available. Three of the more common ones are SPSS (Statistical Package for the Social Sciences), SAS (Statistical Analysis System) and Stata. These packages are command-based, and are available for Windows, Macintosh, and UNIX operating systems. The open-source language R is increasingly used for work with numeric and geospatial data as well as for data visualization.

Michelle Edwards, University of Guelph, created "SPSS, STATA, and SAS: Flavours of Statistical Software", an excellent tutorial which identifies the differences between these statistical analysis software packages.

Converting to other formats

Options are available if a researcher wishes to use the data with software different from the currently available system file.

Stat/Transfer is a program that converts files from one statistical analysis software format to another. This allows, for example, an SPSS file to be read by SAS and vice versa. It is not expensive and is very user friendly.

If the currently available system file is stored in SPSS, and you are running SPSS version 14 or higher, you can save directly from SPSS to other statistical formats (SAS, Stata) using the SAVE AS command from the FILE menu.


Using Software

Beyond 20/20

The Beyond 20/20 Browser is a free Windows-based program used to view Beyond 20/20 tables and extracts. Beyond 20/20 is a data browser used at Statistics Canada to organize, manage and disseminate socioeconomic data.

Beyond 20/20's pivoting and nesting capabilities make it easy to switch dimensions and show more than one dimension along rows and columns. In addition, Beyond 20/20's dynamic data format enables you to quickly and easily integrate and manipulate information from your own data sources. It allows users to display data from different perspectives, perform calculations on data, create simple charts, and save data in formats for use in other programs (along with other operations). Most notably, Beyond 20/20 allows you to save extracted data in dBASE format (.dbf), which is the best option for use with ArcGIS (see the Using Geospatial Software section below). For more information on using Beyond 20/20, see Richard Boily and Siobhan Hanratty's Beyond 20/20 presentation from the DLI Bootcamp 2011.

Exporting PDF file tables into Excel

Many PDF files contain tables of statistics which cannot be manipulated by users in this format. These tables can be exported into an Excel spreadsheet. To convert tables of statistics into Excel, either the full version of Adobe Acrobat or specialized software such as PDF2Excel is required. Some open source solutions exist as well. Note: Tables cannot be exported to Excel if the PDF document is a scanned image rather than text.

Importing Excel file tables into SPSS

In order to import Excel data into SPSS, make sure the Excel spreadsheet is formatted as follows:

  • The spreadsheet should have a single row of variable names across the top of the file, and each variable name should begin with letters.
  • The data should begin in the first column and second row of the Excel file.

To open an Excel file, select File/Open/Data… or File/Read text data from the menu in the Data Editor window in SPSS. Select the format of your file in the dropdown menu beside "Files of type:". Select the Excel file from your folder directory. Next, a dialog box will appear. Since the variable names are on the top row in Excel, leave the Read variable names from the first row of data option checked. Next, select the desired Worksheet from the drop-down menu. You may ignore the remaining options and choose OK. You should now see data in the Data Editor window. Check to make sure that all variables and cases were read correctly. Next, save your dataset in SPSS format by choosing the Save option in the File menu.
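Before importing, the formatting rules listed above can be checked automatically. The following Python sketch is an illustration only: it assumes the worksheet has already been read into a list of rows (the first row holding the variable names), and the function name `check_sheet` and the sample rows are hypothetical. The requirement that names start with a letter comes from the list above; limiting the remaining characters to letters, digits and underscores, and requiring unique names, are extra conservative assumptions.

```python
import re

# Names must begin with a letter (from the formatting list above). Allowing
# only letters, digits and underscores afterwards is a conservative assumption.
VALID_NAME = re.compile(r"^[A-Za-z][A-Za-z0-9_]*$")


def check_sheet(rows):
    """Return a list of formatting problems found in a sheet given as rows."""
    if len(rows) < 2:
        return ["sheet needs a header row followed by at least one data row"]
    problems = []
    header = rows[0]
    for position, name in enumerate(header, start=1):
        if not VALID_NAME.match(str(name)):
            problems.append(f"column {position}: invalid variable name {name!r}")
    if len(set(header)) != len(header):
        problems.append("variable names are not unique")
    # Every data row should supply one value per variable.
    for number, row in enumerate(rows[1:], start=2):
        if len(row) != len(header):
            problems.append(
                f"row {number}: expected {len(header)} values, found {len(row)}"
            )
    return problems


# Made-up sheets: the second has a variable name that starts with a digit.
good = [["id", "age"], [1, 34], [2, 61]]
bad = [["id", "2019_income"], [1, 50000]]
```

An empty list from `check_sheet` means none of these particular problems was found; it does not guarantee that SPSS will read every value as intended, so the visual check in the Data Editor is still worthwhile.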

Exporting an SPSS file to Excel

To export an SPSS file to Excel, click on File/Save as... from the menu in the Data Editor window in SPSS. Change the Save as type… to Excel 2007 through 2010 (*.xlsx) and ensure that Write variable names to spreadsheet is checked. If you wish to export value labels rather than the underlying data values, check Save value labels where defined instead of data values. Other variable information, such as missing value definitions and variable labels, is not included in the exported Excel file. Next, select the folder into which you wish to save the file. In Excel, the variable names will be on a single row across the top of the file, while the data will begin in the first column and second row.

GeoSuite

Using the free GeoSuite software, users are able to find population and dwelling count data for all standard geographic areas in a given census year, determine dissemination area correspondence between the current and previous census years, and explore the links between geographic areas/geographic units. For example, you can use GeoSuite to list all Census Subdivisions (CSDs) within a Census Metropolitan Area (CMA) or Census Agglomeration (CA), or list all CSDs around a CMA/CA and determine how heavily influenced they are by the CMA/CA. This resource is ideal for understanding the hierarchical relationship among related geographic units. Since 2016, GeoSuite Web has been available online. The web application has most of the functionality of the downloadable (MS Access) version of GeoSuite.

Postal Code(OM) Conversion File

The Postal Code(OM) Conversion File (PCCF) is a digital file that provides a correspondence between the six-character postal code and geographical areas for which census data and other statistics are produced. The PCCF can be used to link data with postal code identifiers to census characteristics at any standard level of geography (e.g., dissemination or enumeration area, census tract). It can be used to create files for any time period since the introduction of postal codes, so data identified with old postal codes can be merged to the current geographic classification.
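The linking idea can be sketched in a few lines of Python. This is a simplified illustration, not the PCCF's actual structure: the conversion table, the census tract identifiers, the field names and the client records below are all made up, and a real conversion file can hold more than one record per postal code, which this sketch ignores.

```python
def normalize_postal_code(code):
    """Upper-case a postal code and drop spaces so 'k1a 0b1' matches 'K1A0B1'."""
    return code.replace(" ", "").upper()


# Hypothetical conversion table: postal code -> census tract identifier.
CONVERSION = {"K1A0B1": "5050001.00", "M5V2T6": "5350011.00"}


def attach_geography(records, conversion=CONVERSION):
    """Split records into those linked to a geographic area and those not found."""
    linked, unmatched = [], []
    for record in records:
        key = normalize_postal_code(record["postal_code"])
        if key in conversion:
            linked.append({**record, "census_tract": conversion[key]})
        else:
            unmatched.append(record)
    return linked, unmatched


# Made-up records carrying postal code identifiers.
clients = [
    {"id": 1, "postal_code": "k1a 0b1"},
    {"id": 2, "postal_code": "X0X 0X0"},
]
linked, unmatched = attach_geography(clients)
```

Keeping the unmatched records in a separate list makes it easy to see how many postal codes failed to link, which is often the first sign of retired or mistyped codes.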


Geospatial Data

The difference between maps and geospatial data

Static Maps

Static maps are published for every census year. As of 2011, static Census maps are only available electronically in PDF. Two map products are usually published: reference maps and thematic maps. Reference maps show the geographic areas for which census data are tabulated and disseminated. Thematic maps show the spatial distribution of one or more specific data themes for standard geographic areas. These maps are meant to be printed, as opposed to geospatial data, which are used to create your own maps in a geographic information system (GIS) such as Esri ArcGIS or QGIS.

These maps are available for free download from a given year's Census website under Geography: Reference maps and Thematic maps (examples given are from the 2011 site).

Interactive maps

The Maps and geography section of the Statistics Canada site also contains links to a section for Interactive maps (under "Maps"). This section links to maps showing various places and locations, census and non-census boundaries, patterns, and distribution based on interaction with the user and the map (including mapping applications, and Data products with mapping applications).

Geospatial data

Geospatial data define one or more geographic areas and their dimensions, using points, lines, polygons, or pixels. Using geographic information system (GIS) software, such as the proprietary programs Esri ArcGIS or MapInfo or open source programs such as QGIS (formerly Quantum GIS), statistics can be combined with geospatial data to create thematic maps. Geospatial files are joined to census statistics through a common ID field, usually the unique ID for every geographic unit.
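The join on a common ID field can be illustrated without any GIS software. The Python sketch below assumes the attribute records of a boundary file and a table of statistics have already been loaded; the identifiers, area names, population values, class breaks and function names are all hypothetical.

```python
# Hypothetical boundary features, reduced to their attribute records.
features = [
    {"geo_id": "1001", "name": "Area A"},
    {"geo_id": "1002", "name": "Area B"},
    {"geo_id": "1003", "name": "Area C"},
]

# Hypothetical statistics keyed by the same identifier. Keeping IDs as
# strings preserves any leading zeros, which would be lost as integers.
population = {"1001": 5200, "1002": 1300}


def join_statistics(features, statistics, key="geo_id", field="population"):
    """Copy each feature and add the matching statistic, or None if absent."""
    return [{**f, field: statistics.get(f[key])} for f in features]


def classify(value, breaks=(2000, 5000), labels=("low", "medium", "high")):
    """Assign a value to a class, as a thematic map legend would."""
    if value is None:
        return "no data"
    # Count how many break points the value reaches to pick its class.
    return labels[sum(value >= b for b in breaks)]


joined = join_statistics(features, population)
classes = {f["geo_id"]: classify(f["population"]) for f in joined}
```

Areas with no matching statistic are kept and flagged as "no data" rather than dropped, so gaps remain visible on the resulting map.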

Statistics Canada's Reference Materials

Statistics Canada's Census pages contain two great reference sections which will help researchers learn about the intricacies of STC geography: Reference materials, which includes a link to the Census Dictionary, and Reference documents within the Geography section, which includes links to the Illustrated Glossary and the Geography Catalogue.

Geographic Products

Digital Boundary Files

Digital Boundary Files (DBFs) display the official boundaries used for Census collection and, therefore, often extend as straight lines into bodies of water. Unlike DBFs, Cartographic Boundary Files (CBFs) are modified to follow the coastlines and shorelines on the perimeter of Canada's land mass, including major islands.

Boundary files are published in ArcGIS, Geography Markup Language and MapInfo formats. Boundary files have been available in shapefile (.shp) format since 2006 and in interchange (e00) format prior to this. Shapefiles can be used directly in ArcGIS, whereas interchange files must first be converted using tools built into ArcCatalog. The MapInfo files are published in MIF/MID interchange format (which must be imported before use) or MapInfo TAB format.

In addition to the Statistics Canada website, DLI contacts can also download boundary files for preceding census years dating back to 1971 on the DLI EFT site and the DLI WDS server.

Please note: The older executable files you will download from the FTP site require an older operating system to run. For spatial files prior to 2001, you may need to run an older version of Windows or have Windows XP Mode and Windows Virtual PC running on your Windows machine (particularly if you are running a 64-bit version of Windows).

Road Network Files / Block Boundary Files

The Road Network Files (RNFs) provide national coverage of roads, province/territory boundaries, and other visible features such as hydrography, as well as attribute information (for example, street names and address ranges for streets with assigned addresses). The road layer in RNFs includes geographic codes to identify blocks, census subdivisions and census metropolitan areas/census agglomerations (as polygon attributes). The only way to access block geographic areas is through the RNF.


Using Geospatial Software

Many of the geographic products, such as boundary files, can only be opened with software installed on your computer that can read these files. Geographic information system (GIS) software is used to read geospatial data, such as shapefiles (.shp). If you do not have such software, you can download a program that reads these files. Alternatively, you can convert the files to formats that other software can read. Note that some devices may not be able to open these file formats.

ArcInfo

Barbara Znamirowski, Trent University, Nancy Lemay, University of Ottawa, and Jenny Marvin, University of Guelph, are the authors of "Using Statistics Canada Geospatial Data with ArcGIS 9x (ArcInfo)". The primary intent of this presentation is to provide practical training in using Statistics Canada geography files with the leading industry standard software: Environmental Systems Research Institute, Inc. (Esri) ArcGIS 9x. Readers will be introduced to the key features of ArcGIS 9x, as well as to geographic concepts and principles essential to understanding and working with geographic information systems (GIS) software. An accompanying exercise is also available for this presentation.

Peter Peller, University of Calgary, and Daniel Brendle-Moczuk, University of Victoria, are the authors of "Back to the Basics: The Fundamentals of Working With Statistics Canada Boundary Files". The session material reviews the basic knowledge and skills that DLI contacts need to work with Census boundary files, such as the differences between Digital Boundary Files and Cartographic Boundary Files, projections, feature selection, new layer creation, clipping and splitting, and spatial joins.

Other

Natalie O'Toole and Peter Peller demonstrated the use of PSPP and Quantum GIS to extract and map DLI data in their presentation "Using Free, Open-Source Tools to Extract and Map DLI Data."