Statistics Canada

Section 4: Data

About Data
Metadata
Aggregate Data
Microdata
Using Microdata Files
Statistical Analysis Software

About Data

The difference between statistical information, statistics and data

As a data professional, it is important to recognise the distinction between statistical information, statistics and data.  This important difference was introduced to the DLI community primarily by Chuck Humphrey, University of Alberta.

The following are the highlights of Chuck Humphrey’s “DLI Orientation: A Framework for Thinking about Statistical Information” which describes the distinction between the three.

Statistical Information

Statistical information consists of the facts and figures displayed in statistical tables produced from data and is the added value arising from the interpretation of the numbers.

Statistics

Statistics are the tables and cross-tabulations that have been formulated from the raw data files.  These can take the form of e-publications, e-tables or databases.  These statistics may be:

  • Produced by users with the help of databases (such as CANSIM or Beyond 20/20) to organise the data based on their own needs.  Users can define the level of geography, characteristics of the population, etc. and create a customised view of the data. View an example of a database in PDF format.
  • Produced by Statistics Canada to answer the questions most frequently posed by its users.  These are known as e-tables. They are static, so users cannot customise them. View an example of an e-table in PDF format.

Data

Data are numeric files created and organized for analysis. There are two types of data: aggregate data and microdata.  They are two sides of the same coin: both offer the user more control over the variables available for analysis than published statistics do.

Aggregate data are statistical summaries organized in a specific data file structure that permits further computer analysis, that is, data processing.  Aggregate data can also be used for display purposes since the data structure consists of statistics previously processed.  Aggregate data are found in numerous formats, which are explored in the section Different types of aggregate data.  Some of the more common ways to access aggregate statistics are through CANSIM, Beyond 20/20 tables, spreadsheets, databases, etc.

Microdata consist of the data directly observed or collected from a specific unit of observation. For a typical Statistics Canada microdata file the unit of observation is probably an individual, a household or a family.

The microdata file is composed of individual records consisting of a row of numbers. Microdata require processing before they become ready for interpretation. In order to make use of a microdata file, metadata must be consulted to identify variables and statistical software employed to analyze the data.   You will find more detailed information about using microdata files below in the section Using microdata files.

Classification systems used at Statistics Canada

Statistics Canada uses standard classifications to facilitate the use of common characteristics or variables across surveys and administrative databases.  Classifications may be updated (as noted below by the version years listed).  Concordances between versions are available to compare changes in classifications over time.
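A concordance can be pictured as a lookup table from the codes of one version of a classification to the codes of another. The sketch below is illustrative only: the codes and the `CONCORDANCE` mapping are invented and are not taken from any actual Statistics Canada concordance.

```python
# Hypothetical concordance between two versions of a classification.
# All codes are invented for illustration; real concordances are
# published by Statistics Canada.
CONCORDANCE = {
    "A10": ["A10"],         # code unchanged between versions
    "B20": ["B21", "B22"],  # one old code split into two new codes
    "C30": ["C35"],         # code renumbered
}


def convert(old_code):
    """Return the list of new-version codes for an old-version code."""
    if old_code not in CONCORDANCE:
        raise KeyError(f"No concordance entry for code {old_code!r}")
    return CONCORDANCE[old_code]


def needs_review(old_code):
    """A one-to-many mapping cannot be converted automatically."""
    return len(convert(old_code)) > 1
```

One-to-many entries such as the hypothetical `B20` are the cases where comparing data over time requires a judgment call rather than a simple recode.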

The most commonly used classifications are:

  • Industry
    • NAICS – North American Industry Classification System (1997, 2002, 2007)
    • SIC – Standard Industrial Classification (1980)
  • Occupation
    • NOC-S – National Occupational Classification for Statistics (2001, 2006)
    • SOC – Standard Occupational Classification (1991)
  • Product
    • SCG – Standard Classification of Goods (2000 and 2001)
    • SCTG – Standard Classification of Transported Goods (1996)
  • Geography
    • SGC – Standard Geographical Classification (1996, 2001, 2006)
    • Health Regions (2000, 2003, 2005)

The full listing of classifications is available by visiting the Definitions, data sources and methods section of Statistics Canada’s website and consulting the sub-section entitled “Standard classifications”.


Metadata

About metadata

Metadata is the documentation that accompanies and assists users in the interpretation of microdata, aggregate data and geographic files. The information usually includes the definition of variables and description of their classification schemes, the description of the methodology used in collecting, processing and analysing the data, and information on the accuracy of the data.

Different types of metadata

Metadata can consist of many different documents including those found in Statistics Canada’s Definitions, Data Sources and Methods (also known as the Integrated Metadata Base or IMDB): survey questionnaires, instructions to interviewers, codebook, user’s guide, record layout, data dictionary, frequency file, cv tables, etc.  Please note that codebooks, record layouts, user guides and data dictionaries have overlapping properties.

  • Statistics Canada Definitions, data sources and methods (IMDB): If it is not already, this should be one of your best friends!  IMDB is the common name used to describe the list of surveys performed by Statistics Canada. The IMDB includes quick descriptions of information pertinent to the survey.  The STC IMDB provides each survey’s status, frequency, questionnaire and reporting guide, description, data sources, methodology, data accuracy, target population, instrument design, sampling, error detection, imputation, estimation, quality evaluation, and disclosure control.

  • Questionnaire: This tool is helpful to assess the questions posed to the respondent and how the questions were formulated.  It is very important to researchers who may have to go to the RDC - if a question is asked in the questionnaire, and not reported on the PUMF, access to the variable is only available through the RDC program. It also gives context to the question – “Was the question posed the way I thought it was?” Note: interviewer instructions are commonly included in the questionnaire.

  • Interviewer instructions: Interviewer instructions indicate how the data were collected and also show the skip patterns in the questionnaire (which helps explain why the population for certain variables may be lower than the total population).  Other instructions can facilitate the interpretation of the data as well.

  • User’s guide:  The user’s guide contains information to help the user interpret the survey data. It has overlapping properties with the data dictionary, record layout and codebook as it often contains all the documentation pertaining to a survey (such as the sampling methodology, population sampled, variable descriptions, position, labels, etc.).

  • Codebook: A codebook is a generic term often used to describe the user’s guide, record layout and data dictionary or combinations of these documents.  In its earliest usage, the codebook contained the rules for assigning numeric codes to the responses for questionnaire items. However, as applied by Statistics Canada recently (in that the data dictionary normally is assigned a “_cbk” extension), it typically provides variable-specific metadata - question text, response values, missing value declarations, variable universe, etc.

  • Record layout: The record layout provides variable names, column positions in the data file, and number of decimals.  It is often distributed in .xls format - and hence, can be exported to ASCII and used to create SPSS/SAS/Stata command files. Similar to the codebook, it can provide variable breakdowns and the codes for the responses.

  • Data dictionary: The data dictionary is an excellent source to find general information about the variables in a survey, the codes for variables, missing value assignments, and frequency counts.  This document has overlapping properties with the codebook, user’s guide and record layout.

  • Frequency file: The frequency file contains a list of the frequencies for the responses in the dataset, that is, the number of respondents who responded to each of the possible answers for a question. Some variables are continuous and are not included in the frequency file (e.g., the weight variable). This file may also include weighted and unweighted frequencies.

  • CV tables: In order to assess the quality of data, many surveys provide CV tables, which report coefficients of variation (the standard error of an estimate expressed as a percentage of the estimate).  These can be simple tables, but some surveys offer bootstrap weights to calculate these in a different way.  CV tables are also referred to as variability tables.
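As a rough illustration of what a CV table reports, the sketch below computes a coefficient of variation from an estimate and its standard error using the general definition (standard error divided by the estimate, times 100). The figures are invented, and the function name is an assumption made for this example. It does not reproduce any survey's actual CV tables or bootstrap procedure.

```python
def coefficient_of_variation(estimate, standard_error):
    """Return the CV as a percentage: 100 * standard error / |estimate|."""
    if estimate == 0:
        raise ValueError("The CV is undefined for an estimate of zero")
    return 100.0 * standard_error / abs(estimate)


# Invented figures: an estimate of 50,000 persons with a standard error of 2,500.
cv = coefficient_of_variation(50_000, 2_500)
print(f"CV = {cv:.1f}%")
```

A lower CV indicates a more reliable estimate. The user's guide for each survey states how its CVs should be interpreted.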


Aggregate Data

Aggregate data are produced to provide access to data that cannot be released as microdata, such as the surveys based on Statistics Canada’s Business Register, and to organize statistics into data tables.

About aggregate data

Chuck Humphrey, University of Alberta, defines aggregate data as: “…consist[ing] of statistics that are organized into a data structure and stored in a database or in a data file. The data structure is based on tabulations organized by time, geography, or social content.” 

The variables in an aggregate data file do not lend themselves to generating cross-tabulations of individuals since the initial unit of observation has been replaced by time, geography or a social construct.
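The change in the unit of observation can be sketched with a few invented records. In the example below, the microdata have one row per respondent, while the aggregate has one row per province. The variable names and values are made up for illustration.

```python
from collections import Counter

# Invented microdata: one record per respondent (province, sex code, age).
microdata = [
    ("ON", 1, 34), ("ON", 2, 51), ("ON", 2, 29),
    ("QC", 1, 45), ("QC", 2, 38),
]

# Aggregation: the unit of observation becomes the province.
counts_by_province = Counter(province for province, _sex, _age in microdata)

print(dict(counts_by_province))
```

Once only `counts_by_province` is kept, the breakdown by sex or age cannot be recovered from it, which is why a cross-tabulation of individuals is no longer possible.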

Not all aggregate data contain the combination of variables from the microdata that a user may desire.  For example, a patron may be looking at whether alcohol use and gambling are correlated and wishes to know if these variables differ between men and women, by age group, and whether the results vary across Canada.  Although data in the Canadian Community Health Survey (CCHS) 3.1 are collected about the respondent’s geography, gender, age, Canadian Problem Gambling Index, and alcohol use, this combination of variables may not have been used in creating an aggregate data product.

A more detailed description of aggregate data is available in Chuck Humphrey’s presentation “DLI Orientation: A Framework for Thinking about Statistical Information”. 

Different types of aggregate data

Aggregate data are delivered in a variety of formats, including CANSIM, Beyond 20/20, spreadsheets and databases.

CANSIM offers time series data.  It is an excellent source for social and economic data.  It provides the data in many formats, including comma-separated values (CSV) which have become an important format often manipulated and analysed using spreadsheets.

Beyond 20/20 is software that allows the user to manipulate a pivot table to create and reshape a data file.  Increased availability of GIS software has created greater demand for Census statistics organized as aggregate data.

Statistics Canada uses Beyond 20/20 to disseminate many of its aggregate data products.  It is primarily used to display business or administrative data. The Census of Population and the Census of Agriculture make excellent use of Beyond 20/20 to disseminate their data on the Statistics Canada website. The DLI Collection has a few products (e.g., Canadian Business Patterns, Survey of Innovation, etc.) that use Beyond 20/20 as navigation software.

Some aggregate data are available in a format directly usable by spreadsheets and databases.  The DLI Collection holds a number of products in these formats and some examples are: Justice Statistics; Education tables; and Small Area and Administrative Data.  Databases are not as common in the DLI Collection. 


Microdata

About microdata

“Microdata are raw data organized in a file where the lines in the file represent a specific unit of observation and the information on the lines are values of variables.” [See Humphrey]

Perhaps it would be easiest to visualise a microdata file to explore this topic further.

When Statistics Canada conducts a survey, it collects information from each unit of observation (e.g., individual, household, etc.).  It processes these answers by coding them using a specific number to identify the respondent’s answer.  For example, Statistics Canada often uses a “1” to represent males and a “2” to represent females.  The microdata file is created by coding and electronically recording each survey respondent’s responses to all relevant questions.

A microdata file consists of rows of numbers and letters; each row represents one respondent’s responses to the questionnaire. The file contains one logical record per respondent, where the logical record includes all responses made by a single respondent to the questionnaire. Each logical record consists of one or more physical records (lines of data); typically, Statistics Canada files use one physical record per logical record. Since the variables are coded (rather than readable as text), the codes are not revealing in themselves, and the metadata must be used to interpret the data file.
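To make this concrete, here is a minimal sketch of how a coded record might be read with the help of its metadata. The record layout, variable names and value labels are hypothetical (only the 1 = male, 2 = female coding comes from the example above). A real file would be read with the layout and codebook supplied for that survey.

```python
# Hypothetical record layout: variable name -> (start column, end column),
# 1-based and inclusive.
RECORD_LAYOUT = {
    "CASEID": (1, 4),
    "SEX": (5, 5),
    "AGEGRP": (6, 7),
}

# Hypothetical value labels, of the kind a codebook or data dictionary gives.
VALUE_LABELS = {
    "SEX": {"1": "Male", "2": "Female"},
    "AGEGRP": {"01": "15 to 24", "02": "25 to 34", "03": "35 to 44"},
}


def parse_record(line):
    """Slice one physical record into variables and apply value labels."""
    record = {}
    for name, (start, end) in RECORD_LAYOUT.items():
        raw = line[start - 1:end]  # convert 1-based columns to a Python slice
        # Variables without labels (such as CASEID) keep their raw value.
        record[name] = VALUE_LABELS.get(name, {}).get(raw, raw)
    return record


print(parse_record("0001203"))
```

Without `RECORD_LAYOUT` and `VALUE_LABELS`, the string `0001203` is just a row of digits. With them, it reads as case 0001, a female respondent aged 35 to 44.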

It is important to note that certain information collected in the questionnaire is not available in the data file because Statistics Canada places the utmost importance on protecting the anonymity of respondents and the confidentiality of its data (for example, the respondent’s name and exact address are never included in the microdata file). 

Different types of microdata

Microdata allow researchers to use any variable in the file for analysis.  The properties of microdata are explored in the previous section, About microdata.

With microdata files, researchers can analyse any variable in the file, and can construct the tables they need, rather than choosing from the pre-tabulated information presented in an aggregated file.

There are four types of microdata files: master files; share files; synthetic files; and public use microdata files.

Master files

For each survey conducted by an author division, a master file is constructed which contains all responses by each respondent, recorded in the format specified on the questionnaire.

Only two types of users are permitted to access the master file – the author division (to create extractions for paying clients and for divisional analysts to perform their research) and Research Data Centre (RDC) analysts. Please note: not all master files are available in the RDCs.

When analysis is conducted on master files, the results of the analysis must be vetted through a process called “disclosure analysis” to ensure that they conform to the confidentiality rules established by Statistics Canada.  This is to ensure that no particular respondent is identifiable.

Share files

Chuck Humphrey, University of Alberta, adds that share files “are confidential files in which the participants in the survey have signed a consent form permitting Statistics Canada to allow access to their information for approved research. These files consist of a subset of the cases in the master file.”

Synthetic files

Relatively few researchers can access the master file described above.  Statistics Canada offers an alternative to accessing the master files by creating synthetic files, also known as “dummy files”.  Please note: very few surveys have synthetic files.

Synthetic files are created by the author divisions through reproducing the master file and distorting the data… but what does this mean?  The files provide the full variable structure of the master file, but do not contain any real cases.  So although the file looks like it has real data, the data can never be used to compile actual statistics.

These files exist to offer researchers the opportunity to work with the data file, identify the variables they want to use for analysis, create their system file and get an idea of the frequency counts of the cross-tabulations from the master file.  These are not real counts, but they do provide the user with an idea of whether they want to ask the author division to run the program against the master file.

A more complete explanation is available in the article by Sage Cram, member of the DLI Team at Statistics Canada “What is a Synthetic File?”, published in the DLI Update Fall 2004 Volume 7 issue 1 (PDF).

Remote Job Submission is the primary way of obtaining data using synthetic files.  This dissemination channel is explored in Chuck Humphrey’s Continuum of Access guide that describes the primary dissemination channels at Statistics Canada.

Public use microdata files (PUMFs)

Each Public Use Microdata File is based on a corresponding master data file. The modifications performed by Statistics Canada before the PUMF is released ensure that the risk of breaching confidentiality has been removed. Since the results of any analysis performed do not have to be scrutinized before they are released, the file is considered “Public”.

Modifications made to master files to convert them to PUMFs may include: collapsing of variables (e.g., age groups instead of individual years of age); collapsing variables into one variable (e.g., multiple language questions collapsed into one language variable for analysis); suppressing variables (although the variable is part of the master file, it will not show up in the public file); and removing outliers (removing cases that are extremes - often used with income).

By using these techniques to anonymise the files, combining variables will not result in the user identifying a respondent.
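The kinds of modification listed above can be sketched on a few invented records. Everything in this example (the variable names, the ten-year age groups, the income ceiling) is an assumption made for illustration. It is not the procedure Statistics Canada applies, which is more involved.

```python
# Invented master-file records; names and thresholds are hypothetical.
master = [
    {"age": 23, "postal_code": "A1A1A1", "income": 41_000},
    {"age": 37, "postal_code": "B2B2B2", "income": 58_000},
    {"age": 64, "postal_code": "C3C3C3", "income": 2_500_000},
]

SUPPRESSED = {"postal_code"}  # variables left out of the public file
INCOME_CEILING = 1_000_000    # cases above this are treated as outliers


def age_group(age):
    """Collapse single years of age into ten-year groups, e.g. 37 -> '30-39'."""
    low = (age // 10) * 10
    return f"{low}-{low + 9}"


def to_public(records):
    """Apply collapsing, suppression and outlier removal to each record."""
    public = []
    for rec in records:
        if rec["income"] > INCOME_CEILING:
            continue  # remove outlier cases
        out = {k: v for k, v in rec.items()
               if k not in SUPPRESSED and k != "age"}
        out["age_group"] = age_group(rec["age"])  # collapsed variable
        public.append(out)
    return public


pumf = to_public(master)
```

In this sketch the third record is dropped as an outlier, and the remaining records carry an age group and income but no postal code.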

The DLI has access to some synthetic files and to all PUMFs published by Statistics Canada.


Using Microdata Files       

Microdata files must be used together with the metadata that describes them. Using the metadata, system files are created which allow the user to perform extractions from the data file and make sense of the results.  The steps necessary, or useful, in combining microdata and metadata are listed below.  First, terminology is provided to help you understand the steps used in preparing a microdata file for use.

Terminology

  • Command file: The command file, also known as the data set description, the setup file or the create file, is created to define a microdata file. It is written in a statistical analysis software language (e.g., SPSS, SAS). The command file usually provides the name of the microdata dataset, the variable locations (column locations and decimal declarations), variable names, variable labels, and missing value locations.
  • System file:  When the statistical analysis software package runs the command file against the raw data file, a system file is created, which can be saved. It is format specific to the statistical analysis software package you are using (e.g., SPSS, SAS).

Getting the file ready for use

The following steps should give you an idea of the process used to get microdata files ready for use:

  1. Locate and download the data file (PUMF or synthetic file).
  2. Download the metadata which accompanies the data file. This may include a command file for the software package you wish to use. If a command file is included, you will have to modify it to point to the raw data file where you have downloaded it (after which you should save the revised command file). Skip to step 4.
  3. If the command file is not part of the DLI Collection, it must be created using the file’s record layout, data dictionary, user’s guide, etc.  Although these resources will provide you with the position of each field and the variable labels, you will need to write the syntax needed to run the program (for example, where to find the data set, where to save the dataset, and the commands that let the program recognise the variable fields and labels).  A good tip for the less experienced user is to use an existing command file (either from a previous cycle of the survey or from another survey) and adjust it to meet your needs by replacing the key components.  Another great suggestion is to post a question on the dlilist to see if someone from the DLI Community has created the file already.
  4. Once you run the command file without encountering any errors, you have created your system file. Save the system file in the same directory as the raw data and command files, being sure to assign it a file name that does not overwrite either the raw data or command file. You can start running frequency counts or cross-tabulations.  A very good idea is to check the frequency counts of your system file with those published by Statistics Canada.  If you find differences between them, you may have made an error in your programming.
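The frequency check suggested in step 4 can be sketched as follows. The records, the weights and the `published` counts are invented for illustration. In practice the counts would come from your system file and from the frequency file or data dictionary distributed with the survey.

```python
from collections import defaultdict

# Invented records read from a system file: (sex code, survey weight).
records = [(1, 120.5), (2, 98.0), (2, 101.5), (1, 110.0), (2, 95.0)]

# Hypothetical unweighted counts taken from the published frequency file.
published = {1: 2, 2: 3}


def frequencies(rows):
    """Return unweighted and weighted frequencies for each code."""
    unweighted = defaultdict(int)
    weighted = defaultdict(float)
    for code, weight in rows:
        unweighted[code] += 1
        weighted[code] += weight
    return dict(unweighted), dict(weighted)


unweighted, weighted = frequencies(records)

# Any code whose count differs from the published count is reported.
mismatches = {
    code: (unweighted.get(code, 0), expected)
    for code, expected in published.items()
    if unweighted.get(code, 0) != expected
}
print("Mismatches:", mismatches or "none")
```

An empty `mismatches` dictionary suggests the column positions in the command file are correct. A non-empty one is a prompt to recheck the record layout.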

Check out Chuck Humphrey’s presentation, “Coping with SPSS Syntax Files on the DLI FTP Site”, to answer many of the questions DLI Contacts commonly pose. 


Statistical Analysis Software    

Statistical analysis software is necessary to make microdata files useable. It is used to combine the microdata and metadata (in the form of a command file) to create a system file which can be used for analysis.

About statistical analysis software

Statistical analysis software is a comprehensive system for analysing data.  It can take data from almost any type of file and use them to generate tabulated reports, charts, and plots of distribution and trends, descriptive statistics and complex statistical analyses.

Different types of statistical analysis software

There are many different types of statistical software packages available. Three of the more common ones are SPSS (Statistical Package for the Social Sciences), SAS (Statistical Analysis System) and STATA. These three packages are general purpose statistical software packages, are command-based, and are available for Windows, Macintosh, and UNIX operating systems.

Michelle Edwards, University of Guelph, created an excellent tutorial titled “SPSS, STATA, and SAS: Flavours of Statistical Software”, which identifies the differences between these statistical analysis software packages.

Converting to other formats

Options are available if a researcher wishes to use the data with software different from the currently available system file.

Stat/Transfer is a great program for converting files from one statistical analysis software format to another.  This allows, for example, an SPSS file to be read by SAS and vice versa.  It is not expensive and is very user friendly.

If the currently available system file is stored in SPSS, and you are running SPSS version 14 or higher, you can save directly from SPSS to other statistical formats (SAS, Stata) using the SAVE AS command from the FILE menu.