Section 5: Data concepts

  1. About Data
  2. Statistical methods
  3. Metadata
  4. Aggregate Data
  5. Microdata
  6. Access to PUMFs through DLI
  7. Classification systems used at Statistics Canada

About Data

The difference between statistical information, statistics and data

As a data professional, it is important to understand the differences between statistical information, statistics and data. These concepts were clearly articulated in Chuck Humphrey's 2004 DLI Orientation presentation, "A Framework for Thinking about Statistical Information" and are still in use today. Most of the following content heavily borrows from Humphrey's presentations on this subject.

Statistical Information

Statistical information can be described as the added value arising from the interpretation of statistics or data. This information will often be in the form of some sort of analysis, such as an article in Health Reports.

Statistics

Statistics are the numeric facts and figures which have been created from the data. They have been processed and are ready for use, but do not have the same kind of analysis behind them as statistical information does. These can take the form of e-publications, e-tables or databases.

Data

Data are numeric files created and organized for processing and analysis. There are two types of data: aggregate data and microdata. Both offer the user more control over the variables used for analysis than published statistics do. Further detail is available in the "Aggregate Data" and "Microdata" sections below.


Statistical methods

Statistics Canada produces statistics that help Canadians better understand their country: its population, resources, economy, society and culture. In addition to conducting a Census every five years, the agency runs about 350 active surveys on virtually all aspects of Canadian life.

Statistics Canada is responsible for providing Canadians with reliable and comprehensive data and for giving governments, businesses, unions and not-for-profit organizations the information they need to maintain an open and democratic society.

For close to 100 years, Statistics Canada has repurposed data collected by other organizations to enhance the decision-making capability of governments and communities. These data support the creation of statistical outputs which bring evidence to policy and decision making, and save money and time.

Statistics Canada uses administrative data for statistical purposes only to complement survey data, or in lieu of surveys and to support statistical operations. Using administrative data means the agency is able to improve data quality, meet new and ongoing information needs, reduce data collection costs and save time for Canadians who respond to our surveys.

For more information regarding statistical concepts, methods, and survey design see the following resources:

Survey Methodology (12-001-X)
The journal publishes articles dealing with various aspects of statistical development relevant to a statistical agency, such as design issues in the context of practical constraints, use of different data sources and collection techniques, total survey error, survey evaluation, research in survey methodology, time series analysis, seasonal adjustment, demographic studies, data integration, estimation and data analysis methods, and general survey systems development. The emphasis is placed on the development and evaluation of specific methodologies as applied to data collection or the data themselves.

Survey Methods and Practices (12-587-X)
This publication shows readers how to design and conduct a census or sample survey. It explains basic survey concepts and provides information on how to create efficient and high quality surveys. It is aimed at those involved in planning, conducting or managing a survey and at students of survey design courses.

Statistics: Power from Data! (12-004-X)
Statistics: Power from Data! was created in 2001 to assist students and teachers in getting the most from statistics. This web resource was published primarily for secondary students of Mathematics and Information Studies, although it was used by other students, teachers and the general population. It was last updated in 2011.

Statistical Methods
Information and sources that describe methods of a statistical and mathematical nature used in gathering, processing and disseminating sample surveys, censuses or administrative data.


Metadata

About metadata

Metadata is the documentation that accompanies and assists users in the interpretation of microdata, aggregate data and geographic files. The information usually includes the definition of variables and description of their classification schemes, the description of the methodology used in collecting, processing and analyzing the data, and information on the accuracy of the data.

Different types of metadata

Metadata can consist of many different documents, including those found in Statistics Canada's Definitions, Data Sources and Methods (formerly known as the Integrated Metadata Base or IMDB): survey questionnaires, instructions to interviewers, codebook, user's guide, record layout, data dictionary, frequency file, CV tables, etc. Please note that codebooks, record layouts, user guides and data dictionaries have overlapping properties.

  • Statistics Canada's Definitions, data sources and methods: If it is not already, this should be one of your bookmarked sites. The Definitions, data sources and methods section of the Statistics Canada site includes quick descriptions of information pertinent to the survey as well as each survey's status, frequency, questionnaire and reporting guide, description, data sources, methodology, data accuracy, target population, instrument design, sampling, error detection, imputation, estimation, quality evaluation, and disclosure control.
  • Questionnaire: This tool is helpful to assess the questions posed to the respondent and how the questions were formulated. It is very important to researchers who may have to go to the RDC: if a question is asked in the questionnaire but not reported on the PUMF, access to the variable is only available through the RDC program or, in some cases, by requesting a custom tabulation. Keep in mind that for the PUMFs, responses to some questions may not be reported directly but may be used to create the derived variables appearing in the PUMF. The questionnaire also gives context to the question ("Was the question posed the way I thought it was?"). Note: interviewer instructions are commonly included in the questionnaire.
  • Interviewer instructions: Interviewer instructions indicate how the data were collected and also show the skip patterns in the questionnaire (which helps explain why the population for certain variables may be lower than the total population). Other instructions can facilitate the interpretation of the data as well.
  • User's guide: The user's guide contains information to help the user interpret the survey data. It has overlapping properties with the data dictionary, record layout and codebook as it often contains all the documentation pertaining to a survey (such as the sampling methodology, population sampled, variable descriptions, position, labels, etc.).
  • Codebook: A codebook is a generic term often used to describe the user's guide, record layout and data dictionary or combinations of these documents. In its earliest usage, the codebook contained the rules for assigning numeric codes to the responses for questionnaire items. However, as applied by Statistics Canada recently (in that the data dictionary normally is assigned a "_cbk" extension), it typically provides variable-specific metadata - question text, response values, missing value declarations, variable universe, etc.
  • Record layout: The record layout provides variable names, column positions in the data file, and number of decimals. It is often distributed in .xls format and can therefore be exported to ASCII and used to create SPSS/SAS/Stata command files. Similar to the codebook, it can provide variable breakdowns and the codes for the responses.
  • Data dictionary: The data dictionary is an excellent source to find general information about the variables in a survey, the codes for variables, missing value assignments, and frequency counts. This document has overlapping properties with the codebook, user's guide and record layout.
  • Frequency file: The frequency file contains a list of the frequencies for the responses in the dataset, that is, the number of respondents who responded to each of the possible answers for a question. Some variables are continuous and are not included in the frequency file (e.g., the weight variable). This file may also include weighted and unweighted frequencies.
  • CV tables: In order to assess the quality of data, many surveys provide the CV tables, which are the coefficients of variation. These can be simple tables, but some surveys offer bootstrap weights to calculate these in a different way. CV tables are also referred to as variability tables.
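
As a rough illustration of what a frequency file and a CV table report, the following Python sketch tallies unweighted and weighted frequencies for one coded variable and expresses a coefficient of variation as a percentage. The records, codes, weights and function names are all hypothetical and are not drawn from any Statistics Canada file; the CV is computed with the usual definition (standard error divided by the estimate), and the standard error is simply supplied rather than derived from bootstrap weights.

```python
from collections import defaultdict

# Hypothetical respondent records: (response code, survey weight).
# The codes and weights are invented for illustration only.
records = [
    (1, 150.0),
    (2, 200.0),
    (1, 100.0),
    (2, 250.0),
    (1, 300.0),
]


def frequencies(rows):
    """Return (unweighted, weighted) frequency counts keyed by response code."""
    unweighted = defaultdict(int)
    weighted = defaultdict(float)
    for code, weight in rows:
        unweighted[code] += 1       # number of respondents giving this answer
        weighted[code] += weight    # estimated population giving this answer
    return dict(unweighted), dict(weighted)


def coefficient_of_variation(estimate, standard_error):
    """Express the standard error as a percentage of the estimate."""
    return 100.0 * standard_error / estimate


unweighted, weighted = frequencies(records)
print(unweighted)
print(weighted)
print(coefficient_of_variation(200.0, 10.0))
```

A lower CV indicates a more reliable estimate; the thresholds used to judge an estimate are given in each survey's own documentation.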

Aggregate Data

About aggregate data

Aggregate data are statistical summaries organized in a specific data file structure which permits further computer analysis (that is, data processing). Aggregate data are produced to provide access to data that cannot be released as microdata, such as the surveys based on the Business Registry in Statistics Canada, and to organize statistics into data tables.

The variables in an aggregate data file do not lend themselves to generating cross-tabulations of individuals since the initial unit of observation has been replaced by time, geography or a social construct.

Not all aggregate data contain the combination of variables from the microdata that a user may desire. For example, a researcher may be looking at whether alcohol use and gambling are correlated and wishes to know if these variables differ between men and women, by age group, and whether the results vary across Canada. Although data in the Canadian Community Health Survey (CCHS) cycle 2013-2014 are collected about the respondent's geography, gender, age, Problem gambling severity index, and alcohol use, this combination of variables may not have been used in creating an aggregate data product.

Different types of aggregate data

Aggregate data are delivered in a variety of formats, including CANSIM, Beyond 20/20, spreadsheets and databases. Increased availability of GIS software has created greater demand for Census statistics organized as aggregate data.

CANSIM offers time series data. It is an excellent source for social and economic data. It provides the data in many formats, including XML (SDMX-ML) as well as comma-separated values (CSV) which have become an important format often manipulated and analyzed using spreadsheets.
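
A time series delivered as CSV can also be read programmatically rather than in a spreadsheet. The sketch below uses Python's standard csv module on a small inline sample; the column names (REF_DATE, GEO, VALUE) and the values are assumptions made for illustration and may not match the layout of an actual download.

```python
import csv
import io

# Hypothetical CSV extract; column names and values are invented.
sample_csv = """REF_DATE,GEO,VALUE
2015-01,Canada,100.0
2015-02,Canada,101.5
2015-03,Canada,103.0
"""


def read_series(text):
    """Parse CSV text into a list of (period, value) pairs."""
    reader = csv.DictReader(io.StringIO(text))
    return [(row["REF_DATE"], float(row["VALUE"])) for row in reader]


def period_changes(series):
    """Return the change from the previous period for each period after the first."""
    return [
        (series[i][0], series[i][1] - series[i - 1][1])
        for i in range(1, len(series))
    ]


series = read_series(sample_csv)
changes = period_changes(series)
print(changes)
```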

Beyond 20/20 is a free Windows based software product that allows the user to manipulate a pivot table to create and reshape a data file. Statistics Canada uses Beyond 20/20 to disseminate many of its aggregate data products. For example, the Census of Population and the Census of Agriculture make excellent use of Beyond 20/20 to disseminate data on the Statistics Canada website.

The DLI Collection has a few products (e.g., Canadian Business Patterns, Survey of Innovation, etc.) that use Beyond 20/20 as navigation software. These products are available through the DLI's Web Data Server (WDS), a web-based multidimensional table viewer which allows for the dissemination of data over the web in a variety of formats.

Some aggregate data are available in a format directly usable by spreadsheets and databases. The DLI Collection holds a number of products in these formats and some examples are: Justice Statistics, Education tables, and Small Area and Administrative Data. Databases are not as common in the DLI Collection.


Microdata

About microdata

Microdata consist of the data directly observed or collected from a specific unit of observation. That is, a microdata file contains organized raw data in which each line represents a specific unit of observation (usually an individual, household or family) and the columns contain the values of variables.

When Statistics Canada conducts a survey, it collects information from each unit of observation (e.g., individual, household, etc.). It processes these answers by coding them using a specific number to identify the respondent's answer. For example, Statistics Canada often uses a "1" to represent males and a "2" to represent females. The microdata file is created by coding and electronically recording each survey respondent's responses to all relevant questions.

A microdata file consists of rows of numbers and letters; each row represents one respondent's responses to the questionnaire. The file holds one logical record per respondent, where the logical record includes all responses made by a single respondent to the questionnaire. Each logical record consists of one or more physical records (lines of data); typically, Statistics Canada files use one physical record to describe one logical record. Since the variables are coded (rather than readable as text), the values are not revealing in themselves, and the metadata must be used to interpret the data file.
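
To make the relationship between a coded record and its metadata concrete, here is a minimal Python sketch that slices one physical record using a record layout and then labels the values using a codebook. The variable names, column positions and the age group codes are hypothetical; only the convention of "1" for males and "2" for females comes from the text above.

```python
# Hypothetical record layout: variable name -> (start column, end column),
# using 1-based inclusive column positions as record layouts usually do.
layout = {
    "CASEID": (1, 4),
    "SEX": (5, 5),
    "AGEGRP": (6, 7),
}

# Hypothetical codebook: variable name -> {code: label}.
codebook = {
    "SEX": {"1": "Male", "2": "Female"},
    "AGEGRP": {"01": "15 to 24", "02": "25 to 34"},
}


def parse_record(line, layout):
    """Slice one physical record into raw coded values."""
    return {name: line[start - 1:end] for name, (start, end) in layout.items()}


def decode(record, codebook):
    """Replace codes with labels; values without a codebook entry are kept as-is."""
    return {
        name: codebook.get(name, {}).get(value, value)
        for name, value in record.items()
    }


raw_line = "0001202"  # one respondent, one physical record
record = parse_record(raw_line, layout)
labelled = decode(record, codebook)
print(record)
print(labelled)
```

Without the layout and codebook, the string "0001202" carries no recoverable meaning, which is why the data file and its metadata have to travel together.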

It is important to note that certain information collected in the questionnaire is not available in the data file because Statistics Canada places the utmost importance on protecting the anonymity of respondents and the confidentiality of its data (for example, the respondent's name and exact address are never included in the microdata file).

Different types of microdata

With microdata files, researchers can analyse any variable in the file and can construct the tables they need, rather than choosing from the pre-tabulated information presented in an aggregate file. The properties of microdata are explored in the previous section, About microdata.
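
The following Python sketch shows, under simplified assumptions, what constructing a table from microdata looks like: a cross-tabulation of two variables chosen by the analyst. The respondent records and variable names (SEX, ALCOHOL) are invented for illustration, and survey weights are ignored to keep the example short.

```python
from collections import Counter

# Hypothetical microdata: one dictionary per respondent, values already labelled.
respondents = [
    {"SEX": "Male", "ALCOHOL": "Yes"},
    {"SEX": "Female", "ALCOHOL": "No"},
    {"SEX": "Female", "ALCOHOL": "Yes"},
    {"SEX": "Male", "ALCOHOL": "Yes"},
    {"SEX": "Female", "ALCOHOL": "No"},
]


def crosstab(rows, row_var, col_var):
    """Count respondents for every observed combination of two variables."""
    return Counter((r[row_var], r[col_var]) for r in rows)


table = crosstab(respondents, "SEX", "ALCOHOL")
for (sex, alcohol), count in sorted(table.items()):
    print(sex, alcohol, count)
```

Any other pair of variables in the file could be passed to the same function, which is the flexibility an aggregate table does not offer.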

There are four types of microdata files: master files, share files, synthetic files and public use microdata files.

Master files

For each survey conducted by an author division, a master file is constructed which contains all responses by each respondent, recorded in the format specified on the questionnaire.

Only two types of users are permitted to access the master file – the author division (to create extractions for paying clients and for divisional analysts to perform their research) and Research Data Centre (RDC) analysts. Please note: Not all master files are available in the RDCs.

When analysis is conducted on master files, the results of the analysis must be vetted through a process called "disclosure analysis" to ensure that it conforms to the confidentiality rules established by Statistics Canada. This is to ensure that no particular respondent is identifiable.

Share files

Share files are confidential files in which the participants in the survey have signed a consent form permitting Statistics Canada to allow access to their information for approved research. These files consist of a subset of the cases in the master file. Access to share files may be granted to specific government departments without the need for their researchers to work within a Research Data Centre.

Synthetic files

Relatively few researchers can access the master file described above. Statistics Canada offers an alternative to accessing the master files by creating synthetic files, also known as "dummy files". Please note: very few surveys have synthetic files.

Synthetic files are created by the author divisions through reproducing the master file and distorting the data… but what does this mean? The files provide the full variable structure of the master file, but do not contain any real cases. So although the file looks like it has real data, the data can never be used to compile actual statistics.

These files exist to offer researchers the opportunity to work with the data file, identify the variables they want to use for analysis, create their system file and get an idea of the frequency counts of the cross-tabulations from the master file. These are not real counts, but they do provide the user with an idea of whether they want to ask the author division to run the program against the master file.

Public use microdata files (PUMFs)

Statistics Canada has about 350 active surveys, covering households, institutions, businesses and administrative data. The creation of public use microdata files is possible for areas where there are sufficient members of the universe to allow the identity of the respondent to be masked. Thus it is easier to produce such a file for the household and individual sector than it is for businesses.

Each Public Use Microdata File is based on a corresponding master data file. The modifications performed by Statistics Canada before the PUMF is released ensure that the risk of breaching confidentiality has been removed. Since the results of any analysis performed do not have to be scrutinized before they are released, the file is considered "public."

Modifications made to master files to convert them to PUMFs may include: collapsing categories within a variable (e.g., age groups instead of individual years of age); collapsing several variables into one variable (e.g., multiple language questions collapsed into one language variable for analysis); suppressing variables (although the variable is part of the master file, it will not show up in the public file); and removing outliers (removing cases that are extremes, often used with income).
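
The kinds of modification listed above can be sketched in a few lines of Python. This is only a toy illustration and not Statistics Canada's actual procedure: the records, the age bands, the income limit and the function names are all hypothetical.

```python
# Hypothetical master-file records; every value here is invented.
master_records = [
    {"NAME": "Respondent A", "AGE": 23, "INCOME": 42000},
    {"NAME": "Respondent B", "AGE": 37, "INCOME": 58000},
    {"NAME": "Respondent C", "AGE": 41, "INCOME": 2500000},
]


def age_group(age):
    """Collapse single years of age into broad groups (hypothetical bands)."""
    if age < 25:
        return "Under 25"
    if age < 45:
        return "25 to 44"
    return "45 and over"


def make_public_records(records, income_limit):
    """Apply three illustrative modifications to each record."""
    public = []
    for rec in records:
        if rec["INCOME"] > income_limit:
            continue  # removing outliers: drop extreme cases
        public.append({
            "AGEGRP": age_group(rec["AGE"]),  # collapsing: age in groups
            "INCOME": rec["INCOME"],
            # suppressing: NAME is deliberately not copied to the public record
        })
    return public


public_records = make_public_records(master_records, income_limit=250000)
print(public_records)
```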

By using these techniques to anonymize the files, combining variables will not result in the user identifying a respondent.

Once a project team has created a public use file the output must be reviewed by the Microdata Release Committee. This committee reviews all the steps and measures taken by the team in the creation of the file. This is one of the reasons there can be a time lag between the announcement of the availability of survey results in the Daily and the actual release of a public use microdata file.

Once approval is received, copies of the file and accompanying documentation can be prepared and offered to the public. If the availability of these data has not already been announced in the Daily, then an announcement must be made before the public file can be disseminated. If a data availability announcement has already been made, then it is not necessary to announce the availability of the public file. However, many divisions will make another announcement so that users will be aware that they can now access these data through another medium.

The DLI has access to synthetic files and to all PUMFs published by Statistics Canada.


Access to PUMFs through DLI

Once the file is released for public dissemination it is given to the DLI Unit. The Unit then performs some checks and verification to ensure that the data and the documentation they have received are consistent with each other. Once the DLI Unit has performed these checks the file is then mounted on the DLI EFT site and an announcement is made on the dlilist.

As mentioned, the DLI Unit performs some basic checks of the data and the metadata. Some of the aspects checked include: verifying that all components of the data and metadata have been received; verifying the record length as described by the metadata; verifying the number of records on the file to ensure proper transfer of data from the author division; etc. If the Unit finds a problem with the data and/or the documentation during this process, they may have to contact the author division for clarification. Some problems can be answered immediately while others may require more time for the author division to rectify. For more information, see Michael Sivyer's article "I read it in 'The Daily'...", republished in the fall 2013 issue (Volume 14, Issue 2).
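
Checks of this kind are straightforward to reproduce on a received file. The Python sketch below compares each record's length and the total number of records against the values stated in the metadata; the function name, the sample lines and the expected values are hypothetical and do not describe the DLI Unit's actual tooling.

```python
def check_transfer(lines, expected_length, expected_count):
    """Compare a data file's records against the record length and count in the metadata."""
    bad_lines = [
        number
        for number, line in enumerate(lines, start=1)
        if len(line.rstrip("\r\n")) != expected_length  # ignore line endings
    ]
    return {
        "count_ok": len(lines) == expected_count,
        "bad_length_lines": bad_lines,
    }


# Hypothetical file contents: the third record is one character short.
sample_lines = ["0001202\n", "0002101\n", "000310\n"]
report = check_transfer(sample_lines, expected_length=7, expected_count=3)
print(report)
```

A non-empty list of bad lines, or a record count that does not match, is the sort of discrepancy that would be referred back to the author division.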

As part of an international standardization effort related to social science data, the DLI is currently preparing DDI-compliant XML-based survey files and making them available through Nesstar. As the process is predominantly manual, there is a delay between releasing a file on the DLI EFT and making it available through Nesstar.


Classification systems used at Statistics Canada

Statistics Canada uses standard classifications to facilitate the use of common characteristics or variables across surveys and administrative databases. Classifications may be updated so users should choose the appropriate version years for their needs. Concordances between versions are available to compare changes in classifications over time.

  • Industry
    • The industry classifications include the North American Industry Classification System (NAICS) Canada and the Standard Industrial Classification (SIC).
  • Product
    • The product classifications include the North American Product Classification System (NAPCS) Canada, Canadian Export Classification (CEC), Customs Tariff (CT), Standard Classification of Transported Goods (SCTG), and Standard Classification of Goods (SCG).
  • Chart of Accounts: Financial position and performance of a business
    • The standard financial reporting classification currently in use at Statistics Canada is the Chart of Accounts (COA) Canada 2006.
  • Institutional units and sectors
    • The Canadian Classification of Institutional Units and Sectors, 2012 is based on the international version published in the System of National Accounts, 2008 (2008 SNA). The 2008 SNA is the latest international standard for compiling national accounts statistics.
  • Occupation
    • The occupation classifications include the National Occupational Classification (NOC), National Occupational Classification - Statistics (NOC-S), and Standard Occupational Classification (SOC).
  • Instructional programs
    • The instructional program classifications include the Classification of Instructional Programs (CIP) Canada and Major Field of Study (MFS).
  • Geography
    • The geographic classifications include the Standard Geographical Classification (SGC) as well as other classifications of Canada. Also included are the classifications of countries and areas of interest for the world.

The full listing of classifications is available by visiting the Definitions, data sources and methods section of Statistics Canada's website and consulting the sub-section entitled "Statistical Classifications".
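
A concordance between two versions of a classification can be thought of as a mapping from each old code to one or more new codes. The Python sketch below illustrates the idea with invented codes; they are not real NAICS, NOC or SGC codes, and the structure shown is an assumption rather than the format of a published concordance table.

```python
# Hypothetical concordance: code in the older version -> codes in the newer version.
concordance = {
    "A10": ["A10"],          # unchanged
    "B20": ["B21", "B22"],   # one category split into two
    "C30": ["C35"],          # two categories merged into one
    "C31": ["C35"],
}


def convert(code, concordance):
    """Return the newer-version codes for an older code (empty list if unknown)."""
    return concordance.get(code, [])


def changed_codes(concordance):
    """List the older codes that do not map one-to-one onto themselves."""
    return sorted(old for old, new in concordance.items() if new != [old])


print(convert("B20", concordance))
print(changed_codes(concordance))
```

Codes that split or merge between versions are the ones that need care when comparing data over time.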
