5. Data processing

5.1 Data capture

Responses to survey questions were captured directly by the interviewer at the time of the interview using a computerized questionnaire. A computerized questionnaire reduces processing time and costs associated with data entry, transcription errors and data transmission.

Some editing of data was done directly at the time of the interview. Specifically, where a particular response appeared to be inconsistent with previous answers or outside of expected values, the interviewer was prompted, through message screens on the computer, to confirm answers with the respondent, and, if needed, to modify the information.

5.2 Social survey processing steps

Data processing involves a series of steps to convert the electronic questionnaire responses from their initial raw format into a high-quality, user-friendly database containing a comprehensive set of variables for analysis. A series of data operations is executed to clean the files of inadvertent errors, rigorously edit the data for consistency, code open-ended questions, create useful variables for data analysis and, finally, systematize and document the variables for ease of analytical use.

The 2012 APS used a new set of social survey processing tools developed at Statistics Canada called the “Social Survey Processing Environment” (SSPE). The SSPE involves SAS software programs, custom applications and manual processes for performing the following systematic steps (a schematic sketch of the sequence follows the list):

Processing steps:

  • Receipt of raw data
  • Clean up
  • Recodes
  • Flows
  • Coding
  • Edits and imputations
  • Derived variables
  • Creation of final processing file
  • Creation of dissemination files
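
As an illustration only, the sequential structure of these steps can be sketched in Python; the SSPE itself is built from SAS programs, custom applications and manual review, and every step function below is a hypothetical placeholder.

# Illustrative sketch: each step takes the full set of records and
# returns the (possibly reduced or augmented) set for the next step.
def run_sspe(records, steps):
    """Apply each processing step in its documented order."""
    for name, step in steps:
        records = step(records)
        print(f"completed: {name} ({len(records)} records)")
    return records

# Placeholder wiring; real steps would clean, recode, code and edit.
steps = [
    ("receipt of raw data", lambda recs: recs),
    ("clean up", lambda recs: recs),
    ("recodes", lambda recs: recs),
]
final_records = run_sspe([{"record_id": 1}], steps)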

5.3 Receipt of raw data and record clean up

Following receipt of the raw data from the electronic questionnaire applications, a number of preliminary cleaning procedures were implemented for the 2012 APS at the level of individual records. These included the removal of all personal identifier information, such as names and addresses, from the files, as part of a rigorous set of ongoing mechanisms for protecting the confidentiality of respondents. Duplicate records were also resolved at this stage. Finally, clean-up procedures included a review of all respondent records to ensure that each respondent was “in scope” and had a sufficiently complete questionnaire. (Note that the criteria for determining whether or not a respondent was in scope were applied before any editing or imputation was done.) The specific criteria for determining who would, and who would not, be a final APS respondent are provided below.
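
As an illustration of these record-level clean-up operations, the following Python sketch strips personal identifiers and removes duplicates; all field names are hypothetical, and duplicate “resolution” is simplified to keeping the first occurrence of each record identifier.

# Hypothetical sketch of record clean-up on a list of dict records.
PII_FIELDS = {"name", "address", "phone_number"}

def clean_records(raw_records):
    """Drop personal identifiers and duplicate records."""
    seen_ids = set()
    cleaned = []
    for rec in raw_records:
        if rec["record_id"] in seen_ids:  # resolve duplicates (keep first)
            continue
        seen_ids.add(rec["record_id"])
        # remove personal identifier information for confidentiality
        cleaned.append({k: v for k, v in rec.items() if k not in PII_FIELDS})
    return cleaned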

5.3.1 Definition of a respondent

  • To be “in scope”, respondents must have been at least 6 years of age as of February 1, 2012, and must have met at least one Aboriginal identity criterion (see section 2.2 for the complete criteria).
  • To have a “complete” questionnaire, respondents aged 6 to 14 must have provided valid responses (i.e., not “Don’t know” or “Refused”) to specified key questions in the areas of education or health.
  • To have a “complete” questionnaire, respondents aged 15 and older must have provided valid responses (i.e., not “Don’t know” or “Refused”) to specified key questions in either the area of education, or the areas of labour and health.

Records that did not meet the above criteria were removed from the database. Under these rules, all “partial” respondents, who were in scope according to the first criterion but did not fulfill the content-completion requirements of the second or third criterion, were among those removed from the final database. Please refer to section 6.4 of this document for more information on partial respondents.
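
Expressed as a simple decision rule, the definition amounts to the sketch below; the field names and the single “key question” checks per area are hypothetical simplifications of the actual criteria.

# Hypothetical sketch of the final-respondent filter.
INVALID = {"Don't know", "Refused"}

def is_final_respondent(rec):
    """True when a record is in scope and sufficiently complete."""
    # In scope: at least 6 years old as of February 1, 2012, and at
    # least one Aboriginal identity criterion met (section 2.2).
    if rec["age_feb_2012"] < 6 or not rec["meets_identity_criterion"]:
        return False

    def valid(answer):
        return answer not in INVALID

    if rec["age_feb_2012"] <= 14:
        # ages 6 to 14: key education or health questions answered
        return valid(rec["key_education"]) or valid(rec["key_health"])
    # ages 15 and older: key education questions, or both labour and health
    return valid(rec["key_education"]) or (
        valid(rec["key_labour"]) and valid(rec["key_health"]))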

5.4 Variable recodes and multiple response questions

This stage of processing involved changes at the level of individual variables. Variables could be dropped, recoded, resized or left as is. Formatting changes were intended to facilitate processing as well as analysis of the data by end users. One such change was the conversion of multiple-response (“mark all that apply”) questions into corresponding sets of single-response variables, which are easier to use. For each response category associated with the original question, a variable was created with yes/no response values. An example is provided below, followed by a sketch of the conversion.

Original multiple-response question:

ED4_Q11A - What were the reasons you did not finish your postsecondary education?

  • INTERVIEWER: Mark all that apply.
  • 01 Pregnant/Caring for own child(ren)
  • 02 Other family responsibilities
  • 03 Own illness / Disability
  • 04 Financial reasons (not enough money)
  • 05 Lost interest / Lack of motivation
  • 06 Got a job / Wanted to work
  • 07 Too old or too late now
  • 08 Courses too hard / Bad results
  • 09 Too difficult to be away from home
  • 10 Prejudice and racism
  • 11 Moved
  • 12 Other - Specify
  • DK, RF

Final variables in single-response yes/no format:

ED4_Q11AA - What were the reasons you did not finish your postsecondary education?
- Pregnant/Caring for own child(ren)

  • 1 Yes
  • 2 No
  • DK, RF

ED4_Q11AB - What were the reasons you did not finish your postsecondary education?
- Other family responsibilities

  • 1 Yes
  • 2 No
  • DK, RF

ED4_Q11AC - What were the reasons you did not finish your postsecondary education?
- Own illness / Disability

  • 1 Yes
  • 2 No
  • DK, RF

...additional yes/no variables follow for each remaining response category, from “Financial reasons (not enough money)” to “Moved”, up to and including the last category:

ED4_Q11AL - What were the reasons you did not finish your postsecondary education?
- Other - Specify

  • 1 Yes
  • 2 No
  • DK, RF
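
The conversion itself is mechanical: each response category of the original question becomes its own yes/no variable. A minimal sketch follows, using the ED4_Q11A example; the helper name is hypothetical, and handling of “DK, RF” on the original question is omitted for brevity.

# Map categories 01..12 to variable suffixes A..L (ED4_Q11AA..ED4_Q11AL).
CATEGORY_SUFFIXES = "ABCDEFGHIJKL"

def split_multiple_response(selected_codes):
    """selected_codes: set of two-digit codes marked by the interviewer,
    e.g. {"01", "04"}. Returns one yes/no value per category (1=Yes, 2=No)."""
    out = {}
    for i, letter in enumerate(CATEGORY_SUFFIXES, start=1):
        out[f"ED4_Q11A{letter}"] = 1 if f"{i:02d}" in selected_codes else 2
    return out

# A respondent who marked "01" and "04":
# split_multiple_response({"01", "04"})["ED4_Q11AA"] -> 1 (Yes)
# split_multiple_response({"01", "04"})["ED4_Q11AB"] -> 2 (No)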

5.5 Flows: response paths, valid skips and question non-response

Another set of data processing procedures for the 2012 APS was the verification of questionnaire flows, or skip patterns. All response paths and skip patterns built into the questionnaire were verified to ensure that the universe, or target population, for each question was accurately captured during processing. Special attention was paid to the distinction between valid skips and non-response, which is important for statistical analysis. These concepts are explained below to help users better understand question universes as well as statistical outputs for APS survey variables.

Response – an answer directly relevant to the content of the question that can be categorized into pre-existing answer categories, including “Other-specify”.

Valid skip – indicates that the question was skipped because it did not apply to the respondent’s situation, as determined by valid answers to a previous question.  In such cases, the respondent is not considered to be part of the target population or universe for that question. As noted below, where a question was skipped due to an undetermined path (that is, a “Don’t know” or “Refusal” to a previous question caused the skip), the respondent is coded to “Not stated” for that question.
 
Don’t know – the respondent was unable to provide a response for one or more reasons (due to lack of recall, or because they were responding for someone else, for example).

Refusal – the respondent refused to respond, perhaps due to the sensitivity of the question.

Not stated – this indicates that the question response is missing and there is an undetermined path for the respondent, such as when a respondent did not answer the previous filter question or where an inconsistency was found in a series of responses.

Special codes have been assigned to each of these response types to facilitate user recognition and data analysis. “Valid skip” codes end in “6”, with any preceding digits set to “9” (for example, “996” for a 3-digit variable). “Don’t know” codes end in “7” (for example, “997”), refusals end in “8” (for example, “998”) and “Not stated” values end in “9” (for example, “999”), in each case with any preceding digits set to “9”.
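
This scheme can be expressed compactly, as in the sketch below, which generates the reserved code for a variable of a given width:

# Reserved codes: width-1 nines followed by the designated last digit.
LAST_DIGIT = {"valid_skip": "6", "dont_know": "7",
              "refusal": "8", "not_stated": "9"}

def reserved_code(kind, width):
    """reserved_code("valid_skip", 3) -> "996"."""
    return "9" * (width - 1) + LAST_DIGIT[kind]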

5.6 Coding

5.6.1 “Other-specify” items

Data processing also includes the coding of “Other-specify” items, also referred to as “write-in responses”.  For most questions on the APS questionnaire, pre-coded answer categories were supplied and the interviewers were trained to assign a respondent’s answers to the appropriate category. However, in the event that a respondent’s answer could not be easily assigned to an existing category, many questions also allowed the interviewer to enter a long-answer text response in the “Other-specify” category.

All questions with “Other-specify” categories were closely examined during processing. Coding guidelines were developed for each question based on a qualitative review of the types of text responses given. Following these guidelines, many of the long answers were recoded into one of the pre-existing categories, while responses that were unique and qualitatively different from the existing categories were kept as “Other”. For some questions, one or more new categories were created when there were sufficient numbers of responses to warrant them. For questions where “Other-specify” responses constituted less than 5% of overall responses, coding was not performed and the responses were left in “Other”.
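
The 5% screening rule can be sketched as follows (the counts are hypothetical inputs):

def needs_coding_review(other_specify_count, total_responses):
    """True when write-in responses reach the 5% threshold for a question."""
    if total_responses == 0:
        return False
    return other_specify_count / total_responses >= 0.05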

Approximately 58,000 responses across 78 questionnaire items were recorded under “Other-specify” and reviewed for coding. Appendix B summarizes the extra categories added for the 2012 APS. These will be taken into account when refining the answer categories for future cycles of the survey.

5.6.2 Open-ended questions and standard classifications

A few questions on the 2012 APS questionnaire were recorded by interviewers in a completely open-ended format.  These included questions related to the respondent’s occupation and industry of work as well as their major field of post-secondary study, where applicable. These responses were coded using a combination of automated and interactive coding procedures. Standardized classification systems were used to code these responses.  Appendix C provides details of these classifications.

A standardized classification was also used to code the Aboriginal languages that respondents spoke or understood, as well as the first language learned in childhood. For languages, interviewers were provided with a comprehensive drop-down menu of languages to choose from, but write-in responses were also captured as needed. Overall, 51 Aboriginal language categories were used to code APS language data. For details on the classification system used for Aboriginal languages, see Appendix C.

Coding for all classifications was carried out by experienced coders and was subject to quality control as well as additional verification procedures during processing.

5.7 Edit and imputation

After the coding stage of processing, a series of customized edits was performed on the data. These consisted of validity checks within and across variables to identify gaps, inconsistencies, extreme outliers and other problems in the data. To resolve the problematic data identified by the edits, corrections were applied based on logical edit rules. In some cases, missing or inconsistent values were replaced with corresponding data taken from the respondent’s answers to the National Household Survey (NHS); this replacement is referred to as imputation.

An example of a validity check within a single question involves the housing variable on the number of rooms in the dwelling, for which the interviewer could record up to 95 rooms. To remove outlier responses suspected of being invalid, an edit ensured that the reported number of rooms did not exceed an upper limit of 20. Many consistency edits were also performed across questions, for example in relation to education variables, to avoid contradictions in education profiles. A person who had not reported ever attending a specific type of post-secondary institution, such as a university, trade school, college, CEGEP or other non-university institution, but who subsequently reported currently working toward a certificate, diploma or degree from one of these institutions, was assumed to have attended that type of institution. The response to the earlier question was changed from “no” to “yes” for the specific type of institution where the edit was required.
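
In sketch form, these two edits might be written as follows; the field names are hypothetical, and the resolution of an invalid room count is simplified here to setting the value to the reserved “not stated” code, since the actual correction rule is not detailed above.

NOT_STATED = 99  # two-digit reserved code, per section 5.5

def apply_example_edits(rec):
    """Apply simplified versions of the two edits described above."""
    # Validity edit within one question: room counts above the upper
    # limit of 20 are treated as invalid outliers.
    if rec["num_rooms"] > 20:
        rec["num_rooms"] = NOT_STATED
    # Consistency edit across questions: working toward a credential from
    # a type of institution implies having attended that type.
    if rec["working_toward_university_credential"] == "yes":
        rec["ever_attended_university"] = "yes"
    return rec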

For the 2012 APS, a series of important imputations was conducted in relation to Aboriginal identity classifications. For example, respondents with missing data for question ID_Q02 on Aboriginal identity group, ID_Q03 on Registered Indian status or ID_Q05 on membership in a First Nation or Indian band had values imputed based on their responses to the National Household Survey. For those who self-reported as an Aboriginal person in APS question ID_Q01 but who did not report any specific Aboriginal group in ID_Q02, an imputation was also conducted based on the respondent’s answers to the NHS. In addition, an imputation was performed for persons who had not identified with any Aboriginal group but had identified as being (1) a Status Indian, (2) registered as a Status Indian under Bill C-31 or Bill C-3, or (3) a member of a First Nation or Indian band; these respondents were imputed as First Nations people (North American Indian).
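
A simplified sketch of these imputation rules follows; the mapping of NHS donor values onto APS question codes and all field names are hypothetical.

MISSING = {"dont_know", "refusal", "not_stated"}

def impute_identity(aps, nhs):
    """Fill missing identity items from the linked NHS response (sketch)."""
    # Missing ID_Q02, ID_Q03 or ID_Q05: take the value from the NHS donor.
    for q in ("ID_Q02", "ID_Q03", "ID_Q05"):
        if aps[q] in MISSING:
            aps[q] = nhs[q]
    # Self-reported Aboriginal (ID_Q01) with no specific group in ID_Q02.
    if aps["ID_Q01"] == "yes" and aps["ID_Q02"] in MISSING:
        aps["ID_Q02"] = nhs["ID_Q02"]
    # Status, Bill C-31/C-3 registration or band membership with no
    # reported group implies First Nations (North American Indian).
    if aps["ID_Q02"] == "no_group" and (aps["ID_Q03"] == "yes"
            or aps["bill_c31_c3"] == "yes" or aps["ID_Q05"] == "yes"):
        aps["ID_Q02"] = "first_nations"
    return aps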

Finally, although all of these edits across topics were performed systematically using computer programmed edits, there were some cases for which very complex combinations of information were reviewed and corrected manually.

5.8 Derived variables and NHS linkage

In order to facilitate more in-depth analysis of the rich APS dataset, over 500 derived variables were created by combining items on the questionnaire. Derived variables (DVs) were created across all major content domains.  In addition, more than 100 National Household Survey variables were linked to the final APS analytical file for 2012.

Many of the derived variables were straightforward and simply combined equivalent questions, such as those across educational streams. Other simple derived variables involved collapsing categories into broader groupings. In other cases, two or more variables were combined to create a new or more complex variable useful to data analysts. Some derived variables were based on linked variables from the NHS, including multiple NHS geographies and Inuit regions. Aboriginal ancestry was also taken from the NHS, since it is not measured directly by the 2012 APS.

In constructing derived variables, a valid response category was generally not assigned to a respondent for a given derived variable if any part of the equation was not answered (that is, if any question used in the derived variable had been coded to “Don’t know”, “Refused” or “Not stated”). In such cases, the code assigned to the derived variable was labelled “Not stated”.
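
This propagation rule can be sketched as follows, using the reserved codes from section 5.5; the combine function is a hypothetical placeholder for the actual derivation logic.

def is_missing(code):
    """True for reserved DK/RF/Not-stated codes such as "997", "998", "999"."""
    return code[:-1] == "9" * (len(code) - 1) and code[-1] in {"7", "8", "9"}

def derive(input_codes, combine, width):
    """Combine valid inputs into a DV, else code the DV as "Not stated"."""
    if any(is_missing(c) for c in input_codes):
        return "9" * width
    return combine(input_codes)

# derive(["02", "99"], lambda codes: "01", width=2) -> "99" (Not stated)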

Most derived variable names have a “D” in the first character position of the name. Geography DVs are the exception, since they reflect the corresponding NHS variable name.  For all linked NHS variables, the NHS variable name was preserved as much as possible on the APS database. Some exceptions applied since APS variable names are restricted to eight characters whereas NHS variable names sometimes exceeded eight characters.

The 2012 APS codebook (data dictionary) identifies in detail which variables were derived and provides information on how the derivations were done. Highlights of DVs are listed by theme in Appendix A along with other survey indicators. A complete list of linked NHS variables and their accompanying notes is provided in the 2012 APS codebook (data dictionary), which accompanies the APS analytical file.

5.9 Creation of final data files and codebook (data dictionary)

Four final data files were created in data processing:

  • Final processing file
  • Analytical file for use in Research Data Centres
  • Public use microdata file (PUMF)
  • Inuit share files, as per data sharing agreements with the four Inuit regions

The final processing file is an in-house file that includes a number of temporary variables used exclusively for processing purposes. The analytical file, the public use microdata file (PUMF) and the Inuit share files are dissemination files that are processed further for release purposes. Dissemination files are scheduled for distribution at various points in time following the APS release day of November 25, 2013 (please refer to Chapter 9 for more detailed descriptions and dissemination details).

The analytical file is distributed in Research Data Centres across Canada but can only be accessed by researchers who fulfill certain requirements. The analytical file is also used at Statistics Canada to produce data tables in response to client requests. The PUMF is constructed for wider public distribution. The Inuit share files are produced in accordance with data sharing agreements with the Inuit regions: Nunatsiavut, Nunavik, Nunavut and the Inuvialuit region. On all of these dissemination files, many steps have been taken to ensure respondent confidentiality.

To transform the final, cleaned processing file into the analytical file for researchers, several steps were taken. First, a series of measures was applied for the enhanced protection of respondent confidentiality. Next, person-weights were added to the file; weighting is described in more detail in Chapter 6. Finally, all temporary variables used exclusively for processing purposes were removed.
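
As an illustrative sketch of these three steps, assuming the files are held as pandas DataFrames keyed by a hypothetical respondent identifier:

import pandas as pd

def build_analytical_file(processing_df, weights_df, temp_columns,
                          confidentiality_steps):
    """Sketch: protect confidentiality, add weights, drop processing vars."""
    df = processing_df
    for step in confidentiality_steps:   # confidentiality protections first
        df = step(df)
    df = df.merge(weights_df, on="respondent_id")  # add person-weights
    return df.drop(columns=temp_columns)           # remove temporary vars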

Accompanying the 2012 APS analytical file is the record layout, SAS (Statistical Analysis System) and SPSS (Statistical Package for the Social Sciences) syntax to load the file, and metadata in the form of a codebook that describes each variable and provides weighted and unweighted frequency counts.  The codebook is also referred to as the data dictionary.

The public use microdata file (PUMF) undergoes more extensive data processing for the protection of respondent confidentiality. To ensure the non-disclosure of confidential information, the level of detail on the PUMF is not as fine as that of the analytical files kept by Statistics Canada. Actions are taken to protect against the identification of respondents with potentially recognizable combinations of characteristics. These protective actions include restricting the geographies included on the file, adjusting survey weights, reviewing overlaps with other published PUMFs, excluding variables, grouping categories for some variables, capping some extreme numerical values, and identifying unique records at risk and rare occurrences.
