Data Collection, Capture and Coding

Scope and purpose
Principles
Guidelines
Quality indicators
References

Scope and purpose

Data collection is any process whose purpose is to acquire or assist in the acquisition of data. Collection is achieved by requesting and obtaining pertinent data from individuals or organizations via an appropriate vehicle. The data is provided either directly by the respondent (self-enumeration) or via an interviewer. Collection also includes the extraction of information from administrative sources, which may require asking the respondent for permission to link to administrative records.

Data capture refers to any process that converts the information provided by a respondent into electronic format. This conversion is either automated or involves staff keying the collected data (keyers). Data coding is any process that assigns a numerical value to a response. Coding is often automated; however, more complex decisions usually require human intervention (coders).

Often, survey operations involve a high degree of automation, which leads to the availability of paradata, information related to a survey process. Examples of paradata include an indicator of whether or not a unit is in the sample, history of calls and visits, trail of key strokes (audit trail), mode of collection, administrative information (e.g. interviewer profile) and cost information.

Data collection is not only the source of information; it is also the main contact a survey-taking agency has with the public, who need to be convinced to participate. Data capture and coding produce the formatted data used as input by all the subsequent survey processes. Data collection, data capture and coding operations often use a large portion of the survey budget and require considerable human and physical resources as well as time.

Principles

Respondents are a survey-taking organization's most valuable resource. Every variable that cannot be derived from other existing sources is a burden to the respondent. The amount of time and energy a respondent spends on providing data must be minimized. Privacy and security must be respected throughout all data gathering and processing operations. Given that these operations have a high impact on data accuracy, quality and performance measurement tools should be used to manage data collection, data capture and data coding processes.

Guidelines

Data collection

  • Careful planning of the collection process should include the establishment of roles and responsibilities regarding all aspects linked to collection, including its communication strategy, execution, assessment, monitoring, contingency planning and security.

  • Design the collection process in order to reduce respondent burden and collection cost, and to maximise timeliness and data accuracy. Data could be collected through self-enumeration, telephone interviews or personal interviews with either a paper or an electronic questionnaire (e.g. electronic data reporting, Internet, computer-assisted interviewing). To achieve the design objectives stated above more easily, consider using more than one method throughout the collection cycle. For instance, collection may start with self-enumeration using a paper or Internet questionnaire and may finish with a personal interview. For self-enumeration surveys, use multiple events (e.g. pre-collection advertisement card, introduction letter with the questionnaire, reminder card, reminder call or visit) over the collection period in order to stimulate the return of questionnaires. Examine whether some of the data elements could be acquired via administrative records instead of the more costly and sometimes less accurate traditional collection methods. Consider conducting the collection as a supplement to a large scale survey. This would not only potentially reduce survey costs and respondent burden but also make available a wealth of information for nonresponse adjustment. When feasible, conduct pilot studies or tests to help determine or fine-tune the collection operation.

  • Establish appropriate sample control procedures and measures for all data collection operations (e.g. delivery and return of paper questionnaires, follow up on gaps or inconsistencies, follow up on nonresponse). Such procedures track the status of sampled units from the beginning through to the completion of data collection so that data collection managers and interviewers can assess progress at any point in time. This is particularly important for surveys that use many data collection modes and that move cases from one mode to another (or from one collection centre to another). Sample control procedures are also used to ensure that every sampled unit is processed through all the steps subsequent to data collection (i.e. capture and coding steps), with a final status being recorded. Sample control measures can be used to evaluate the efficiency of those procedures.

  • Establish and maintain good respondent relationships in order to obtain a good response rate. Such measures can include advertising the upcoming survey, an introductory letter to inform the respondents that they will be part of a survey, an informative brochure with key statistics to maintain their interest in participating in the survey (in particular for longitudinal surveys) or procedures facilitating access to publicly available information (for example on a website, a guide to complete the questionnaire or helpline particularly for self-enumeration surveys) or a letter thanking them for their participation. These measures will help to sensitize the units selected in the sample to participate in the survey.

  • When collecting data, ensure that the respondent or the appropriate person within the responding household or organization is contacted at the appropriate time. Allow the respondent to provide the data in a method and format that is convenient to the individual and his or her organization. This will help increase response rates and improve the quality of the information obtained from the respondents. Special reporting arrangements should be considered in specific cases in order to reduce respondent burden and to facilitate the collection of information.  For example, consider creating a special collection arrangement for enterprises that are involved in many surveys. For households, when the targeted respondent is not available, establish rules to determine who could act as an appropriate proxy should this be an option.

  • For collection by interview, determine the best time to call or to visit survey units based on paradata acquired during previous iterations of the survey or from a similar survey. Manage calls or visits in such a way that respondents are contacted at the best time and that the number of call or visit attempts does not exceed a useful maximum. In addition, respondents should each be assigned a priority level so that they may be contacted or visited for interviews based on order of importance. Assignment priority should be based on the target effective sample size by domain of interest that would lead to estimates accurate enough (having low bias and variance) to be released. For business surveys, this would mean giving higher priority to large or influential units first, possibly at the risk of missing smaller units. For household surveys, priority should be given to units less likely to respond. A score function is a useful tool for prioritization. For telephone interviews, use an automated system to manage case call scheduling. Such a system should also prioritize cases.

  • Interviewers are vital to the success of data collection operations. Interviewer manuals and training must be carefully prepared and planned, since they provide the best way to guarantee data quality (e.g. high response rate and accurate responses), the comprehension of survey concepts and subject matter, as well as to ensure proper answers to questions from respondents. Training can use different approaches such as home study, classroom training, mock interviews or live interviews. Interviewing skills of interviewers should be monitored to ensure that they conform to a pre-established list of standards (e.g. reading questions as written in the questionnaire). This monitoring should also be used to identify strengths and weaknesses in the interviewer's skill set, to provide feedback to the interviewers and to focus training on weaker areas. Depending on the interviewing mode and resources, the monitoring may either be done using recordings of the interviews or live. Consultation with interviewers and staff directly responsible for collection operations will help in the development of better training tools. Follow-up interviews with respondents may also be used to get the respondent's point of view on how the interview was carried out.

  • Tracing should be conducted to locate and contact respondents when the available contact information on the survey unit is likely to be outdated. Tracing increases response rate and also helps in determining if the sampled unit is still in scope. Consider using administrative sources (e.g. telephone files, other survey frames) prior to survey collection and during collection in order to update contact information. During collection, facilitate high quality tracing by obtaining extra information related to the sample unit, for example, the names of other family members, relationship, age, etc. Local knowledge might also be useful. Consider forming a team of tracing experts when the survey is repeated or its collection period is over several months. In between cycles, facilitate feedback from the respondent to update contact information. For example, provide the respondents with a "change of address" card and ask them to notify the Agency if a move occurs. Collect tracing information (e.g. internet address, cell phone number) that can be used in the subsequent survey cycles.

  • For self-enumeration surveys, once the data is received, verify gaps or inconsistencies related to accuracy of the coverage information and the quality of the data provided. Follow-up interviews may be needed in some cases (e.g. when the questionnaire is missing a large number of items). Assign a follow-up priority based on the statistical importance of these units and of the missing items.

  • Given that self-enumeration surveys tend to result in lower unit response rates, consider following up with non-respondents by telephone or in person to obtain their participation or conduct an interview. Ensure that collection staff is informed in a timely fashion of the registration of returned questionnaires in order to avoid unnecessary follow-ups. This type of follow-up is particularly important in the case of longitudinal surveys where the investment is clearly more long-term and the sample is subject to accumulating attrition (and possibly bias) due to nonresponse at each survey occasion. Unit nonresponse follow-ups should also be prioritized with the approach described above for managing interview surveys. Paradata (e.g. number of call or visit attempts) can also be useful to prioritize follow-ups.

  • As a last step of the collection operation, consider contacting a sub-sample or all of the non-responding units (including unresolved cases) to determine whether they are in scope or not (e.g. active business or not, occupied dwelling or not); and, if so, a critical data item such as size (e.g. business total income, household size) should be obtained. This information will be useful for the nonresponse adjustment. In some instances, the information can be obtained from or approximated by current administrative data for all of the non-responding units.

  • Provide plans and tools to actively manage survey data collection while it is in progress. Productivity measures (e.g. daily and cumulative number of units resolved) and cost indicators (e.g. daily and cumulative interviewer hours and travel expenditures) can be used to assess the relationship between the collection effort and the results (e.g. unit response rate). Compared with planned values, these indicators also help survey managers in their decision-making throughout the collection period. Used in conjunction with the daily response rate, the daily productivity rate and daily average unit cost provide the marginal cost of response rate increase during the course of collection.  Activity and cost indicators (according to selected unit or completed questionnaire) also make it possible to evaluate the additional costs and effort required to increase response rates, particularly towards the end of the collection period.

  • Every effort should be made to ensure the confidentiality of the data. Staff handling confidential data must be familiar with best practices regarding the printing, handling and filing of paper documentation, the handling of electronic files, and the rules regarding the dissemination of information.

  • Consider implementing a re-interview program to assess the overall accuracy of interviewing operations.

  • Use paradata to identify operational efficiency and cost-efficiency opportunities (e.g. sequence of calls, best time to call, optimal limit for calls or visits, etc.) in order to improve current and future collection processes and practices. For example, use average and distribution of interview duration to plan the next survey cycle. Interview duration can also be used to evaluate part of respondent burden. If interview duration is analysed by interviewer, it can be used as well to identify those potentially requiring additional training (e.g. those with outlying average duration).
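
The active-management indicators described in the guidelines above can be sketched in a few lines. The following is a minimal illustration, not a prescribed method: the `DailyReport` record, its field names and the simplified response rate (completed questionnaires divided by the full sample size, ignoring out-of-scope units) are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class DailyReport:
    """Hypothetical daily summary of a collection operation."""
    day: int
    completed: int  # questionnaires completed that day
    cost: float     # interviewer hours and travel cost for that day


def collection_indicators(reports, sample_size):
    """Return, per day, the cumulative response rate (in %), the average
    cost per completed unit and the cost of one response-rate point."""
    rows = []
    cumulative = 0
    for r in reports:
        cumulative += r.completed
        # Percentage points of response rate gained on this day.
        gain = 100.0 * r.completed / sample_size
        rows.append({
            "day": r.day,
            "response_rate": 100.0 * cumulative / sample_size,
            "unit_cost": r.cost / r.completed if r.completed else None,
            "marginal_cost_per_point": r.cost / gain if gain else None,
        })
    return rows
```

A marginal cost per point that rises sharply towards the end of the collection period is the kind of signal a survey manager could compare against planned values when deciding whether further effort is worthwhile.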

Data capture

  • Design the capture process in order to reduce capture cost and to maximise timeliness and data accuracy. Data items could be captured during survey collection by the respondents (e.g. Internet, EDR) or the interviewers (e.g. CATI, CAPI). This reduces the cost of capture, increases the timeliness and has the potential of improving accuracy through edit rules being integrated into the computer application. When it is not feasible to integrate capture with collection, the capture is performed either by operators (manual key entry) or in an automated fashion (scanning followed by Intelligent Character Recognition). The latter is preferred as it reduces cost and often enhances accuracy of the data.

  • For CATI and CAPI interviewers, who often perform data capture and coding during collection, use standard collection tools and processes (e.g. standard screens and standardized questions) to ease interviewer work and limit the risk of introducing a capture error. Integrate edit rules in the collection system to validate the entry of data items and allow for potential corrections of errors (i.e. keying error, response error and missing item) at the time of collection.

  • Data capture operators are critical to the success of the capture operations. Ensure that they have appropriate training and tools. Prepare training material and procedures for the keyers and deliver training sessions. This will enhance the skills of the staff and thus ensure accurate capture of data collected. Use quality control methods to verify whether the accuracy of capture performed by operators meets the pre-established levels and provide them with feedback for improvement.

  • Manual data capture from paper questionnaires or scanned images is subject to keying errors. Incorporate online edits for error conditions that the data capture operator can correct (i.e. edits that will identify keying errors). Record these cases for later review and analysis. When feasible, the manual operation should be tested prior to conducting the survey.

  • For automated data capture, ensure that the questionnaire is designed to ease the scanning and the intelligent character recognition.

  • When automated capture is used, some questionnaires cannot be scanned and others can be scanned but characters cannot be recognised. For damaged or badly scanned questionnaires, use a team of keyers to perform the capture.

  • Systems for automated data capture by intelligent character recognition from scanned images should be tested prior to implementation. Such systems may cause relatively high rates of systematic errors in specific data items. It might be possible to improve the algorithms and their parameters to reduce the error rates. For the data items at high risk of systematic error, consider using keyers.

  • Keyers should also be used to conduct a sample study assessment of the accuracy of automated capture. The results of such a study can be used to improve the process.

  • Institute effective control of systems to ensure the security of data capture, transmission and handling, especially with new technologies such as cell phone and Internet data collection. Prevent loss of information and the resulting decline in quality, and potentially in credibility, due to system failures or human errors. Develop procedures for destroying the data when no longer needed.
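
The sample study of automated capture accuracy mentioned above could be summarized as follows. This is only a sketch under stated assumptions: the keyed value is treated as the true value, the record layout `(field_name, icr_value, keyed_value)` is hypothetical, and the 5% threshold for sending a data item to keyers is an arbitrary illustration.

```python
from collections import defaultdict


def field_error_rates(verification_sample):
    """Estimate the capture error rate of each data item.

    verification_sample: iterable of (field_name, icr_value, keyed_value)
    tuples, where keyed_value is assumed to be correct.
    """
    checked = defaultdict(int)
    errors = defaultdict(int)
    for field, icr_value, keyed_value in verification_sample:
        checked[field] += 1
        if icr_value.strip() != keyed_value.strip():
            errors[field] += 1
    return {field: errors[field] / checked[field] for field in checked}


def fields_to_key(rates, threshold=0.05):
    """List the data items whose estimated error rate exceeds the threshold."""
    return sorted(field for field, rate in rates.items() if rate > threshold)
```

Data items returned by `fields_to_key` are candidates for manual keying or for tuning of the recognition parameters.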

Data coding

  • Design the coding process in order to reduce coding cost and to maximize timeliness and data accuracy. Often data items are precoded during collection with the use of closed questions. This reduces the cost of coding and could also improve accuracy. When this is not feasible and open questions are asked, the coding is performed after collection either by operators or in an automated fashion (e.g. using the Automated Coding with Text Recognition system). The latter is preferred as it often reduces cost and enhances accuracy of the results.

  • For manual coding operations, make sure that the procedures are applied to all units of study as consistently and in as error-free a manner as possible. A computer-assisted operation is desirable. Enable the staff or systems to refer difficult cases to a small number of knowledgeable experts. Centralize the processing in order to reduce costs and make it simpler to take advantage of available expert knowledge. Given that there can be unexpected results in the collected information, use processes that can be adapted to make appropriate changes if necessary from the point of view of efficiency. When feasible, the manual operation should be tested prior to conducting the survey.

  • Data coding operators are critical to the success of the coding operations. Ensure that they have appropriate training and tools. Prepare training material and procedures for the coders and deliver training sessions. This will enhance the skills of the staff and thus ensure accurate coding of data. Use quality control methods to verify whether the accuracy of coding performed by operators meets the pre-established levels and provide them with feedback for improvement.

  • For automated coding, build and maintain reference files to maximize phrases recognized while minimizing errors. When automated coding is used, often a number of cases remain uncoded. The use of a team of coders is an appropriate approach to complete these cases.

  • Expert coders should be used to conduct a sample study assessment of the accuracy of automated coding. The results of such a study can be used to augment and to improve the content of reference files used.
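
The division of work between automated coding and a team of coders can be illustrated with a deliberately simple sketch. It assumes a reference file held as a dictionary of normalized phrases and uses exact matching after normalization; this is a stand-in for illustration only and does not describe how any production coding system matches text.

```python
def normalize(text):
    """Lower-case the response and collapse runs of whitespace."""
    return " ".join(text.lower().split())


def auto_code(responses, reference_file):
    """Code the responses found in the reference file.

    responses: dict mapping unit id -> free-text response.
    reference_file: dict mapping normalized phrase -> code (hypothetical).
    Returns (coded, uncoded), where uncoded lists the unit ids left for
    manual coders.
    """
    coded, uncoded = {}, []
    for unit_id, text in responses.items():
        code = reference_file.get(normalize(text))
        if code is None:
            uncoded.append(unit_id)
        else:
            coded[unit_id] = code
    return coded, uncoded
```

Phrases that coders resolve repeatedly in the `uncoded` list are natural candidates for addition to the reference file.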

Quality control

  • Use statistical quality control methods to assess and improve the quality of collection, capture and coding operations. Collect and analyze quality control measures and results in a manner that would help identify the major root causes of error. Provide feedback reports to managers, staff, subject matter specialists and methodologists. Use measures of quality and productivity to provide feedback at the interviewer or operator level, as well as to identify error-causing elements in the design of the operation or its processing procedures. These reports should contain information on frequencies and sources of error (see Mudryk et al., 1994, 1996 and 2002; Mudryk and Xiao, 1996). Various software tools are available to help in this regard. These include the Quality Control Data Analysis System (QCDAS) and NWA Quality Analyst (see Mudryk, Bougie and Xie, 2002).
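
One common statistical quality control technique is the p-chart, which compares the error rate of each verified batch with three-sigma limits around a target rate. The sketch below is offered only as an illustration of that general technique; it is not claimed to be the procedure used in the cited references, and the batch layout `(operator, errors, n)` is an assumption.

```python
import math


def p_chart_limits(p_bar, n):
    """Three-sigma control limits for an error proportion in a batch of size n."""
    sigma = math.sqrt(p_bar * (1 - p_bar) / n)
    return max(0.0, p_bar - 3 * sigma), min(1.0, p_bar + 3 * sigma)


def out_of_control(batches, p_bar):
    """Return the operators whose batch error rate exceeds the upper limit.

    batches: list of (operator, errors, n) tuples from quality verification.
    p_bar: target or historical average error rate.
    """
    flagged = []
    for operator, errors, n in batches:
        _, upper = p_chart_limits(p_bar, n)
        if errors / n > upper:
            flagged.append(operator)
    return flagged
```

Flagged batches point to where feedback or a review of procedures may be needed; they do not by themselves identify the root cause.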

Post-mortem analysis

  • Conduct a post-mortem evaluation of data collection, capture and coding operations, and document the results for future use. Evaluate the processes to identify the lessons learned with the goal of improving each of its components. For that purpose, post-survey studies are often useful.

  • Use subsequent survey processes to gather useful information regarding quality that can serve as signals indicating that collection, capture and coding procedures and tools may require changes for future survey cycles. For example, the editing or data analysis stages may suggest the possibility of response bias or other collection-related problems.

Quality indicators

Main Quality Element: Accuracy

The impact of data collection and capture operations (including coding) on data quality and cost is both direct and critical, as these data are the primary inputs of a survey-taking agency, and often the most important survey expenditure components. The quality of these operations thus has a very high impact on the quality of the final product, in particular, on its accuracy.

Quality measures gathered during the data collection operation enable the survey manager to make decisions regarding the need for process modification or redesign. Important quality measures include response rates, processing error rates, follow-up rates and rates of nonresponse by reason. When these measures are available at all levels at which estimates are produced and at various stages of the process, they can serve both as performance measures and measures of data quality.

Proxy rates

Report proxy rates (i.e. percentage of cases where responses are obtained from a respondent who is not the selected survey unit) as an indicator of potential response error.

Nonresponse rates

Report nonresponse rates as an indicator of nonresponse bias. Unit nonresponse can be decomposed into many components, for example, the interview was prevented due to noncontact, refusal, temporary absence, technical problem, language problems or the respondent's mental or physical condition. To reflect the uncertainty related to coverage, unit nonresponse can also be decomposed as cases resolved (i.e. in-scope status is determined) versus unresolved (i.e. in-scope status is undetermined). Report item nonresponse (e.g. refusal and don't know) to key questions. Item nonresponse rates may vary for early and late respondents (i.e. those who require more calls or visits). Both unit and item nonresponse rates can be reported by domain of interest to be released (in this instance it can also serve as a release criterion) and also by sub-population (e.g. large and small businesses, young and older adults) to indicate how well the effective sample represents the population. Unit nonresponse rate and item nonresponse rate can also be combined to provide global nonresponse rate by item. Other useful indicators are the refusal conversion rate and tracing conversion rate (in the case of erroneous or outdated initial contact information). For surveys with topics of a more sensitive nature, refusal rate at the first contact can be reported.
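
A decomposition of unit nonresponse by reason might be computed as sketched below. The status labels are hypothetical, and two simplifying assumptions are made: the denominator counts all resolved in-scope units plus all unresolved units, and the global rate for an item combines the unit rate with the item rate observed among respondents as 1 - (1 - unit rate)(1 - item rate). An agency's own reporting standard may define these rates differently.

```python
from collections import Counter


def unit_nonresponse_breakdown(final_statuses):
    """Return the nonresponse rate for each reason.

    final_statuses: one label per sampled unit, e.g. "respondent",
    "refusal", "noncontact", "unresolved" or "out_of_scope".
    """
    counts = Counter(final_statuses)
    counts.pop("out_of_scope", None)   # resolved out-of-scope units are excluded
    eligible = sum(counts.values())    # in-scope units plus unresolved units
    return {
        status: n / eligible
        for status, n in counts.items()
        if status != "respondent"
    }


def global_item_nonresponse(unit_rate, item_rate_among_respondents):
    """Combine unit and item nonresponse into one rate for a given item."""
    return 1 - (1 - unit_rate) * (1 - item_rate_among_respondents)
```

Summing the values returned by `unit_nonresponse_breakdown` gives the overall unit nonresponse rate under the same assumptions.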

In-scope/out-of-scope error rates

When an in-depth study is conducted on how well the collection operation has classified nonresponding units as in-scope/out-of-scope (e.g. business: active/inactive, dwelling: occupied/unoccupied), report the rate of being classified as in-scope when truly out-of-scope and the rate of being classified as out-of-scope when truly in-scope. These rates can be reported by domain of interest to be released.
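
The two error rates can be computed from the study results as sketched here. The sketch assumes each rate is conditional on the true status (for example, the share of truly out-of-scope units that were classified as in-scope); the text does not fix the denominator, so this choice is an assumption.

```python
def scope_error_rates(pairs):
    """Compute the two scope misclassification rates.

    pairs: iterable of (classified_in_scope, truly_in_scope) booleans,
    one pair per nonresponding unit examined in the study.
    """
    true_in = true_out = false_in = false_out = 0
    for classified, truth in pairs:
        if truth:
            true_in += 1
            if not classified:
                false_out += 1   # classified out-of-scope, truly in-scope
        else:
            true_out += 1
            if classified:
                false_in += 1    # classified in-scope, truly out-of-scope
    return {
        "in_scope_when_truly_out": false_in / true_out if true_out else None,
        "out_of_scope_when_truly_in": false_out / true_in if true_in else None,
    }
```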

Average interview length distribution

Report the average and the distribution of interview duration. In particular, report the percentage of extremely short interviews which may indicate problems with the reported data. Analysis of interview length can also be used to evaluate part of respondent burden.
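
A summary of interview duration could look like the following sketch, which uses Python's standard `statistics` module. The five-minute cut-off for an "extremely short" interview is a hypothetical value chosen for illustration; an appropriate threshold depends on the questionnaire.

```python
import statistics


def duration_summary(minutes, short_threshold=5.0):
    """Summarize interview durations (in minutes).

    Returns the mean, the quartiles and the percentage of interviews
    shorter than short_threshold.
    """
    q1, median, q3 = statistics.quantiles(minutes, n=4)
    short = sum(1 for m in minutes if m < short_threshold)
    return {
        "mean": statistics.mean(minutes),
        "q1": q1,
        "median": median,
        "q3": q3,
        "pct_short": 100.0 * short / len(minutes),
    }
```

Running the same summary separately for each interviewer is one way to spot outlying average durations.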

Mode effect

A mode effect is a measurement bias that is attributable to the mode of data collection. Ideally, mode effects can be investigated using experimental designs where sample units are randomly assigned into two or more groups. Each group is surveyed using a different data collection mode. All other survey design features are controlled. Differences in the response distributions for the different groups can be compared and assessed. Other methods, such as the propensity score method or regression analysis, can be used to assess mode effects when experimental designs cannot be applied.
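
For a randomized two-group experiment, the comparison of response distributions can be as simple as a two-proportion z statistic on one yes/no item. The sketch below assumes simple random assignment and ignores survey weights, clustering and any other complex design feature, so it should be read as an illustration of the idea rather than a complete analysis.

```python
import math


def mode_difference(yes_a, n_a, yes_b, n_b):
    """Compare the proportion answering "yes" under two collection modes.

    Returns (difference in proportions, two-proportion z statistic).
    """
    p_a = yes_a / n_a
    p_b = yes_b / n_b
    pooled = (yes_a + yes_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    return p_a - p_b, (p_a - p_b) / se
```

A large absolute z value suggests that the two modes yield different response distributions for that item, which, under random assignment and otherwise identical design, points to a mode effect.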

Edit reject rates

Report the rate of edit rejects and the number and type of corrections applied by domain, collection mode, processing type, data item and language of collection. This will help in evaluating the quality of the data and the efficiency of the editing function used in collection and capture operations. The edit reject rates can be decomposed by the reason of rejection (i.e. the item is missing or the item reported is inconsistent with the normal range of values for that item or with other items reported). The latter component is an indicator of measurement error (i.e. response error + capture error).
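
The decomposition by reason of rejection can be illustrated for a single numeric data item with a simple range edit. The representation of a missing item as `None` and the range bounds are assumptions of the sketch; consistency edits across items are not covered.

```python
def edit_reject_rates(values, low, high):
    """Edit reject rates for one data item, decomposed by reason.

    values: reported values, with None for a missing item.
    low, high: normal range of values for the item (inclusive).
    """
    n = len(values)
    missing = sum(1 for v in values if v is None)
    inconsistent = sum(
        1 for v in values if v is not None and not (low <= v <= high)
    )
    return {
        "missing": missing / n,
        "inconsistent": inconsistent / n,
        "total": (missing + inconsistent) / n,
    }
```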

Outgoing capture/coding error rates

Report outgoing capture/coding error rates in manual and automated operations calculated from results of quality verification or studies. When both manual and automated capture/coding is used, calculate composite rates. Overall rates can be calculated, as well as rates by domain, collection mode, processing type, data item and language of collection.
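
One plausible way to form a composite rate is to weight each operation's outgoing error rate by the volume it processed. The text does not specify the weighting, so volume weighting is an assumption of this sketch.

```python
def composite_error_rate(operations):
    """Volume-weighted outgoing error rate.

    operations: list of (volume, error_rate) pairs, e.g. one pair for
    manual capture and one for automated capture.
    """
    total_volume = sum(volume for volume, _ in operations)
    weighted_errors = sum(volume * rate for volume, rate in operations)
    return weighted_errors / total_volume
```

For example, 2,000 fields keyed manually with a 3% error rate and 8,000 fields captured automatically with a 1% error rate give a composite rate of (60 + 80) / 10,000 = 1.4%.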

References

Bethlehem, J., F. Cobben and B. Schouten. 2008. "Indicators for the Representativeness of Survey Response." Proceedings from the 2008 International Symposium on Methodological Issues, Statistics Canada.

Couper, M.P., R.P. Baker, J. Bethlehem, C.Z.F. Clark, J. Martin, W.L. Nicholls II and J. O'Reilly (eds.) 1998. Computer Assisted Survey Information Collection. New York. Wiley-Interscience. 653 p.

Dielman, L. and M.P. Couper. 1995. "Data quality in a CAPI survey: keying errors." Journal of Official Statistics. Vol. 11, no. 2. p. 141-146.

Dillman, D. A. 2006. Mail and Internet Surveys: The Tailored Design Method.  New York. Wiley. 554 p.

Groves, R.M. 1989. Survey Errors and Survey Costs. New York. John Wiley and Sons. 620 p.

Groves, R.M., P. Biemer, L. Lyberg, J. Massey, W. L. Nicholls and J. Waksberg (eds.) 1988. Telephone Survey Methodology. New York. Wiley-Interscience. 608 p.

Groves, R.M. and S.G. Heeringa. 2006. "Responsive design for household surveys: Tools for actively controlling survey errors and costs." Journal of the Royal Statistical Society. Series A. Vol. 169, no. 3. p. 439-457.

Hunter, L. and J.-F. Carbonneau. 2005. "An Active Management Approach to Survey Collection." Proceedings from the 2005 International Symposium on Methodological Issues, Statistics Canada.

Laflamme, F. and C. Mohl. 2007. "Research and Responsive Design Options for Survey Data Collection at Statistics Canada."  Proceedings of the Section on Survey Research Methods. American Statistical Association.

Laflamme, F., M. Maydan and A. Miller. 2008. "Using Paradata to Actively Manage Data Collection." Proceedings of the Section on Survey Research Methods. American Statistical Association.

Laflamme, F. 2008. "Data Collection Research using Paradata at Statistics Canada." Proceedings from the 2008 International Symposium on Methodological Issues, Statistics Canada.

Laflamme, F. 2008. "Understanding Survey Data Collection Through the Analysis of Paradata at Statistics Canada." American Association for Public Opinion Research 63rd Annual Conference, 2008. Proceedings of the Section on Survey Research Methods. American Statistical Association.

Lepkowski, James M. et al. 2007. "Advances in Telephone Survey Methodology." Second International Conference on Telephone Survey Methodology, Miami 2006. Wiley series in survey methodology section. p. 363-367.

Lyberg, L., P. Biemer, M. Collins, E. de Leeuw, C. Dippo, N. Schwarz and D. Trewin (eds.) 1997. Survey Measurement and Process Quality. New York. Wiley-Interscience. 808 p.

Mudryk, W., M.J. Burgess and P. Xiao. 1996. "Quality control of CATI operations in Statistics Canada." Proceedings of the Section on Survey Research Methods. American Statistical Association. p. 150-159.

Mudryk, W., B. Joyce and H. Xie. 2004. "Generalized Quality Control Approach for ICR Data Capture in Statistics Canada's Centralized Operations." European Conference on Quality and Methodology in Official Statistics. Federal Statistical Office, Germany.

Rosenbaum, P.R. and D.B. Rubin. 1983. "The central role of the propensity score in observational studies for causal effects." Biometrika. Vol. 70, no. 1. p. 41-55.

Statistics Canada. 1998a. "Policy on Informing Survey Respondents." Statistics Canada Policy Manual. Section 1.1. Last updated March 4, 2009.

Statistics Canada. 2001d. Standards and Guidelines for Reporting of Nonresponse Rates.  Statistics Canada Technical Report.

Statistics Canada. 2003. Survey methods and practices. Statistics Canada Catalogue no. 12-587-XPE. Ottawa, Ontario. 396 p.
