5. Which data sources to bolster surveys?

Constance F. Citro


For decades after the introduction of probability sampling in official statistics, the only alternative source was administrative records - from various levels of government, depending on a country’s governmental structure (federal, state and local in the United States), and from nongovernmental entities (e.g., employer payroll records or hospital admission records). Over time, a number of national statistical agencies around the world began to incorporate administrative records into their programs - from using them in an ancillary way to moving census and survey programs lock, stock and barrel to an administrative records-based paradigm.

Technological innovations in the 1970s and 1980s led to some additional data sources - such as records of expenditures at checkouts (made possible by the development of bar codes and scanners) and aerial and satellite images for categorizing land use - becoming at least potentially available for official statistics. But the landscape of data sources was still relatively contained. Beginning in the 1990s, the advent of the Internet and high-speed distributed computing technology unleashed a mind-boggling array of new data sources, such as data from traffic camera feeds, tracking of cell phone locations, search terms used on the Web and postings on social media sites. The challenge for statistical agencies is to classify and evaluate all of these data sources in ways that help agencies determine their usefulness.

5.1 Is “Big Data” a useful concept?

Many new types of data that have become available in the past 15 or so years are often very large in size, leading to the use of the term “big data”. I argue that this buzz phrase does little, if anything, to assist statistical agencies in determining appropriate combinations of data for their programs. In computer science, “big data is high volume, high velocity and/or high variety information assets that require new forms of processing to enable enhanced decision making, insight discovery, and process optimization” (Laney 2001). These properties are not inherent in any particular type of data or in any particular platform, such as the Internet. Instead, what qualifies as “big data” is a moving target, as advances are made in high-speed computing and data analysis techniques. In today’s computing environment, census, survey, and administrative records data rarely qualify as “big”, although they may have done so in an earlier era. People today tend to classify as “big” the data streams from cameras, sensors, and largely free-form interactions with the Internet, such as social media postings. In the future, many of these kinds of data may no longer fit under this rubric. Moreover, the Internet not only generates a great deal of today’s “big data”, but also provides ordinary-size data in a more accessible way - for example, access to public opinion polls or to local property records.

I would argue that statistical agencies will most often want to be, and should be, “close followers” rather than leaders in using big data. It seems to me most appropriate for academia and the private sector to be out front in tackling uses of data that are so voluminous and of such high velocity and variety that they require big leaps forward in processing and analysis methods. Statistical agencies should be alert to developments in the big data field that promise benefits for their programs down the road and may be well advised to support research in this area to help ensure that applications relevant to their programs emerge. On the whole, however, I believe that statistical agency resources are best used for working with data sources that offer more immediate benefits.

Groves (2011) has attempted to move toward a more relevant classification for statistical agencies than that between “big data” and all other data, by distinguishing between what he terms “designed data” that are “produced to discover the unmeasured” and “organic data” that are “produced auxiliary to processes, to record the process”. Keller, Koonin and Shipp (2012) list examples of data sources under Groves’ two headings. Their list of designed data includes: administrative data (e.g., tax records); federal surveys; censuses of population; and “other data collected to answer specific policy questions”. Their list of organic data includes: location data (cell phone “externals”, E-ZPass transponders, surveillance cameras); political preferences (voter registration records, voting in primaries, political party contributions); commercial information (credit card transactions, property sales, online searches, radio-frequency identification); health information (electronic medical records, hospital admittances, devices to monitor vital signs, pharmacy sales); and other organic data (optical, infrared and spectral imagery, meteorological measurements, seismic and acoustic measurements, biological and chemical ionizing radiation). Not mentioned under either category are such data as Facebook or Twitter postings, although they might fall under the broad rubric of “online searches”.

Whether the two-part classification in Keller et al. (2012) is much more useful than “big data” for statistical agency purposes is an open question. For example, classifying voter registration records or electronic health records as organic data and not as designed administrative data seems to miss ways in which they differ from such sources as online searches and ways in which they are similar to federal and state government administrative records. Moreover, even organic data are “designed”, if only minimally, in the sense that the provider has specified some parameters, such as 140 characters for a Twitter post or a particular angle of vision for a traffic camera. Nonetheless, the designed versus organic distinction does point to a useful dimension: the degree to which statistical agencies have ready access to a data source, control changes to it, and can readily understand its properties.

5.2 Dimensions of data sources: Illustrations for four major categories

Coming up with satisfactory nomenclature and evaluation criteria that can help statistical agencies assess the potential usefulness of alternative data sources for their programs - with the goal of becoming as familiar with the error properties of alternative sources as they are with total survey error - will not happen without considerable effort by statistical agencies around the world (Iwig et al. 2013 and Daas et al. 2012 are examples of such efforts). I do not pretend that I can come close to that goal in this paper. My goal is more modest - namely, to provide some illustrations so that those who are wedded to a probability survey paradigm (or an administrative records paradigm) can see that the task of understanding alternative data sources is both feasible and desirable. I provide illustrations for four data sources ranging from traditional to cutting-edge:

  1. Surveys and censuses, or a collection of data obtained from responses of individuals, who are queried on one or more topics as designed by the data collector (statistical agency, other government agency, or academic or commercial survey organization) according to principles of survey research, with the goal of producing generalizable information for a defined population.

  2. Administrative records, or a collection of data obtained from forms designed by an administrative body according to law, regulation, or policy for operating a program, such as paying benefits to eligible recipients or meeting payroll. Administrative records systems are usually ongoing and may be operated by government agencies or non-governmental organizations.

  3. Commercial transaction records, or a collection of data obtained from electronic capture of purchases (e.g., groceries, real estate) initiated by a buyer but in a form determined by a seller (e.g., bar-coded product information and prices recorded by check-out scanners or records of product and price information for Web sales, such as through Amazon).

  4. Interactions of individuals with the World Wide Web by using commercially provided tools, such as a Web browser or social media site. This category covers a wide and ever-changing array of potential data sources for which there are no straightforward classifications. One defining characteristic is that individuals providing information, such as a Twitter post, act as autonomous agents: they are not asked to respond to a questionnaire or required to supply administrative information but, instead, choose to initiate an interaction.

I first rank each source on the following two dimensions, which relate to the framework in Biemer et al. (2014). The ranks I assign assume that the statistical agency has not yet taken proactive steps to boost a ranking (e.g., by embedding staff in an administrative agency to become deeply familiar with that agency’s records). A schematic sketch of this two-dimensional classification follows the list. The two dimensions are:

  1. Degree of accessibility to and control by national statistical agency: high (statistical agency designs the data source and controls changes to it); medium (statistical agency has authority to use the data source and influence on changes to it); low (statistical agency must arrange to obtain the data source on the terms of the provider and has little or no influence on changes to it). Gradations can be added to each of these categories depending, for example, on how strong an agency’s authority is to acquire a set of administrative records.

  2. Degree to which components of error can be identified and measured: high, as in designed surveys and censuses; medium, as in public and private sector administrative records; and low, as in streams of data from autonomous choices of individuals.
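
To make the two-dimensional classification concrete, here is a minimal sketch in Python (not part of the original paper). The source names and the single ratings assigned are illustrative simplifications of Table 5.1, where most cells actually span a range (e.g., HIGH to MEDIUM).

```python
from dataclasses import dataclass
from enum import Enum

class Rating(Enum):
    HIGH = 3
    MEDIUM = 2
    LOW = 1
    VERY_LOW = 0

@dataclass
class DataSource:
    name: str
    control: Rating           # dimension 1: accessibility to / control by the statistical agency
    error_visibility: Rating  # dimension 2: ability to identify and measure error components

# Illustrative rankings echoing the first two rows of Table 5.1
# (a single value stands in for each table cell's range, for simplicity).
sources = [
    DataSource("Agency probability survey", Rating.HIGH, Rating.HIGH),
    DataSource("Government administrative records", Rating.MEDIUM, Rating.MEDIUM),
    DataSource("Commercial transaction records", Rating.LOW, Rating.MEDIUM),
    DataSource("Individual Internet interactions", Rating.VERY_LOW, Rating.VERY_LOW),
]

# Order candidate sources from most to least tractable for official statistics.
for s in sorted(sources, key=lambda s: (s.control.value, s.error_visibility.value), reverse=True):
    print(f"{s.name}: control={s.control.name}, error visibility={s.error_visibility.name}")
```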

I further identify aspects of data quality for each source, following Biemer et al. (2014). I also indicate variations for most of the dimensions depending on the provider, such as national statistical agency, other unit of national government, other level of government, academic institution, or commercial entity. Table 5.1 provides all of this information as best I can.

An ideal source for statistical agency use, other things equal, is one that is provided, designed, and controlled by the agency, and for which errors can be identified and measured and are generally under control, such as a high-quality probability survey mounted by the agency. At the other extreme is a data source that is controlled by one or more private companies (e.g., scanner data) or, perhaps, by hundreds or thousands of local governments (e.g., traffic cameras), where the data result from autonomous choices or uncontrolled movements, and where it is difficult to conceptualize, much less measure, errors in the data source. Yet, given a statistical agency’s responsibility to provide relevant, timely, and accurate statistics for policymakers and the public while minimizing costs and respondent burden, there may well be non-survey data sources that warrant the effort to make them usable for statistical purposes. I argue that the threats to the survey paradigm reviewed above make it imperative to consider alternative data sources because surveys are no longer always and everywhere demonstrably superior to other sources - they are not always “high” on the dimensions in Table 5.1.

I further argue that government administrative records, which, as Table 5.1 indicates, more often have desirable properties for official statistics than other non-survey data sources, should be a prime candidate for statistical agencies to incorporate as extensively as possible into their survey programs if they have not already done so. Administrative records are generated according to rules - rules about the eligible population, who must file what information, what action the pertinent administrative body takes on the basis of the information (e.g., tax refund, benefit payment), and so on. This fact should make it possible, with the requisite effort, for a statistical agency to become as familiar with administrative records error structures as it is with total survey error. Couper (2013) provides a useful discussion along lines similar to mine. He pokes holes in the ability of organic data sources to be as useful as they are often touted to be, much less to replace probability surveys, but he warns survey researchers that they ignore organic data sources at their peril. Ironically, his conclusion that some use should be made of organic sources is strengthened by his error in classifying administrative records as organic data: they are properly classified as designed data, even though they are not designed by a statistical agency.
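
To illustrate how rule-based generation makes administrative error structures tractable, the following is a minimal sketch of automated edit checks of the kind an agency might run on an administrative extract. The record layout, eligibility age, and rules are hypothetical stand-ins for the legal and program rules that govern a real records system.

```python
# Hypothetical benefit records; a real extract would have many more fields.
records = [
    {"id": 1, "age": 67, "benefit_type": "retirement", "monthly_benefit": 1450.0},
    {"id": 2, "age": 15, "benefit_type": "retirement", "monthly_benefit": 900.0},   # fails eligibility rule
    {"id": 3, "age": 70, "benefit_type": "retirement", "monthly_benefit": -25.0},   # fails range rule
]

def edit_checks(rec):
    """Return a list of program-rule violations for one record."""
    failures = []
    # Hypothetical eligibility rule: retirement benefits start at age 62.
    if rec["benefit_type"] == "retirement" and rec["age"] < 62:
        failures.append("age below retirement-benefit eligibility threshold")
    # Range rule: benefit amounts cannot be negative.
    if rec["monthly_benefit"] < 0:
        failures.append("negative benefit amount")
    return failures

for rec in records:
    for failure in edit_checks(rec):
        print(f"record {rec['id']}: {failure}")
```

Because the rules are written down in law, regulation, or policy, checks like these can be enumerated systematically, which is precisely what is much harder to do for free-form Internet interactions.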

Table 5.1
Ranking (HIGH, MEDIUM, LOW, VERY LOW, or VARIES) of four data sources on dimensions for use in official statistics

The four data sources compared are: Census/Probability Survey (e.g., CPS/ASEC, ACS, NHIS - see Table 2.1); Administrative Records (e.g., income taxes, Social Security, unemployment, payroll); Commercial Transaction Records (e.g., scanner data, credit card data); and Individual Interactions with the Internet (e.g., Twitter postings, Google search term volumes).

Degree of Control by/Accessibility to Statistical Agency
- Census/Probability Survey: HIGH (survey conducted for statistical agency); MEDIUM to LOW (survey conducted for private organization)
- Administrative Records: HIGH to MEDIUM (national agency records); MEDIUM to LOW (state or local records); MEDIUM to LOW (commercial records)
- Commercial Transaction Records: MEDIUM to LOW
- Internet Interactions: VERY LOW

Degree of Ability of Statistical Agency to Identify/Assess Properties/Errors
- Census/Probability Survey: HIGH (survey conducted for statistical agency); VARIES (survey conducted for private organization, depending on documentation and transparency)
- Administrative Records: HIGH to MEDIUM (national agency records); MEDIUM to LOW (state or local records); MEDIUM to LOW (commercial records)
- Commercial Transaction Records: MEDIUM (to the extent that records follow accepted standards, e.g., for bar coding and pricing information)
- Internet Interactions: VERY LOW

Data Quality Attributes (Biemer et al. 2014)

Relevance for Policy and Public - Concepts and Measures
- Census/Probability Survey: HIGH for survey conducted for statistical agency, assuming well designed and up to date in concepts and measures; VARIES for surveys for private organizations
- Administrative Records: VARIES across and within records systems (e.g., records of benefit payments may be highly relevant, while family composition information may use a different concept)
- Commercial Transaction Records: VARIES
- Internet Interactions: VARIES, but VERY LOW at the present state of the art of acquiring, evaluating, and analyzing these kinds of data

Relevance - Useful Covariates
- Census/Probability Survey: HIGH for most surveys
- Administrative Records: VARIES, but rarely as high as for most surveys
- Commercial Transaction Records: VARIES, but rarely as high as for most surveys
- Internet Interactions: VARIES, but typically LOW

Frequency of Data Collection
- Census/Probability Survey: Weekly to every few years (every decade for the U.S. population census); some private surveys, such as election polls, may run daily
- Administrative Records: Records are generally updated frequently (e.g., daily) and continually
- Commercial Transaction Records: Records are generally updated frequently (e.g., at the moment of transaction or daily) and continually
- Internet Interactions: Interactions are captured instantaneously

Timeliness of Release
- Census/Probability Survey: VARIES, depending on the effort of the statistical agency or private organization, but some lag from the reference period for responses is inevitable
- Administrative Records: VARIES, but some lag from the reference date to when records are acquired by the statistical agency is likely
- Commercial Transaction Records: VARIES, but long lags in the statistical agency acquiring proprietary data are likely
- Internet Interactions: VARIES, but long lags are likely (although the MIT Billion Prices Project has worked out very timely access to prices on the Internet; see bpp.mit.edu)

Comparability and Coherence
- Census/Probability Survey: HIGH across time and geography within a survey (except when deliberately changed or if societal change that affects measurement is not taken into account); VARIES among surveys
- Administrative Records: HIGH within a records system (changes to government records are generally heralded by legal/regulatory/policy change; changes to commercial records are likely opaque); VARIES among records systems
- Commercial Transaction Records: HIGH within a records system (changes generally opaque to the statistical agency); VARIES among records systems
- Internet Interactions: VERY LOW, in that vendors (e.g., Twitter) may add or subtract features or drop an entire product; changes are generally opaque to the statistical agency; initiators of interactions may have very different frames of reference

Accuracy (Components of Error)

Frame Error
- Census/Probability Survey: VARIES; can be significant undercoverage and overcoverage
- Administrative Records: Frame is usually well defined by law, regulation, or policy; the problem for statistical agency use is that the frame may not be comprehensive
- Commercial Transaction Records: Frame is ill-defined for statistical agency purposes, in that it represents whoever had a purchase scanned by a specified vendor or used a specific credit card for a purchase during a specified time; poses a significant challenge for the statistical agency to determine appropriate use
- Internet Interactions: Frame is ill-defined for statistical agency purposes, in that it represents whoever decided to, for example, set up a Twitter account or conduct a Google search during a specified time; poses a significant challenge for the statistical agency to determine appropriate use

Nonresponse (unit and item)
- Census/Probability Survey: VARIES; can be significant
- Administrative Records: VARIES (e.g., Social Security records are likely to include almost all eligible people, but income tax records are likely to reflect evasion, in terms of failure to file a return or concealing some income)
- Commercial Transaction Records: NOT APPLICABLE, in that “respondents” are self-selected; the statistical agency challenge is to determine an appropriate use that does not need to assume a probability mechanism
- Internet Interactions: NOT APPLICABLE, for the same reason as commercial transaction records

Measurement Error
- Census/Probability Survey: VARIES within surveys by item and among surveys for comparable items; often not well assessed, even for statistical agency surveys
- Administrative Records: VARIES among record systems and within record systems by item, depending on the centrality of the item to program operation (e.g., a benefit payment item is likely more accurate than items obtained from beneficiaries, such as employment status)
- Commercial Transaction Records: NOT APPLICABLE to the data source as such, although any characteristics added by the vendor from another source may or may not be valid; the statistical agency challenge is not to introduce measurement error through inappropriate use of the data
- Internet Interactions: NOT APPLICABLE, for the same reason as commercial transaction records

Data Processing Error
- Census/Probability Survey: VARIES (e.g., may be data capture or recoding errors), but is usually under good statistical control, although harder to assess for private organization surveys
- Administrative Records: VARIES (e.g., may be keying or coding errors); likely to be under better control for key variables (e.g., benefit payments) than for other variables, but hard for the statistical agency to assess
- Commercial Transaction Records: VARIES (e.g., may be errors in assigning bar codes or prices); likely to be under good control, but hard for the statistical agency to assess
- Internet Interactions: NOT APPLICABLE, in that error is not defined, although there may be occasional problems of the sort that, say, a day’s worth of Twitter posts is overwritten and lost

Modeling/Estimation Error
- Census/Probability Survey: Bias from such processes as weighting and imputation VARIES; often intense effort by the statistical agency to design procedures well initially but not to revisit their continued validity
- Administrative Records: NOT APPLICABLE (usually), in that records are “raw” data, except perhaps for some recoded variables, but bias may be introduced by statistical agency reprocessing
- Commercial Transaction Records: NOT APPLICABLE (usually), in that records are “raw” data, except perhaps for some recoded or summarized variables, but bias may be introduced by statistical agency reprocessing
- Internet Interactions: NOT APPLICABLE (usually), in that records are “raw” data, but statistical agency reprocessing may introduce significant bias (e.g., by treating the word “fired” as always indicating unemployment in analyzing Twitter posts)

Specification Error
- Census/Probability Survey: VARIES (e.g., self-reported health status may validly indicate the respondent’s perception but not necessarily diagnosed physical or mental health); may change over time (e.g., as word usage changes among the public)
- Administrative Records: VARIES; can be significant when the administrative records concept differs from what the statistical agency needs (e.g., rules for reporting earnings on tax forms may leave out such components as cafeteria benefits)
- Commercial Transaction Records: VARIES; can be low or high depending on how well the data correspond to statistical agency needs
- Internet Interactions: VARIES, but likely significant at the present state of the art of acquiring, evaluating, and analyzing these kinds of data, which arise from relatively free-form choices of autonomous individuals

Burden
- Census/Probability Survey: VARIES; can be high
- Administrative Records: NO ADDITIONAL BURDEN from the statistical agency on the relevant population (e.g., beneficiaries), but burden on the administrative agency
- Commercial Transaction Records: NO ADDITIONAL BURDEN from the statistical agency on the relevant population (e.g., shoppers), but burden on the vendor
- Internet Interactions: NO ADDITIONAL BURDEN from the statistical agency on the relevant population (e.g., Twitter posters), but burden on the vendor

Cost
- Census/Probability Survey: VARIES; can be high; the statistical agency bears the full costs of design, collection, processing, and estimation
- Administrative Records: VARIES, but could be lower than for a comparable survey because the administrative agency bears data collection costs; the statistical agency likely incurs costs of special processing/handling
- Commercial Transaction Records: VARIES, as for administrative records, but the vendor is likely to want payment; the statistical agency likely incurs costs of special processing/handling/analyzing
- Internet Interactions: VARIES, as for administrative records, but the vendor is likely to want payment; additional statistical agency costs for processing/analyzing unstructured data may be high

5.3 Uses of administrative records for household survey-based programs

Household survey respondents have demonstrated time and time again that their responses to many important questions on income, wealth, expenditures, and other topics are not very accurate. Use of administrative records has the potential in many instances to remedy this situation. An alternative strategy of many U.S. household survey programs has been to encourage respondents to consult their own records, such as tax returns, when answering questions on income and similar topics. Certainly, answers are likely to be more accurate when records are consulted, as Johnson and Moore (no date) find in a comparison of income tax records with Survey of Consumer Finances (SCF) responses for the 2000 tax year. However, the strategy itself appears to be largely an exercise in futility: the same SCF study by Johnson and Moore reports that only 10 percent of households with an adjusted gross income of less than $50,000 consulted records and that only 22 percent of higher-income households did so. See National Research Council (2013a, pp. 89-91) and Moore, Marquis and Bogen (1996) for similar findings about the difficulties of getting respondents to consult records.

Turning to strategies for statistical agencies to work with administrative data directly, I identify eight ways in which administrative records can contribute to household survey data quality:

  1. assist in the evaluation of survey data quality, by comparison with aggregate estimates, appropriately adjusted for differences in population universes and concepts, and by exact matches of survey and administrative records;

  2. provide control totals for adjusting survey weights for coverage errors;

  3. provide supplemental sampling frames for use in a multiple-frame design;

  4. provide additional information to append to matched survey records to enhance the relevance and usefulness of the data;

  5. provide covariates for model-based estimates for smaller geographic areas than the survey can support directly;

  6. improve models for imputing missing data in survey records;

  7. replace “no” for survey respondents who should have reported an item, replace “yes” for survey respondents who should not have reported an item, and replace reported values for survey respondents who misreport an item;

  8. replace survey questions and use administrative records values directly.

In a longer unpublished version of this article, I provide some current and potential examples of each type of use and identify benefits, confidentiality and public perception concerns, and limitations and feasibility issues for each use, both generically and specifically for U.S. household surveys on such topics as income, assets, and expenditures. My bottom line is that the benefits should outweigh the drawbacks, given a sustained program to integrate administrative records systems with statistical programs.
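
As an illustration of use (2), the following sketch post-stratifies survey weights to administrative control totals. It is a minimal example, not a production method: the microdata, the single age-group stratifier, and the control totals are invented, and a real application would first adjust the administrative counts to the survey's population universe and might calibrate over several margins (e.g., by raking).

```python
import pandas as pd

# Toy survey microdata: base weights and an age-group post-stratifier.
survey = pd.DataFrame({
    "age_group": ["18-44", "18-44", "45-64", "45-64", "65+"],
    "weight":    [100.0,   120.0,   90.0,    110.0,   80.0],
})

# Hypothetical control totals from administrative records (e.g., benefit
# program counts), already adjusted to the survey's population universe.
admin_totals = {"18-44": 260.0, "45-64": 180.0, "65+": 110.0}

# Post-stratification: scale weights within each cell so the weighted
# survey count matches the administrative control total.
cell_sums = survey.groupby("age_group")["weight"].transform("sum")
survey["adj_weight"] = survey["weight"] * survey["age_group"].map(admin_totals) / cell_sums

print(survey)
print(survey.groupby("age_group")["adj_weight"].sum())  # matches admin_totals
```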

5.4 Potential uses of non-traditional data sources

Having previously indicated that data from sources other than surveys and administrative records are problematic in a number of ways for official statistics, I would be remiss not to discuss briefly why such data appear to be so attractive. Private companies have very different loss functions from statistical agencies - they are seeking an edge over competitors. Data that are more timely and that identify ways to increase sales and profits are likely useful to a private company, even if they do not cover a population completely or have other drawbacks for official statistics. From this perspective, the kinds of experiments that a company such as Google conducts, using its own “big data”, on ways to increase ad views are good investments (see, e.g., McGuire, Manyika and Chui 2012). Similarly, program agencies at all levels of government, often working with academic centers, are putting together and analyzing their own and other data in innovative ways to identify patterns, “hot spots”, and the like, not only for improving their programs and planning new services, but also for prioritizing resources and improving response in real time (see, e.g., the Center for Urban Science and Progress at New York University (http://cusp.nyu.edu/) and the Urban Center for Computation and Data at the University of Chicago (https://urbanccd.org)).

Statistical agencies need, above all, sources of data that cover a known population with error properties that are reasonably well understood and that are not likely to change under their feet - characteristics that are not inherent in such data sources as autonomous interactions with websites on the Internet. There are, however, at least two ways in which household survey-based statistical agency programs could obtain an “edge” from non-traditional sources: one is to improve timeliness for preliminary estimates of key statistics; and the other is to provide leading indicators of social change (e.g., the emergence of new occupations and fields of training) that alert statistical agencies to needed changes in their concepts and measures.
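
To make the first of these two uses concrete, here is a minimal nowcasting sketch (not from the paper): a lagged official series is regressed on a timelier auxiliary indicator, such as a web-derived price or search-volume index, and the latest auxiliary value is used to produce a preliminary estimate. All numbers are made up, and a real application would require careful validation of the auxiliary source's stability over time.

```python
import numpy as np

# Hypothetical data: an official quarterly series available with a lag
# (quarters 1-8) and a timely auxiliary indicator already observed
# through quarter 9.
official = np.array([2.1, 2.3, 2.2, 2.6, 2.8, 2.7, 3.0, 3.1])
auxiliary = np.array([1.0, 1.2, 1.1, 1.4, 1.6, 1.5, 1.8, 1.9, 2.0])

# Fit official_t = a + b * auxiliary_t on the overlapping quarters ...
X = np.column_stack([np.ones(len(official)), auxiliary[:len(official)]])
coef, *_ = np.linalg.lstsq(X, official, rcond=None)

# ... then use the latest auxiliary value to produce a preliminary
# estimate for the quarter the official series has not yet reached.
nowcast = coef[0] + coef[1] * auxiliary[-1]
print(f"preliminary estimate for quarter 9: {nowcast:.2f}")
```

The point of such a model is not to replace the official series but to buy timeliness while the designed source catches up; the auxiliary series' unstable frame and opaque changes, discussed above, are exactly why validation against the official series must be ongoing.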

