3. Chinks in the armor: Rising threats to the survey paradigm

Constance F. Citro

Probability surveys are indispensable tools for official statistical agencies and others for many kinds of measures - for example, to track such phenomena as public approval of the U.S. president or expressed feelings of well-being. Moreover, probability surveys whose primary purpose is to measure constructs, such as household income, that could also be obtained from other sources have two major advantages: (1) they can obtain a wide variety of covariates for use in analysis of the primary variable(s) of interest, and (2) they are under the control of the survey designer. Yet threats to the probability survey paradigm are snowballing in ways that bode ill for the future. Manski (2014) goes so far as to accuse statistical agencies of sweeping major problems with their data under the rug and of markedly understating the uncertainty in their estimates. He labels survey nonresponse as an example of “permanent uncertainty”.

3.1 Characterizing survey quality

A typology of errors and other problems that can compromise the quality of survey estimates is essential for understanding and improving official statistics. A seminal paper in developing data quality frameworks was Brackstone (1999). Most recently, Biemer, Trewin, Bergdahl and Japec (2014) reviewed the literature on systematic quality frameworks, noting, in particular, the six dimensions proposed by Eurostat (2000): relevance, accuracy, timeliness and punctuality, accessibility and clarity, comparability (across time and geography), and coherence (consistent standards). Iwig, Berning, Marck and Prell (2013) reviewed quality frameworks from Eurostat, the Australian Bureau of Statistics, the UK Office for National Statistics, Statistics Canada, and other organizations and developed questions based on six quality dimensions of their own devising - relevance, accessibility, coherence, interpretability, accuracy, and institutional environment - for U.S. statistical agencies to use in assessing the utility of administrative records. Daas, Ossen, Tennekes and Nordholt (2012) constructed a framework for evaluating the use of administrative records to produce census data for the Netherlands.

Biemer et al. (2014) went further by using the Eurostat framework (combining comparability and coherence into a single dimension) as the basis for designing, testing and implementing a system of numerical assessments for evaluating and continually improving data product quality at Statistics Sweden. For a full assessment, it would also be necessary to evaluate quality dimensions against cost and respondent burden. Usefully for my purposes, Biemer et al. decomposed the dimension of “accuracy”, conceived of as total survey error (or total product error for non-survey-based statistical programs such as national accounts), into sampling error and seven types of nonsampling error: (1) frame error, including undercoverage and overcoverage and missing or erroneous auxiliary variables on the frame; (2) nonresponse error (unit and item); (3) measurement error (overreporting, underreporting, other); (4) data processing error; (5) modeling/estimation error, such as from fitting models for imputation or adjusting data values to conform to benchmarks; (6) revision error (the difference between preliminary and final published estimates); and (7) specification error (the difference between the true, unobservable variable and the observed indicator). For ongoing surveys, I would add outmoded construct error, which is related to but different from specification error. For example, the Census Bureau’s regular money income concept for official household income and poverty estimates from the CPS Annual Social and Economic Supplement (ASEC) has become progressively outdated due to changing U.S. tax and transfer programs (see, e.g., Czajka and Denmead 2012; National Research Council 1995).
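
Schematically, and simplifying greatly, the accuracy dimension can be thought of as the mean squared error of an estimate, with sampling error contributing variance and the nonsampling sources listed above contributing, for the most part, bias. The grouping below is an illustrative sketch under that simplification, not Biemer et al.’s formal notation; in practice several nonsampling sources contribute variance as well:

```latex
\mathrm{MSE}(\hat{\theta})
  \;=\; \underbrace{\mathrm{Var}(\hat{\theta})}_{\text{sampling error}}
  \;+\; \bigl( B_{\text{frame}} + B_{\text{nonresponse}} + B_{\text{measurement}}
        + B_{\text{processing}} + B_{\text{model}} + B_{\text{revision}}
        + B_{\text{specification}} \bigr)^{2}
```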

3.2 Four sources of error in U.S. household statistics

3.2.1 Frame deficiencies

Obtaining a comprehensive, accurate frame for surveys can be as difficult as obtaining responses from sample cases drawn from the frame and, in many instances, the difficulties have persisted and even grown over time. The problem of frame deficiencies would have resonated with Joe Waksberg: not only did he, with Warren Mitofsky, develop the random digit dialing (RDD) method for generating frames and samples for high-quality residential telephone surveys in the 1970s (see Waksberg 1978; Tourangeau 2004), but he also saw the beginnings of the method’s decline in popularity because of such phenomena as cell-phone-only households.

A commonly used frame for U.S. household surveys is the Census Bureau’s Master Address File (MAF) developed for the decennial census. The past few censuses have obtained increasingly good net coverage of residential addresses on the MAF, particularly for occupied units (Mule and Konicki 2012). The persistent problem for household surveys is undercoverage of individual members within sampled units. Coverage ratios (i.e., survey-based population estimates before ratio adjustment, divided by independent population controls) in the March 2013 CPS, for example, are only 85 percent for the total population, and there are marked differences between men and women, older and younger people, and whites and minorities, with coverage ratios as low as 61 percent for black men and women ages 20-24 (see http://www.census.gov/prod/techdoc/cps/cpsmar13.pdf [November 2014]). No systematic study of the time series of coverage ratios for U.S. household surveys has been conducted, but there is evidence that the ratios have been getting worse.

While useful to correct coverage errors for age, gender, race and ethnicity groups, the current household survey ratio adjustments undoubtedly fail to correct for other consequential coverage differences. (The ratio-adjustment controls, in one of the least controversial and most long-standing uses of administrative records in U.S. household surveys, derive from population estimates developed from the previous census updated with administrative records and survey data.) Thus, everything that is known about undercount in the U.S. decennial census indicates that, holding race and ethnicity constant, socioeconomically disadvantaged populations are less well counted than others (see, e.g., National Research Council 2004, App. D). It is unlikely that household surveys perform any better - for example, Czajka, Jacobson and Cody (2004) find that the Survey of Income and Program Participation [SIPP] substantially underrepresents high-income families compared with the Survey of Consumer Finances [SCF], which includes a list sample of high-income households drawn from tax records. Factoring in differential socioeconomic coverage, Shapiro and Kostanich (1988) estimate from simulations that poverty is significantly biased downward for black males in the CPS/ASEC. On the other hand, by comparison with the 2000 census long-form sample, Heckman and LaFontaine (2010) find that survey undercoverage in the 2000 CPS October educational supplement contributes little to underestimates of high school completion rates; other factors are more important.
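
To make the mechanics concrete, the sketch below computes cell-level coverage ratios - the weighted survey estimate of a population group before adjustment, divided by the independent population control for that group - and the corresponding ratio-adjustment factors applied to survey weights. The cell labels and numbers are hypothetical, chosen only to mirror the low ratios cited above, and are not actual CPS figures.

```python
def coverage_and_adjustment(survey_weight_totals, population_controls):
    """Coverage ratio and ratio-adjustment factor for each control cell.

    survey_weight_totals: sum of survey weights by cell, before adjustment.
    population_controls: independent population estimates for the same cells.
    The adjustment factor is the reciprocal of the coverage ratio, so that
    adjusted weights reproduce the population control within each cell.
    """
    results = {}
    for cell, control in population_controls.items():
        survey_total = survey_weight_totals[cell]
        results[cell] = {
            "coverage_ratio": survey_total / control,
            "adjustment_factor": control / survey_total,
        }
    return results

# Hypothetical cells (counts in thousands): weighted survey counts fall short
# of the controls, most severely for the second group.
print(coverage_and_adjustment(
    {"women 40-44": 850, "black men 20-24": 610},
    {"women 40-44": 1000, "black men 20-24": 1000},
))
```

As the reciprocal relationship makes plain, ratio adjustment can only correct coverage differences across the cells for which controls exist, which is why the socioeconomic differences in coverage discussed above go largely uncorrected.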

3.2.2  Unit response in secular decline

A study panel of the (U.S.) National Research Council (2013b) recently completed a comprehensive review of causes and consequences of household survey unit nonresponse, documenting the well-known phenomenon that the public is becoming less available and less willing to respond to surveys, even surveys from well-trusted official statistical agencies. In the United States, there was evidence as early as the 1980s that response rates had been declining almost from the beginning of the widespread use of probability sample surveys (see, e.g., Steeh 1981; Bradburn 1992). De Leeuw and De Heer (2002) estimated a secular rate of decline in survey cooperation of 3 percentage points per year from examining ongoing surveys in 16 Western countries from the mid-1980s through the late 1990s. The cooperation rate measures the response of eligible sample cases actually contacted; response rates (there are several accepted variations) have broader denominators, including eligible cases that were not reached (National Research Council 2013c, pp. 9-12). National Research Council (2013b: Tables 1-2, 104) provides initial or screener response rates for a range of U.S. official surveys for 1990/91 (after response rates had already fallen significantly for many surveys) and 2007/2009, which make clear that the problem is not going away.
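
In simplified form - ignoring partial interviews and cases of unknown eligibility, which the standard definitions treat explicitly - the distinction is one of denominators. With I completed interviews, R refusals and break-offs among contacted eligible cases, and NC eligible cases never contacted:

```latex
\text{cooperation rate} \;\approx\; \frac{I}{I + R},
\qquad
\text{response rate} \;\approx\; \frac{I}{I + R + NC}.
```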

It was long assumed that lower response rates, even with nonresponse weighting adjustments, inevitably entailed bias in survey estimates. Recent research (see, e.g., Groves and Peytcheva 2008) finds that the relationship between nonresponse and bias is complex, and that extraordinary efforts to increase response can inadvertently increase bias by obtaining greater response from only some groups and not others (see, e.g., Fricker and Tourangeau 2010). It would be foolhardy, however, for official statistical agencies to assume that increasing nonresponse has little or no effect on the accuracy of estimates, particularly when unit nonresponse is coupled with item nonresponse. For example, nonrespondents to health surveys are estimated to have poorer health on average than respondents, and nonrespondents to volunteering surveys are estimated to be less likely to volunteer than respondents (National Research Council 2013b, pp. 44-45). Moreover, there has been little research on the effects of nonresponse on bivariate or multivariate associations or on variance, except for the obvious - and not unimportant - effect that unit nonresponse reduces effective sample size.
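
The familiar deterministic expression for nonresponse bias, which underlies meta-analyses such as Groves and Peytcheva (2008), helps explain why the response rate alone is a weak predictor of bias. If m of the n eligible sample cases do not respond, the bias of the unadjusted respondent mean is

```latex
\mathrm{Bias}(\bar{y}_r) \;=\; \frac{m}{n}\,\bigl(\bar{y}_r - \bar{y}_m\bigr),
```

so a high nonresponse rate m/n induces little bias when respondents and nonrespondents differ little on the variable of interest, while a modest nonresponse rate can induce substantial bias when they differ greatly - and efforts that raise response only among groups already well represented can widen that difference, and hence the bias.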

3.2.3  Item response often low and declining

Neither sample surveys nor censuses can be expected to obtain answers from unit respondents to every item on a questionnaire. U.S. census practice has long been to edit some items for consistency, but until the mid-twentieth century there were no adjustments for item nonresponse - tables simply included rows labeled “no response” or similar wording. The first use of imputation occurred in 1940, when Deming developed a “cold deck” procedure to impute age by randomly selecting a value from an appropriate deck of cards chosen according to what other information was known about the person whose age was missing. Beginning in 1960, with the advent of high-speed computers, “hot deck” imputation methods were used to impute missing values for many census items (Citro 2012). The hot deck procedure uses the most recently processed value for a similar person or household, stored in a matrix of cells defined by known characteristics, and consequently does not have to assume that data are missing completely at random (MCAR), although it does have to assume that data are missing at random (MAR) within the categories defined by the variables in the hot deck matrix. Model-based methods of imputation have been developed that do not require such strong assumptions as MAR or MCAR (see National Research Council 2010b), but they are not widely used in U.S. household surveys. Two exceptions are the Survey of Consumer Finances (SCF) (Kennickell 2011) and the Consumer Expenditure (CE) Interview Survey (Passero 2009).
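
A minimal sketch of a sequential hot deck of the kind described above, for a single variable imputed within cells defined by a few covariates, appears below; the field names and cell variables are illustrative, not the Census Bureau’s actual hot deck matrix.

```python
def sequential_hot_deck(records, impute_field, cell_fields, cold_deck=None):
    """Impute missing values of impute_field in processing order.

    Each record's adjustment cell is the tuple of its cell_fields values.
    A reported value updates the stored donor value for that cell; a missing
    value is filled from the most recently stored donor (falling back to an
    optional cold-deck value if no donor has yet been seen for the cell).
    """
    donors = {}   # cell tuple -> most recently reported value (the "hot deck")
    filled = []
    for rec in records:
        rec = dict(rec)  # copy so the input records are left untouched
        cell = tuple(rec[f] for f in cell_fields)
        value = rec.get(impute_field)
        if value is None:  # item nonresponse: borrow the latest donor value
            rec[impute_field] = donors.get(cell, (cold_deck or {}).get(cell))
            rec["imputed"] = True
        else:              # a reported value becomes the new donor for its cell
            donors[cell] = value
            rec["imputed"] = False
        filled.append(rec)
    return filled

# Illustrative use: impute age within sex-by-relationship cells.
sample = [
    {"sex": "F", "relationship": "householder", "age": 47},
    {"sex": "F", "relationship": "householder", "age": None},  # gets 47
    {"sex": "M", "relationship": "child", "age": None},  # no donor yet; stays missing
]
print(sequential_hot_deck(sample, "age", ("sex", "relationship")))
```

Because donors come only from the same cell, the procedure needs only the MAR-within-cells assumption noted above, but its quality depends entirely on how well the cell variables predict the missing item.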

Whatever the method, imputation has the advantage of creating a complete data record for every respondent, which facilitates multivariate analysis and forestalls the likelihood that researchers will treat missing data with different methods that give different results. Yet imputation may introduce bias into estimates, and the significance of any bias will likely be magnified by the extent of missing data. So it is troubling that nonresponse has been increasing for important items on household surveys, such as income, assets, taxes and consumer expenditures, which require respondents to supply dollar amounts. Czajka (2009, Table A-8), for example, compares item imputation rates for total income and several sources of income for the CPS/ASEC and SIPP for 1993, 1997 and 2002: a full one-third of income is currently imputed on the CPS/ASEC, up from about one-quarter in 1993, and SIPP is not much better. Clearly, with such high imputation rates, careful evaluation of the effects of imputation procedures is imperative. Hokayem, Bollinger and Ziliak (2014), for example, estimate that the hot deck imputation procedure for earnings in the CPS/ASEC has consistently underestimated poverty by an average of one percentage point, based on evaluating missing earnings in both the CPS/ASEC and Social Security earnings records.

3.2.4  Measurement error problematic and not well studied

Even with complete reporting or, more commonly, adjustments for unit and item nonresponse, there will still be error in survey estimates from inaccurate reporting by respondents, whether from guessing at the answer, deliberately failing to provide a correct answer, or not understanding the intent of the question. While acknowledged by statistical agencies, the extent of measurement error is typically less well studied than sampling error or the extent of missing data. Many measurement error studies compare aggregate estimates from a survey with similar estimates from another survey or an appropriate set of administrative records, adjusted as far as possible to be comparable. It is not possible to sort out from such studies the part played by measurement error as opposed to other factors, but the results indicate the magnitude of the problem. Some studies are able to match individual records and thereby examine components of measurement error.
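
The basic quantity in these aggregate comparisons is a completeness (or net reporting) ratio: the weighted survey aggregate divided by a benchmark aggregate adjusted, as far as possible, to the survey’s universe and concepts. A minimal sketch, with hypothetical argument names and no real data:

```python
def completeness_ratio(weights, reported_amounts, adjusted_benchmark_total):
    """Weighted survey aggregate as a share of an adjusted benchmark total.

    weights and reported_amounts are parallel sequences over respondents
    (after any imputation); adjusted_benchmark_total is the administrative
    or national-accounts aggregate restricted to the survey's universe.
    A ratio well below 1.0 signals net underreporting and/or undercoverage,
    which aggregate comparisons alone cannot disentangle.
    """
    survey_total = sum(w * y for w, y in zip(weights, reported_amounts))
    return survey_total / adjusted_benchmark_total
```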

Significant measurement error is known to affect key socioeconomic estimates produced from U.S. household surveys. A legion of studies has documented net underestimation of U.S. household income in survey after survey and, even more troubling, a decline in the completeness of reporting, even after imputation and weighting. Fixler and Johnson (2012, Table 2), for example, estimated that between 1999 and 2010, mean and median income estimates from the CPS/ASEC fell progressively below the National Income and Product Accounts (NIPA) estimates due to such factors as: (1) underrepresentation of very high-income households in the CPS/ASEC sample; (2) nonreporting and underreporting by those high-income households that are included; and (3) nonreporting and underreporting by middle and lower income households. Studies of individual income sources find even larger errors. Meyer and Goerge (2011), for example, by matching Supplemental Nutrition Assistance Program (SNAP) records in two states, find that almost 35 percent of true recipients do not report receiving benefits in the American Community Survey (ACS) and almost 50 percent do not report them in the CPS/ASEC. Similarly, Meyer, Mok and Sullivan (2009) document large and often increasing discrepancies between survey estimates and appropriately adjusted administrative records estimates of income recipients and total amounts for many sources.

Wealth is notoriously difficult to measure in household surveys, and many do not attempt to do so. Czajka (2009, pp. 143-145) summarizes research on the quality of SIPP estimates of wealth by comparison with the SCF and the Panel Study of Income Dynamics (PSID). Greatly simplifying the findings, SIPP historically has been fairly effective in measuring liabilities, such as mortgage debt, and the value of such assets as owned homes, vehicles, and savings bonds. SIPP has done poorly in measuring the value of assets held mostly by higher-income households, such as stocks, mutual funds, and IRA and Keogh accounts, whereas the PSID has done somewhat better. On net, SIPP significantly underestimates net worth.

A National Research Council (2013a) study of the BLS CE Interview and Diary Surveys found differential quality of reporting of various expenditure types compared with appropriately adjusted personal consumption expenditure (PCE) estimates from the NIPA. Bee, Meyer and Sullivan (2012, Table 2) also find declines in reporting for some expenditures - for example, the CE household estimate of gasoline spending declined from over 100 percent of the comparable PCE estimate in 1986 to just under 80 percent in 2010, while reporting of furniture and furnishings declined from 77 percent to 44 percent over a comparable period.
