Disclosure control

Scope and purpose
Principles
Guidelines
Quality indicators
References

Scope and purpose

Disclosure control refers to the measures taken to protect data in accordance with confidentiality requirements. The goal is to ensure that the confidentiality protection provisions are met while preserving the usefulness of the data outputs to the greatest extent possible. Statistics Canada's vigilant disclosure control and confidentiality protection program contributes greatly to data quality. In fact, the high response rates to the Agency's surveys and the public's confidence in the Agency are in large measure attributable to this program.

Principles

The principles of disclosure control activities are almost entirely governed by the provisions of the Statistics Act (1970, R.S.C. 1985, c. S-19), specifically paragraph 17(1)(b):
No person who has been sworn under section 6 shall disclose or knowingly cause to be disclosed, by any means, any information obtained under this Act in such a manner that it is possible from the disclosure to relate the particulars obtained from any individual return to any identifiable individual person, business or organization.

The Statistics Act's confidentiality provisions are extremely rigorous. Consequently, enforcing them in specific cases is a difficult, though extremely important, task. The first goal is to ensure that no identifiable personal information can be inferred, even approximately, that is, to within a limited range of values. Moreover, information must be protected whether or not respondents might consider the subject confidential. Finally, how the public perceives the vigilance with which we protect the confidentiality of statistics is at least as important as the actual measures we take to prevent respondents' data from being disclosed.

Guidelines

General

  • Distinguish among the types of data to be processed with each type having its own disclosure control methods. Tabular data are released in the form of statistical tables that are often multi-dimensional. These are further classified as frequency tables and tables of magnitudes. Microdata consist of de-identified records produced for individuals. Finally, some analytical output data can also require disclosure control, particularly if they resemble tabular data (e.g. statistics or histograms) or microdata (e.g. scatter plots or residual values from regressions).

  • Check the Disclosure Control Guidelines (long version) to determine which control methods are the most appropriate for your types of data. Limited access methods include access to data from identified data centres, secure remote access and limited access under licence contracts. Limited release methods protect the data itself by reducing or perturbing information.

  • Do not disclose the parameters and rules used to control disclosure. Knowing these parameters can help determine more accurately the values of certain respondents.

  • Always remember that apparent disclosure can sometimes be just as harmful to the Agency as actual disclosure.

Residual disclosure

  • Consider the risk of residual disclosure. This occurs when confidential data can be estimated by cross-referencing released information with other accessible information, including previous releases by the Agency.

  • In tables, it is sometimes necessary to find complementary cells to suppress to protect confidential cells. Zero-frequency cells can also pose an attribute disclosure problem, since they eliminate certain possibilities (for example, a zero frequency for the "has a job" category). Often, simply suppressing confidential cells is not sufficient when the marginal totals are also released because it may be possible to calculate the exact value of suppressed cells by solving a system of linear equations. Even if that is not possible, one may derive a range of values for suppressed cells using linear programming methods, and that range might be deemed to be too narrow to sufficiently protect the suppressed value.

  • Check whether the categories and hierarchies used by the tables overlap. For example, publishable regions can be subtracted from larger regions, resulting in the publication of a region whose values should be confidential.

  • It is hard to define rules to prevent disclosure by cross-reference when several products are released from the same database, particularly in the case of ad hoc requests or output from data centres. Manual intervention is sometimes required. If data can be released from several centres, releases must be coordinated or at least common release rules must be established.
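The arithmetic behind residual disclosure in tables can be sketched in a few lines. The example below is a minimal illustration in Python, with invented figures and hypothetical function names (`recover_single`, `bounds_2x2`); it is not the method used by Confid or any other Agency system. It assumes non-negative cell values and published row and column totals. The first function shows that a lone suppressed cell is recovered exactly by subtraction; the second shows that, even with complementary suppression in a 2×2 pattern, the totals still confine each suppressed cell to a range.

```python
# Hypothetical illustration: what an outsider can deduce about suppressed
# cells from published marginal totals. All figures are invented.

def recover_single(row_total, published_cells):
    """A lone suppressed cell in a row is recovered exactly by subtraction."""
    return row_total - sum(published_cells)


def bounds_2x2(row_rem, col_rem):
    """Feasible range for the top-left cell x11 of a 2x2 block of suppressed cells.

    row_rem = (r1, r2): each row total minus the published cells in that row.
    col_rem = (c1, c2): each column total minus the published cells in that column.
    The four unknowns satisfy
        x11 + x12 = r1,  x21 + x22 = r2,  x11 + x21 = c1,  x12 + x22 = c2,
    which leaves one free parameter. Requiring every cell to be non-negative
    bounds x11 from both sides.
    """
    r1, r2 = row_rem
    c1, c2 = col_rem
    if r1 + r2 != c1 + c2:
        raise ValueError("row and column remainders are inconsistent")
    lower = max(0, c1 - r2)  # x22 = r2 - c1 + x11 must be >= 0
    upper = min(r1, c1)      # x12 = r1 - x11 and x21 = c1 - x11 must be >= 0
    return lower, upper


# One suppressed cell in a row with total 100 and published cells 30 and 45.
print(recover_single(100, [30, 45]))   # 25

# Four suppressed cells; the remainders after removing published cells are
# 50 and 12 for the rows and 55 and 7 for the columns.
print(bounds_2x2((50, 12), (55, 7)))   # (43, 50)
```

In the second case the suppressed value is known to lie between 43 and 50, a range that might well be judged too narrow to protect it. For larger suppression patterns the same bounds are obtained by linear programming rather than by hand.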

Microdata

  • Consider disclosure control methods that are appropriate to microdata dissemination. Data reduction methods include sampling, broadening variable categories (in the case of certain identifiable groups, ensure that the population is large enough), top and bottom coding, removing certain variables from some or all respondents and suppressing some respondents in the file. Data modification methods include adding random noise to the microdata, swapping data, replacing individual values in small groups with average values or deleting information from certain respondents and replacing it with imputed values.

  • In longitudinal surveys, define an appropriate release strategy before the survey ends. Releasing microdata files from longitudinal surveys is an even more difficult problem, because the strategy must be developed before all the results of the survey are available, i.e. before data are collected for future waves of the survey. Since one of this strategy's objectives is to define the variables to be released and categorize them, certain assumptions must be made about how those variables evolve over time, particularly if some of them might become key variables.

  • In the case of follow-up or second-phase surveys, if the main survey has released or plans to release a microdata file, ensure that the microdata file does not pose any additional risks by allowing a composite file to be created by linking the microdata from the two surveys. Assess the success rate achieved by linking the two files and, if it is high, the risk posed by such linkage (e.g. what are the consequences of adding identification variables from one survey to the other).

  • In accordance with the Policy on Microdata Release (Statistics Canada, 1987) ensure that the Microdata Release Committee reviews all public use microdata files.
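Three of the microdata methods named above (top coding, replacing values in small groups with their average, and adding random noise) can be sketched as follows. This is a minimal illustration in Python under stated assumptions: the function names, the group size `k`, the cap and the noise level are all hypothetical choices made for the example, not Agency parameters, and the grouping rule (sort, then average consecutive groups of at least `k` values) is only one simple way of forming small groups.

```python
import random


def top_code(values, cap):
    """Top coding: every value above `cap` is replaced by `cap`."""
    return [min(v, cap) for v in values]


def microaggregate(values, k):
    """Replace each value with the mean of its group of at least k neighbours.

    Values are sorted and cut into consecutive groups of k; a final group
    that would be smaller than k is merged into the previous one.
    """
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    n = len(order)
    start = 0
    while start < n:
        end = start + k
        if n - end < k:          # too few left for another full group
            end = n
        group = order[start:end]
        mean = sum(values[i] for i in group) / len(group)
        for i in group:
            result[i] = mean
        start = end
    return result


def add_noise(values, rel_sd, seed=None):
    """Multiply each value by (1 + e), with e drawn from N(0, rel_sd^2)."""
    rng = random.Random(seed)
    return [v * (1 + rng.gauss(0, rel_sd)) for v in values]


incomes = [20, 35, 50, 52, 300]          # invented values
print(top_code(incomes, 100))            # [20, 35, 50, 52, 100]
print(microaggregate(incomes, 2))        # [27.5, 27.5, 134.0, 134.0, 134.0]
print(add_noise(incomes, 0.05, seed=1))  # perturbed copy, reproducible for a given seed
```

Averaging within groups preserves the overall total (457 in this example) while hiding individual values, whereas top coding and noise addition generally change it. In practice the seed, like the other parameters, would not be disclosed.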

Disclosure of certain types of information

  • See subsection 17(2) of the Statistics Act, which provides that certain types of confidential information may be released at the discretion of the Chief Statistician and by order. The most common types of such releases are lists of businesses with their addresses and industrial classifications or information related to respondents who have given their consent in writing (waiver). The release of information under the Chief Statistician's discretionary powers is governed by the Discretionary Release Policy (Statistics Canada, 2004) and, in some cases, by the Guidelines on the Release of Unscreened Microdata under the data sharing agreements described in section 12 or the Act's discretionary information release provisions.

Resources

  • See the confidentiality resources available at Statistics Canada:

    • The Data Access and Control Services Division provides opinions and advice on policies related to the confidentiality of the information collected by Statistics Canada;

    • The Confidentiality and Legislation Committee and its subcommittees, the Disclosure Avoidance Review Group and the Microdata Release Committee, provide disclosure control strategies and practices;

    • The Business Survey Methods Division's Disclosure Control Resource Centre provides technical assistance, as does the generalized systems support team for the Confid software package.

  • Use a generalized disclosure control software package such as Confid rather than customized systems. Such systems reduce the risk of implementation and execution error, the risk of disclosure and the risk of "overprotecting" data, while reducing development costs and times.

Quality indicators

Main quality elements: accuracy, accessibility

In general, disclosure control measures reduce data quality by suppressing data or changing detail levels. Disclosure control can also result in access to data being limited to certain groups such as researchers. Certain methods such as data perturbation can affect the accuracy of information released. Bias might arise from value rounding or noise addition.

It is impossible to guarantee absolute confidentiality. Disclosure control is quite complicated and the rules used to measure the extent of the protection provided are somewhat subjective. Although there is no consensus on quality measures, the two kinds most commonly encountered are risk functions and loss functions.

A loss function measures the extent of the difference between the original data and the data after disclosure control methods have been applied. For altered data (e.g. perturbation), the relative difference between the data before and after adjustment for confidentiality is measured. In the case of suppressed data, the suppression rate indicating the number of values suppressed compared to those released is often used. These indices must be produced at different detail levels and for various respondent groups (e.g. to identify the industrial groups most affected by the suppression). 
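The two loss measures described above can be sketched as follows. This is a minimal illustration in Python with invented data; the function names are hypothetical, and two choices are assumptions made for the example rather than definitions taken from the guidelines: the relative difference is averaged over cells with a non-zero original value, and the suppression rate is computed as suppressed values divided by all values in the group.

```python
from collections import defaultdict


def mean_relative_difference(original, protected):
    """Average of |protected - original| / |original| over non-zero originals."""
    pairs = [(o, p) for o, p in zip(original, protected) if o != 0]
    if not pairs:
        return 0.0
    return sum(abs(p - o) / abs(o) for o, p in pairs) / len(pairs)


def suppression_rate_by_group(cells):
    """Share of suppressed values in each group.

    `cells` is an iterable of (group, value) pairs in which a suppressed
    value is represented by None.
    """
    total = defaultdict(int)
    suppressed = defaultdict(int)
    for group, value in cells:
        total[group] += 1
        if value is None:
            suppressed[group] += 1
    return {group: suppressed[group] / total[group] for group in total}


# Invented example: two perturbed cells and a small table with one suppression.
print(mean_relative_difference([100, 200], [110, 190]))  # about 0.075

cells = [("manufacturing", 10), ("manufacturing", None),
         ("retail", 5), ("retail", 7)]
print(suppression_rate_by_group(cells))  # {'manufacturing': 0.5, 'retail': 0.0}
```

Computing the rate per group, as here, is what makes it possible to see which industrial groups are most affected by suppression.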

To a certain extent, a risk function indicates the risk of identifying respondents or values associated with them. For data suppressed in tables, this generally means identifying the number of suppressed cells for which protection is inadequate (i.e. cells whose suppressed value can be approximated too accurately using information from other cells). In the case of microdata, methods tend to measure the risk of disclosure either by attempting re-identification on a set of characteristic variables (called key variables) or by measuring the success of matching attempts with an external file. Overall, the technique consists of identifying combinations of key-variable values in the released dataset that are unique in the population.
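A very simple risk indicator for microdata is the share of records whose combination of key variables appears only once. The sketch below, in Python, uses invented records and hypothetical names (`unique_key_share`, `sex`, `age_group`, `region`). It counts uniqueness within the released file only, which is at best a rough proxy: a record that is unique in the file is not necessarily unique in the population, and estimating population uniqueness requires additional modelling or an external file.

```python
from collections import Counter


def unique_key_share(records, key_vars):
    """Share of records whose key-variable combination occurs exactly once."""
    if not records:
        return 0.0
    keys = [tuple(record[v] for v in key_vars) for record in records]
    counts = Counter(keys)
    uniques = sum(1 for key in keys if counts[key] == 1)
    return uniques / len(records)


# Invented file of four de-identified records.
records = [
    {"sex": "F", "age_group": "30-39", "region": "A", "income": 41000},
    {"sex": "F", "age_group": "30-39", "region": "A", "income": 58000},
    {"sex": "M", "age_group": "70-79", "region": "A", "income": 23000},
    {"sex": "M", "age_group": "30-39", "region": "B", "income": 67000},
]

print(unique_key_share(records, ["sex", "age_group", "region"]))  # 0.5
```

Broadening categories (for example, merging age groups) lowers this share, which is one way of seeing the trade-off between the risk function and the loss function.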

References

Brackstone, G. and P. White. 2002. "Data stewardship at Statistics Canada." Proceedings of the Social Statistics Section. American Statistical Association. p. 284-293.

Doyle, P., J. Lane, J. Theeuwes and L. Zayatz (eds.) 2001. Confidentiality, Disclosure, and Data Access: Theory and Practical Applications for Statistical Agencies. North-Holland. 462 p.

Elliot, M., A. Hundepool, E. Schulte Nordholt, J.L. Tambay and T. Wende. 2005. Glossary on Statistical Disclosure Control. http://neon.vb.cbs.nl/casc/glossary.htm.

Federal Committee on Statistical Methodology. 2005. Report on Statistical Disclosure Limitation Methodology, Statistical Policy Working Paper 22, Second version. Office of Management and Budget. Washington, D.C.

Hundepool, A., et al. 2008a. τ-ARGUS version 3.3 User's Manual. Voorburg. Statistics Netherlands.

Hundepool, A., et al. 2008b. μ-ARGUS version 4.2 User's Manual. Voorburg. Statistics Netherlands.

Hundepool, A., et al. 2009. Handbook on Statistical Disclosure Control, Version 1.1. EssNet SDC.

Statistics Canada. 1970. The Statistics Act. Ottawa, Canada.

Statistics Canada. 1987. "Policy on Microdata Release." Statistics Canada Policy Manual. Section 4.2. Last updated March 4, 2009.

Statistics Canada. 2004. "Discretionary Release Policy." Statistics Canada Policy Manual. Section 4.3. Last updated March 4, 2009.

UN Economic Commission for Europe. 2007. Managing Statistical Confidentiality and Microdata Access – Principles and Guidelines of Good Practice. United Nations, Geneva.

Willenborg, L. and T. de Waal. 1996. Statistical Disclosure Control in Practice. Springer Verlag. Lecture Notes in Statistics. Vol. 111.

Willenborg, L. and T. de Waal. 2000. Elements of Statistical Disclosure Control. Springer Verlag. Lecture Notes in Statistics. Vol. 155.