Scope and purpose
Statistics Canada is obligated by law to protect the confidentiality
of respondents' information. Disclosure control refers
to the measures taken to protect the Agency's data in such a way that
confidentiality requirements are not violated. The direct impact of
disclosure control activities on the quality of the data is usually a
limiting one, in that some data detail may have to be suppressed or modified.
The goal is thus to ensure that the confidentiality protection provisions
are met while preserving the usefulness of the data outputs to the greatest
extent possible. Statistics Canada's vigilant disclosure control and
confidentiality protection program has made a significant contribution
to the quality of the Agency's data, to the high response rates the
Agency's surveys enjoy, and to the public's confidence in the
Agency as a whole.
Principles
The principles of disclosure control activities are governed, almost entirely,
by the legal provisions of the Statistics Act (1970, R.S.C. 1985,
c. S-19), specifically by subsection 17(1), which reads as follows:
"no person who has been sworn in under section 6 shall disclose
or knowingly cause to be disclosed, by any means, any information obtained
under this Act in such a manner that it is possible from the disclosure
to relate the particulars obtained from any individual return to any
identifiable individual person, business or organization."
However, subsection 17(2) does provide for the release of selected types
of confidential information at the discretion of the Chief Statistician
and by order. The most common types of such releases are lists of businesses
with their addresses and industrial classifications or information relating
to an individual respondent if that respondent has consented to the disclosure
in writing. The release of information using the Chief Statistician's
discretion is governed by the Policy
on Discretionary Release (Statistics Canada, 1993a) and, in some cases,
by the Guidelines on the Release of Unscreened Microdata under the Terms
of Section 12 Data Sharing Agreements or Discretionary Release Provisions.
The confidentiality provisions of the Statistics Act are extremely rigorous.
Consequently, the translation of their meaning to specific applications
is, in practice, a difficult but extremely important task. The primary
goal is to ensure that no identifiable individual return's data can be
inferred to within a narrow range. Furthermore, it is necessary to protect
information whether or not it concerns something likely to be considered
sensitive by respondents; thus, basic demographic characteristics must
be protected, just as much as income. It is important to note that there
is no reference in the legislation to any time limits on the protection
of information from disclosure. As well, the public perception that the
Agency is vigilant in protecting the confidentiality of its data holdings
is as important as the reality of what the Agency actually does to protect
respondents' data from being disclosed.
Guidelines
- Distinguish between tabular data and microdata releases.
In the case of tabular data, the data are released in the form of statistical
tables, sometimes over many dimensions, whereas for microdata, anonymized
records for individuals are produced. Tabular data can be classified
into frequency tables or tables of magnitudes. Frequency
tables give only counts (or estimated counts) of the number of units
that fall into each of the cells of the table, whereas tables of magnitudes
give numeric (usually non-negative) values, such as means or totals
of dollar values, or number of employees in each cell. Measures that
ensure confidentiality protection for these diverse products are necessarily
very different.
- Do not release a table of magnitude data if it provides values for
cells that are considered to be sensitive. The criteria for sensitivity
are usually based on simple rules that are generally believed to guard
against disclosure of an individual respondent's characteristics.
- Determine the sensitivity of each cell. Two criteria are usually used.
One is the number of respondents in the cell, and the other is based
on measures of concentration or predominance of the distribution of
the respondents' values within the cell. An example of the former is
simply that the number of respondents in a cell must exceed some minimum
value. For many surveys, tables with cells having only three respondents
may be released. Fewer than three is unacceptable, since if there are
only two respondents, then one of the respondents could derive the value
for the other respondent by simple subtraction.
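The subtraction risk behind the minimum-count rule can be sketched in a few lines of Python (the function name, threshold parameter and dollar values are illustrative, not part of the guideline):

```python
def passes_threshold(contributors, min_count=3):
    """Hypothetical check: publish a cell only if it has at least
    min_count respondents (three, in the rule described above)."""
    return len(contributors) >= min_count

# With only two respondents, either one can recover the other's value
# by subtracting their own contribution from the published cell total:
cell = [40_000, 25_000]
print(sum(cell) - cell[0])  # prints 25000: respondent 1 learns respondent 2's value
```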
- There are many cell suppression rules that are based on measures of
concentration. Determine which concentration measure is to be used.
The easiest ones to implement are rules that are based on linear combinations
of order statistics. One such common rule is known as the (n,k)
rule. In this case, a cell is sensitive if the largest n respondents
in it account for at least k% of the total cell value. Often more than
one value of n is controlled, say n=1 or 2. In some cases, different
values of k are used according to the number of respondents in the cell,
but this is not advisable: because such rules are discontinuous, the
addition of a new respondent with a negligible contribution could change
a sensitive cell into a non-sensitive one, which is intuitively
unreasonable.
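A minimal sketch of the (n,k) rule just described (the function name and the test values are illustrative):

```python
def nk_sensitive(values, n, k):
    """(n,k) dominance rule sketch: the cell is sensitive if the n
    largest contributions account for at least k% of the cell total."""
    top_n = sum(sorted(values, reverse=True)[:n])
    return top_n >= (k / 100.0) * sum(values)

# One contributor supplies 85 of a 100 total, so the cell fails (1, 80):
print(nk_sensitive([85, 5, 5, 5], n=1, k=80))     # True  (sensitive)
print(nk_sensitive([30, 30, 20, 20], n=1, k=80))  # False
```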
- The p-percent rule is also based on a measure of concentration
(Subcommittee on Disclosure Limitation Methodology, 1994). It is meant
to ensure that a coalition of units, typically the unit with the second-largest
value, cannot estimate the largest unit’s value too closely. An
example of such a rule, with p=15, would be to declare a cell sensitive
if the sum of the values of the third largest and all lower ranking
respondents' values was less than 15% of the largest respondent’s
value. An extension of the p-percent rule is the pq rule, where
the value q (p<q<100) represents the organization’s estimate
as to how accurately respondents can determine other units’ values.
A pq rule with p=15 and q=60 is equivalent to a p-percent rule with
p=25.
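The p-percent rule as described above, and its pq extension, can be sketched as follows (function names are illustrative; the final example checks the stated equivalence of a pq rule with p=15, q=60 and a p-percent rule with p=25):

```python
def p_percent_sensitive(values, p):
    """p-percent rule sketch: the cell is sensitive if the sum of the
    third-largest and all lower-ranking contributions is less than
    p% of the largest contribution."""
    v = sorted(values, reverse=True)
    remainder = sum(v[2:])          # third largest and all smaller
    return remainder < (p / 100.0) * v[0]

def pq_sensitive(values, p, q):
    """pq rule sketch: equivalent to a p-percent rule with p' = 100*p/q."""
    return p_percent_sensitive(values, 100.0 * p / q)

vals = [100, 50, 24]  # remainder 24 < 25% of 100, so sensitive
print(p_percent_sensitive(vals, 25), pq_sensitive(vals, 15, 60))  # True True
```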
- Determine if zero frequency cells represent a problem. Zero frequency
cells may reveal sensitive information in tables of magnitude data.
- Delete sensitive cells from a table. Such corrective action is known
as cell suppression. A problem arises, however, because suppressing
only the sensitive cells is often not sufficient when marginal totals
are also released, because it may be possible to obtain the exact value
of the suppressed cell by solving a system of linear equations. Even
if this is not possible, one can derive a range of values for the suppressed
cell, through linear programming methods, and this range may be deemed
to be too narrow to give ample protection to the suppressed value. As
a result, find complementary cells to suppress in order to protect the
sensitive cell. The problem of finding complementary cells is further
complicated by the possible presence of hierarchies in the table classification
variables (e.g., different levels of industrial coding) and the output
of sets of related tables. Sophisticated software exists to identify
complementary cells, although not all such packages address the issues
of hierarchies and related tables adequately.
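A toy 2x2 table (all numbers hypothetical) shows why suppressing only the sensitive cell is not enough when marginal totals are published:

```python
# Toy 2x2 table of magnitudes. The interior cell (1,1) is suppressed,
# but the marginal totals are published, so the cell's value can be
# recovered exactly by simple subtraction along its row or column:
row_totals = [70, 30]
col_totals = [60, 40]
published = {(0, 0): 45, (0, 1): 25, (1, 0): 15}  # cell (1,1) withheld

recovered_from_row = row_totals[1] - published[(1, 0)]
recovered_from_col = col_totals[1] - published[(0, 1)]
print(recovered_from_row, recovered_from_col)  # 15 15 -- fully disclosed
```

To prevent this, a complementary suppression would have to withhold at least one additional cell in the same row and one in the same column, so that the system of linear equations implied by the marginals no longer determines the sensitive value exactly.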
- Consider alternative methods to cell suppression. One method is to
change the row and column definitions, by collapsing categories,
by regrouping or by top coding the category values, so that
none (or fewer) of the cells are sensitive. Other possible methods include
perturbing data through the addition of noise to the microdata, or the
addition of noise to the tabular data, such as rounding. Any procedure
to make the underlying microdata file safe could be used to protect
the tabular data, and then all tabulations would be run from the "safe"
microdata file.
- Rounding the cell values can take a number of different forms. Often
conventional or deterministic rounding will not add enough noise to
give sufficient protection. Consider the use of random rounding.
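One common unbiased variant rounds each value to an adjacent multiple of the base, with probabilities chosen so the expected value is unchanged; a minimal sketch (the function name and base are illustrative):

```python
import random

def random_round(x, base=5, rng=random):
    """Unbiased random rounding sketch: round x to a multiple of
    `base`, rounding up with probability (x mod base)/base so that
    the expected value of the result equals x."""
    r = x % base
    if r == 0:
        return x
    return x - r + (base if rng.random() < r / base else 0)

print(random_round(13, base=5))  # 10 or 15, at random
```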
- In frequency tables, low frequency cells may be problematic. Individuals
in such cells may be easily identified, so that it becomes known that
all other members of the population belong to some other cell. It is
certainly true that if only one cell in a given row or column is non-zero,
and the membership of such a row or column is known, then disclosure
has taken place. When necessary, implement controls to prevent the distributions
for given rows or columns from being concentrated in a small number
of categories. In particular, when columns (or rows) define ranges of
a magnitude variable, say income, ensure that the nonzero cells in each
row (column) span a sufficiently large range of possible values for
income.
- Techniques for reducing the disclosure risk in frequency tables include
all those used for magnitude tables, that is, cell suppression; changing
the row and column definitions by collapsing categories or by regrouping
or top coding the category values; perturbing data through the addition
of noise to the microdata or the addition of noise to the tabular data,
such as rounding; and other procedures that make the microdata file
from which the tabulations are run safe from disclosure.
- Ensure that all releases of public use microdata files are reviewed
by the Microdata Release Committee (Statistics Canada, 1987).
- In the case of microdata releases, individual records rather than
aggregated data are being published, and the disclosure criteria for
such files are very different. Even though microdata files do not contain
identifying information such as names and telephone numbers, they contain
a number of variables, called key variables, that, in combination,
can serve to identify unique individuals in the population who may be
on the file. Identification would be equivalent to a disclosure of the
microdata characteristics for these individuals. Note that, even if
the individuals identified are not truly unique, or if they have been
wrongly identified, the appearance of a disclosure can sometimes be
as harmful to the Agency as an actual case of disclosure.
- Assess the risk of disclosure for microdata files. The number and
nature of key variables can affect the disclosure risk. Some identifying
characteristics, such as detailed geography or exact income, are considered
to present a higher disclosure risk. On the other hand, a lower level
of quality, such as the presence of measurement errors or of imputed
values, can lower the risks associated with certain characteristics.
Disclosure risks increase with the sampling rate, and microdata should
not be released for a 100% sample. Similarly, microdata files should
not contain 100% samples within identifiable strata or sub-groups. Characteristics
of the surveyed population itself can also affect the disclosure risk.
Microdata files for businesses are rarely released because of the concentrated
nature of business data. The presence of hierarchical relations between
units can also affect the disclosure risk.
- There are two general methods to control the disclosure risk for microdata
files. Data reduction methods include sampling, ensuring that
the populations for certain identifiable groups are sufficiently large,
making the variable categories wider, top and bottom coding, removing
some of the variables from some respondents, or removing some of the
respondents from the file. Data modification methods include
adding random noise to the microdata, data swapping, replacing small
groups with average values, or deleting information from some respondents
and replacing it with imputed values.
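Two of these operations, top coding (data reduction) and the addition of random noise (data modification), can be sketched as follows (function names, the cap, and the noise distribution are all illustrative choices, not prescribed by the guideline):

```python
import random

def top_code(value, cap):
    """Data reduction sketch: collapse all values above `cap`
    into the cap itself (top coding)."""
    return min(value, cap)

def add_noise(value, scale, rng=random):
    """Data modification sketch: perturb a value with symmetric
    random noise, here uniform on [-scale, scale]."""
    return value + rng.uniform(-scale, scale)

incomes = [30_000, 55_000, 250_000]
print([top_code(x, 100_000) for x in incomes])  # [30000, 55000, 100000]
```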
- An even more difficult problem arises when dealing with strategies
to release microdata files from longitudinal surveys. In this case,
determine an appropriate strategy before the longitudinal survey has
run its full course. This implies that the strategy must be defined
in the absence of the full survey results, that is, prior to collecting
the data for future waves of the survey. Since one of the objectives
of this strategy is to define the variables to be released and their
respective categorization, certain assumptions need to be made about
how these variables evolve over time, and whether this evolution can
lead to certain variables becoming key variables.
- Data reduction and data modification methods are known as restricted
data methods. As an alternative to releasing microdata files, consider
using restricted access methods such as remote access or research data
centres. Under remote access, researchers do not have direct access
to the Agency’s survey data, but they can e-mail an analytical
program that is run on the microdata residing within the Agency. The
program outputs are screened by Agency staff and, if they present no
disclosure risk, they are e-mailed to the researcher. Statistics Canada’s
Research Data Centres are secure settings where researchers with approved
projects and who are “sworn-in” as deemed employees under
the Statistics Act can have access to confidential microdata.
The centres operate like extensions of Statistics Canada and are staffed
by full-time Statistics Canada employees. Only non-confidential results
are allowed to leave the centres.
- Although there are many rules for ensuring confidentiality protection,
the rules cannot replace common sense. For example, rules to avoid all
residual disclosures resulting from multiple releases from the same
basic database are difficult to define, especially in the case of ad
hoc requests, so that some manual intervention becomes necessary. There
are still many unanswered questions in this area, and research is needed
to ensure that as much data can be released as possible, without violating
the confidentiality requirements.
- Use generalized disclosure control software instead of custom-built
systems whenever possible. Possible software packages to use include
the Agency's cell suppression software, CONFID (Statistics Canada, 2002d),
or the τ-ARGUS software (Hundepool et al., 2002). By using generalized
systems, one can expect fewer programming errors, as well as some reduction
in development costs and time.
- Make use of resources available within Statistics Canada on matters
of confidentiality when necessary. Consult the Data Access and Control
Services Division on policy matters relating to the confidentiality
of the information collected by Statistics Canada, the Confidentiality
and Legislation Committee and its subcommittees: Discretionary Release
Committee, Disclosure Review Committee, and Microdata Release Committee
on issues related to disclosure control strategies and practices, and
the Disclosure Control Resource Centre for technical assistance.
References
Brackstone, G. and White, P. (2002). Data stewardship at Statistics Canada.
Proceedings of the Social Statistics Section, American
Statistical Association, 284-293.
Doyle, P., Lane, J.I., Theeuwes, J.J.M. and Zayatz, L.V. (eds.) (2001).
Confidentiality, Disclosure, and Data Access: Theory and Practical
Applications for Statistical Agencies. North-Holland.
Eurostat (1996). Manual on Disclosure Control Methods.
Luxembourg: Office for Official Publications of the European Communities.
Hundepool, A., van de Wetering, A., de Wolf, P.-P., Giessing, S., Fischetti,
M., Salazar, J.-J. and Caprara, A. (2002). τ-ARGUS user manual 2.1. Statistics
Netherlands, Voorburg. See also http://neon.vb.cbs.nl/casc.
Statistics Canada (1970). The Statistics Act. Ottawa,
Canada.
Statistics Canada (1987). Policy
on Microdata Release. Policy Manual, 4.2.
Statistics Canada (1993a). Discretionary
Release Policy. Policy Manual, 4.3.
Statistics Canada (2002d). User's Guide – Generalized Sensitivity
Analysis and Protection System. Internal document, System Development
Division.
Subcommittee on Disclosure Limitation Methodology, Federal Committee
on Statistical Methodology (1994). Report on statistical disclosure limitation
methodology. Statistical Policy Working Paper 22, Office of Management
and Budget, Washington, DC.
Willenborg, L. and de Waal, T. (1996). Statistical Disclosure
Control in Practice. Lecture Notes in Statistics, Springer-Verlag,
New York.
Willenborg, L. and de Waal, T. (2001). Elements of Statistical
Disclosure Control. Lecture Notes in Statistics, Springer-Verlag,
New York.