Scope and purpose
Statistics Canada is obligated by law to protect the confidentiality
of respondents' information. Disclosure control refers
to the measures taken to protect the Agency's data in such a way that
confidentiality requirements are not violated. The direct impact of
disclosure control activities on the quality of the data is usually a
limiting one, in that some data detail may have to be suppressed or modified.
The goal is thus to ensure that the confidentiality protection provisions
are met while preserving the usefulness of the data outputs to the greatest
extent possible. Statistics Canada's vigilant disclosure control and
confidentiality protection program has made a significant contribution
to the quality of the Agency's data, to the high response rates the
Agency's surveys enjoy, and to the public's confidence in the
Agency as a whole.
Principles
The principles of disclosure control activities are governed, almost entirely,
by the legal provisions of the Statistics Act (1970, R.S.C. 1985,
c. S-19), specifically by subsection 17(1), which reads as follows:
"no person who has been sworn in under section 6 shall disclose
or knowingly cause to be disclosed, by any means, any information obtained
under this Act in such a manner that it is possible from the disclosure
to relate the particulars obtained from any individual return to any
identifiable individual person, business or organization."
However, subsection 17(2) does provide for the release of selected types
of confidential information at the discretion of the Chief Statistician
and by order. The most common types of such releases are lists of businesses
with their addresses and industrial classifications or information relating
to an individual respondent if that respondent has consented to the disclosure
in writing. The release of information using the Chief Statistician's
discretion is governed by the Policy
on Discretionary Release (Statistics Canada, 1993a) and, in some cases,
by the Guidelines on the Release of Unscreened Microdata under the Terms
of Section 12 Data Sharing Agreements or Discretionary Release Provisions.
The confidentiality provisions of the Statistics Act are extremely rigorous.
Consequently, the translation of their meaning to specific applications
is, in practice, a difficult but extremely important task. The primary
goal is to ensure that no identifiable individual return's data can be
inferred to within a narrow range. Furthermore, it is necessary to protect
information whether or not it concerns something likely to be considered
sensitive by respondents; thus, basic demographic characteristics must
be protected, just as much as income. It is important to note that there
is no reference in the legislation to any time limits on the protection
of information from disclosure. As well, the public perception that the
Agency is vigilant in protecting the confidentiality of its data holdings
is as important as the reality of what the Agency actually does to protect
respondents' data from being disclosed.
Guidelines
- Distinguish between tabular data and microdata releases.
In the case of tabular data, the data are released in the form of statistical
tables, sometimes over many dimensions, whereas for microdata, anonymized
records for individuals are produced. Tabular data can be classified
into frequency tables or tables of magnitudes. Frequency
tables give only counts (or estimated counts) of the number of units
that fall into each of the cells of the table, whereas tables of magnitudes
give numeric (usually non-negative) values, such as means or totals
of dollar values, or number of employees in each cell. Measures that
ensure confidentiality protection for these diverse products are necessarily
very different.
- Do not release a table of magnitude data if it provides values for
cells that are considered to be sensitive. The criteria for sensitivity
are usually based on simple rules that are generally believed to guard
against disclosure of an individual respondent's characteristics.
- Determine the sensitivity of each cell. Two criteria are usually used.
One is the number of respondents in the cell, and the other is based
on measures of concentration or predominance of the distribution of
the respondents' values within the cell. An example of the former is
simply that the number of respondents in a cell must exceed some minimum
value. For many surveys, tables with cells having only three respondents
may be released. Fewer than three is unacceptable, since if there are
only two respondents, then one of the respondents could derive the value
for the other respondent by simple subtraction.
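The subtraction risk behind the minimum-count rule can be sketched in a few lines of Python (the function name, threshold parameter and dollar values are illustrative, not part of the guideline):

```python
def passes_threshold(contributors, min_count=3):
    """Hypothetical check: publish a cell only if it has at least
    min_count respondents (three, in the rule described above)."""
    return len(contributors) >= min_count

# With only two respondents, either one can recover the other's value
# by subtracting their own contribution from the published cell total:
cell = [40_000, 25_000]
print(sum(cell) - cell[0])  # prints 25000: respondent 1 learns respondent 2's value
```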
- There are many cell suppression rules that are based on measures of
concentration. Determine which concentration measure is to be used.
The easiest ones to implement are rules that are based on linear combinations
of order statistics. One such common rule is known as the (n,k)
rule. In this case, a cell is sensitive if the largest n respondents
in it account for at least k% of the total cell value. Often more than
one value of n is controlled, say n=1 or 2. In some cases, different
values of k are used according to the number of respondents in the cell,
but this is not advisable: because such rules are discontinuous, the
addition of a new respondent with a negligible contribution could change
a sensitive cell into a non-sensitive one, which is intuitively
unreasonable.
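A minimal sketch of the (n,k) rule just described (the function name and the test values are illustrative):

```python
def nk_sensitive(values, n, k):
    """(n,k) dominance rule sketch: the cell is sensitive if the n
    largest contributions account for at least k% of the cell total."""
    top_n = sum(sorted(values, reverse=True)[:n])
    return top_n >= (k / 100.0) * sum(values)

# One contributor supplies 85 of a 100 total, so the cell fails (1, 80):
print(nk_sensitive([85, 5, 5, 5], n=1, k=80))     # True  (sensitive)
print(nk_sensitive([30, 30, 20, 20], n=1, k=80))  # False
```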
- The p-percent rule is also based on a measure of concentration
(Subcommittee on Disclosure Limitation Methodology, 1994). It is meant
to ensure that a coalition of units, typically the unit with the second-largest
value, cannot estimate the largest unit’s value too closely. An
example of such a rule, with p=15, would be to declare a cell sensitive
if the sum of the values of the third largest and all lower ranking
respondents' values was less than 15% of the largest respondent’s
value. An extension of the p-percent rule is the pq rule, where
the value q (p<q<100) represents the organization’s estimate
as to how accurately respondents can determine other units’ values.
A pq rule with p=15 and q=60 is equivalent to a p-percent rule with
p=25.
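The p-percent rule as described above, and its pq extension, can be sketched as follows (function names are illustrative; the final example checks the stated equivalence of a pq rule with p=15, q=60 and a p-percent rule with p=25):

```python
def p_percent_sensitive(values, p):
    """p-percent rule sketch: the cell is sensitive if the sum of the
    third-largest and all lower-ranking contributions is less than
    p% of the largest contribution."""
    v = sorted(values, reverse=True)
    remainder = sum(v[2:])          # third largest and all smaller
    return remainder < (p / 100.0) * v[0]

def pq_sensitive(values, p, q):
    """pq rule sketch: equivalent to a p-percent rule with p' = 100*p/q."""
    return p_percent_sensitive(values, 100.0 * p / q)

vals = [100, 50, 24]  # remainder 24 < 25% of 100, so sensitive
print(p_percent_sensitive(vals, 25), pq_sensitive(vals, 15, 60))  # True True
```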
- Determine if zero frequency cells represent a problem. Zero frequency
cells may reveal sensitive information in tables of magnitude data.
- Delete sensitive cells from a table. Such corrective action is known
as cell suppression. A problem arises, however, because suppressing
only the sensitive cells is often not sufficient when marginal totals
are also released, because it may be possible to obtain the exact value
of the suppressed cell by solving a system of linear equations. Even
if this is not possible, one can derive a range of values for the suppressed
cell, through linear programming methods, and this range may be deemed
to be too narrow to give ample protection to the suppressed value. As
a result, find complementary cells to suppress in order to protect the
sensitive cell. The problem of finding complementary cells is further
complicated by the possible presence of hierarchies in the table classification
variables (e.g., different levels of industrial coding) and the output
of sets of related tables. Sophisticated software exists to identify
complementary cells, although not all such packages address the issues
of hierarchies and related tables adequately.
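A toy 2x2 table (all numbers hypothetical) shows why suppressing only the sensitive cell is not enough when marginal totals are published:

```python
# Toy 2x2 table of magnitudes. The interior cell (1,1) is suppressed,
# but the marginal totals are published, so the cell's value can be
# recovered exactly by simple subtraction along its row or column:
row_totals = [70, 30]
col_totals = [60, 40]
published = {(0, 0): 45, (0, 1): 25, (1, 0): 15}  # cell (1,1) withheld

recovered_from_row = row_totals[1] - published[(1, 0)]
recovered_from_col = col_totals[1] - published[(0, 1)]
print(recovered_from_row, recovered_from_col)  # 15 15 -- fully disclosed
```

To prevent this, a complementary suppression would have to withhold at least one additional cell in the same row and one in the same column, so that the system of linear equations implied by the marginals no longer determines the sensitive value exactly.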
- Consider alternative methods to cell suppression. One method is to
change the row and column definitions, by collapsing categories,
by regrouping or by top coding the category values, so that
none (or fewer) of the cells are sensitive. Other possible methods include
perturbing data through the addition of noise to the microdata, or the
addition of noise to the tabular data, such as rounding. Any procedure
to make the underlying microdata file safe could be used to protect
the tabular data, and then all tabulations would be run from the "safe"
microdata file.
- Rounding the cell values can take a number of different forms. Often
conventional or deterministic rounding will not add enough noise to
give sufficient protection. Consider the use of random rounding.
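One common unbiased variant rounds each value to an adjacent multiple of the base, with probabilities chosen so the expected value is unchanged; a minimal sketch (the function name and base are illustrative):

```python
import random

def random_round(x, base=5, rng=random):
    """Unbiased random rounding sketch: round x to a multiple of
    `base`, rounding up with probability (x mod base)/base so that
    the expected value of the result equals x."""
    r = x % base
    if r == 0:
        return x
    return x - r + (base if rng.random() < r / base else 0)

print(random_round(13, base=5))  # 10 or 15, at random
```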
- In frequency tables, low frequency cells may be problematic. Individuals
in such cells may be easily identified, so that it becomes known that
all other members of the population belong to some other cell. It is
certainly true that if only one cell in a given row or column is non-zero,
and the membership of such a row or column is known, then disclosure
has taken place. When necessary, implement controls to prevent the distributions
for given rows or columns from being concentrated in a small number
of categories. In particular, when columns (or rows) define ranges of
a magnitude variable, say income, ensure that the nonzero cells in each
row (column) span a sufficiently large range of possible values for
income.
- Techniques for reducing the disclosure risk in frequency tables include
all those used for magnitude tables, that is, cell suppression; changing
the row and column definitions by collapsing categories or by regrouping
or top coding the category values; perturbing data through the addition
of noise to the microdata or the addition of noise to the tabular data,
such as rounding; and other procedures that make the microdata file
from which the tabulations are run safe from disclosure.
- Ensure that all releases of public use microdata files are reviewed
by the Microdata Release Committee (Statistics Canada, 1987).
- In the case of microdata releases, individual records rather than
aggregated data are being published, and the disclosure criteria for
such files are very different. Even though microdata files do not contain
identifying information such as names and telephone numbers, they contain
a number of variables, called key variables, that, in combination,
can serve to identify unique individuals in the population who may be
on the file. Identification would be equivalent to a disclosure of the
microdata characteristics for these individuals. Note that, even if
the individuals identified are not truly unique, or if they have been
wrongly identified, the appearance of a disclosure can sometimes be
as harmful to the Agency as an actual case of disclosure.
- Assess the risk of disclosure for microdata files. The number and
nature of key variables can affect the disclosure risk. Some identifying
characteristics, such as detailed geography or exact income, are considered
to present a higher disclosure risk. On the other hand, a lower level
of quality, such as the presence of measurement errors or of imputed
values, can lower the risks associated with certain characteristics.
Disclosure risks increase with the sampling rate, and microdata should
not be released for a 100% sample. Similarly, microdata files should
not contain 100% samples within identifiable strata or sub-groups. Characteristics
of the surveyed population itself can also affect the disclosure risk.
Microdata files for businesses are rarely released because of the concentrated
nature of business data. The presence of hierarchical relations between
units can also affect the disclosure risk.
- There are two general methods to control the disclosure risk for microdata
files. Data reduction methods include sampling, ensuring that
the populations for certain identifiable groups are sufficiently large,
making the variable categories wider, top and bottom coding, removing
some of the variables from some respondents, or removing some of the
respondents from the file. Data modification methods include
adding random noise to the microdata, data swapping, replacing small
groups with average values, or deleting information from some respondents
and replacing it with imputed values.
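Two of these operations, top coding (data reduction) and the addition of random noise (data modification), can be sketched as follows (function names, the cap, and the noise distribution are all illustrative choices, not prescribed by the guideline):

```python
import random

def top_code(value, cap):
    """Data reduction sketch: collapse all values above `cap`
    into the cap itself (top coding)."""
    return min(value, cap)

def add_noise(value, scale, rng=random):
    """Data modification sketch: perturb a value with symmetric
    random noise, here uniform on [-scale, scale]."""
    return value + rng.uniform(-scale, scale)

incomes = [30_000, 55_000, 250_000]
print([top_code(x, 100_000) for x in incomes])  # [30000, 55000, 100000]
```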
- An even more difficult problem arises when dealing with strategies
to release microdata files from longitudinal surveys. In this case,
determine an appropriate strategy before the longitudinal survey has
run its full course. This implies that the strategy must be defined
in the absence of the full survey results, that is, prior to collecting
the data for future waves of the survey. Since one of the objectives
of this strategy is to define the variables to be released and their
respective categorization, certain assumptions need to be made about
how these variables evolve over time, and whether this evolution can
lead to certain variables becoming key variables.
- Data reduction and data modification methods are known as restricted
data methods. As an alternative to releasing microdata files, consider
using restricted access methods such as remote access or research data
centres. Under remote access, researchers do not have direct access
to the Agency’s survey data, but they can e-mail an analytical
program that is run on the microdata residing within the Agency. The
program outputs are screened by Agency staff and, if they present no
disclosure risk, they are e-mailed to the researcher. Statistics Canada’s
Research Data Centres are secure settings where researchers with approved
projects and who are “sworn-in” as deemed employees under
the Statistics Act can have access to confidential microdata.
The centres operate like extensions of Statistics Canada and are staffed
by full-time Statistics Canada employees. Only non-confidential results
are allowed to leave the centres.
- Although there are many rules for ensuring confidentiality protection,
the rules cannot replace common sense. For example, rules to avoid all
residual disclosures resulting from multiple releases from the same
basic database are difficult to define, especially in the case of ad
hoc requests, so that some manual intervention becomes necessary. There
are still many unanswered questions in this area, and research is needed
to ensure that as much data can be released as possible, without violating
the confidentiality requirements.
- Use generalized disclosure control software instead of custom-built
systems whenever possible. Possible software packages to use include
the Agency's cell suppression software, CONFID (Statistics Canada, 2002d),
or the τ-ARGUS software (Hundepool et al., 2002). By using generalized
systems, one can expect fewer programming errors, as well as some reduction
in development costs and time.
- Make use of resources available within Statistics Canada on matters
of confidentiality when necessary. Consult the Data Access and Control
Services Division on policy matters relating to the confidentiality
of the information collected by Statistics Canada, the Confidentiality
and Legislation Committee and its subcommittees: Discretionary Release
Committee, Disclosure Review Committee, and Microdata Release Committee
on issues related to disclosure control strategies and practices, and
the Disclosure Control Resource Centre for technical assistance.
References
Brackstone, G. and White, P. (2002). Data stewardship at Statistics Canada.
Proceedings of the Social Statistics Section, American
Statistical Association, 284-293.
Doyle, P., Lane, J.I., Theeuwes, J.J.M. and Zayatz, L.V. (eds.) (2001).
Confidentiality, Disclosure, and Data Access: Theory and Practical
Applications for Statistical Agencies. North-Holland.
Eurostat (1996). Manual on Disclosure Control Methods.
Luxembourg: Office for Official Publications of the European Communities.
Hundepool, A., van de Wetering, A., de Wolf, P.-P., Giessing, S., Fischetti,
M., Salazar, J.-J. and Caprara, A. (2002). τ-ARGUS user manual 2.1. Statistics
Netherlands, Voorburg. See also http://neon.vb.cbs.nl/casc.
Statistics Canada (1970). The Statistics Act. Ottawa,
Canada.
Statistics Canada (1987). Policy
on Microdata Release. Policy Manual, 4.2.
Statistics Canada (1993a). Discretionary
Release Policy. Policy Manual, 4.3.
Statistics Canada (2002d). User's Guide – Generalized Sensitivity
Analysis and Protection System. Internal document, System Development
Division.
Subcommittee on Disclosure Limitation Methodology, Federal Committee
on Statistical Methodology (1994). Report on statistical disclosure limitation
methodology. Statistical Policy Working Paper 22, Office of Management
and Budget, Washington, DC.
Willenborg, L. and de Waal, T. (1996). Statistical Disclosure
Control in Practice. Lecture Notes in Statistics, Springer-Verlag,
New York.
Willenborg, L. and de Waal, T. (2001). Elements of Statistical
Disclosure Control. Lecture Notes in Statistics, Springer-Verlag,
New York.