A layered perturbation method for the protection of tabular outputs
Section 3. The Layered Perturbation Method (LPM)
3.1 Description
The LPM is a perturbative method for totals that focuses on disclosure from differencing. When used in tables of magnitude, it allows cell suppression to be restricted to sensitive cells. Three basic ideas underlie the LPM. The first two are similar to the TCM approach.
The first basic idea is the attachment of pseudo-random hash numbers (PRNs) to units to produce consistent perturbation outcomes when needed. This discourages the use of repeated queries to improve the estimation of unperturbed totals. The EZS
method is used to multiply the value of a unit
by
a weight
with
as above. To obtain consistent results
are
generated from a unit-specific PRN that is uniformly distributed over
For
example, use
where
are
generated from the Social Insurance Number (e.g.,
for
a
large prime). Using
will
always perturb unit
the same
way. To perturb unit
the same
way only when it appears in the same cell total, generate cell-unit level noise
from
where
Primes are used to designate cell-unit specific
noises and perturbations. All noise values are derived from
or
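The consistency idea above can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's formulas: the multiplier `LARGE_PRIME`, the modulus `MODULUS`, the function names `unit_prn` and `cell_unit_prn`, and the rule for combining unit hashes into a cell-unit value are all hypothetical. The sketch only shows the two properties described in the text: a unit-level value that never changes for a given identifier, and a cell-unit value that changes whenever the membership of the cell changes.

```python
# Hypothetical constants, chosen only for illustration.
LARGE_PRIME = 2_147_483_647  # 2**31 - 1, a known Mersenne prime
MODULUS = 10**9              # assumed size of the identifier space


def _unit_hash(unit_id: int) -> int:
    """Integer hash of a unit identifier (assumed scheme)."""
    return (unit_id * LARGE_PRIME) % MODULUS


def unit_prn(unit_id: int) -> float:
    """Unit-level PRN in [0, 1): always the same for a given identifier."""
    return _unit_hash(unit_id) / MODULUS


def cell_unit_prn(unit_id: int, cell_unit_ids) -> float:
    """Cell-unit PRN in [0, 1): the same only when the unit appears
    in a cell with exactly the same set of units (assumed combining rule)."""
    # Integer arithmetic keeps the result independent of the order of units.
    cell_hash = sum(_unit_hash(u) for u in set(cell_unit_ids)) % MODULUS
    return ((_unit_hash(unit_id) + cell_hash) % MODULUS) / MODULUS
```

With this sketch, `cell_unit_prn(7, [7, 8, 9])` equals `cell_unit_prn(7, [9, 8, 7])` but differs from `cell_unit_prn(7, [7, 8, 9, 5])`, because adding unit 5 changes the cell-level hash.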
The second idea is the application of perturbation to units in each cell by layers. The four largest units are perturbed in a random but consistent manner using perturbation weights generated from
The next
largest units, say units 5 to 9, are perturbed in a semi-consistent manner.
Their perturbation is a mixture of unit specific weights
and
unit-cell specific weights
Smallest
units are not perturbed. Their values are protected from differencing by the
unit-cell perturbations of units 5 to 9 since adding or removing a unit in a
cell, no matter how small, will affect the
for those
units. The number of units per layer is flexible; we have found that four and five units, respectively, gave satisfactory results.
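The layering described above can be sketched as a simple ranking step. The function name `assign_layers`, the layer labels, and the dictionary input are assumptions made for illustration; only the layer sizes (four and five) come from the text.

```python
def assign_layers(values, n_top=4, n_mid=5):
    """Rank the units of one cell by value and assign each to a layer.

    values: dict mapping a unit identifier to its value in the cell.
    Returns a dict mapping each unit identifier to one of
    "consistent" (largest n_top units), "semi-consistent"
    (next n_mid units) or "unperturbed" (all remaining units).
    """
    ranked = sorted(values, key=lambda unit: values[unit], reverse=True)
    layers = {}
    for rank, unit in enumerate(ranked, start=1):
        if rank <= n_top:
            layers[unit] = "consistent"
        elif rank <= n_top + n_mid:
            layers[unit] = "semi-consistent"
        else:
            layers[unit] = "unperturbed"
    return layers
```

For a cell with eleven units, the four largest fall in the first layer, units ranked 5 to 9 in the second, and the two smallest are left unperturbed.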
A third set of measures mostly targets the issue of differencing. The direction of noise for even-ranked units is reversed
are set
from
to
increase variances of differences when a top-ranked unit is changed. For units
5 to 9 a random mixture of
and
is
applied to lessen the risk when a small unit is added or removed. Finally, the
noise for the top three units is amplified in nonsensitive cells with greater
dominance. This allows lower levels of noise to be used generally, reducing the
overall impact of the perturbation on data quality.
A suggested
application of the LPM would consist of suppressing all sensitive and small
cells (e.g.,
and
perturbing remaining cells. Because of the protection offered by perturbation,
cells that are slightly sensitive may also be publishable. For other cells with
cell total
set
perturbed value
as
and
are set to increase the noise of
when needed (set
and
The
are
random variables that are independent of
e.g.,
or
3.2 Some results
Let
i.i.d.
and let
and
be fixed
(for now). It follows that:
Let
and
equal
and
for the cell after removing units 1, 2 and 3,
respectively. Keeping subscripts from the original cell (i.e., subscript 2
refers to the unit that was second in
we have:
We can obtain
for other units similarly. If we estimate the
dropped units as
it can
be shown that, with
If we assume
that
and
are
fixed we can set them based on some requirement for
For example, we may want to have
since,
for
which for
gives
To obtain
we
can solve (fixed)
and
in
reverse order. This gives
In practice,
and
are
bounded below at 1 and above at some threshold value less than 2, and
is
bounded below at 1 and can taper off above the threshold. Also, the target
values of
and
depend
on the situation in each cell. Here, for simplicity of illustration, they were
assumed not to change when we removed observations from the cell.
Using the same
noise and changing its direction for even-ranked units means that we take
advantage of the correlation between the
and
to
increase the variance of
For example, the contribution to
from
unit 2 is
If we
had used independent (or unit-cell specific) noises
instead of
for units 1 to 4 the contribution from unit 2
would have been only
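The effect of reusing the same noise with a reversed direction can be illustrated with a toy simulation. This is a sketch under simplifying assumptions, not the paper's expression: unit 2 has rank 2 in the full cell and rank 1 once unit 1 is removed, its noise is uniform on (-1, 1), and a single hypothetical amplification factor `a` applies in both cells. The function name and all parameter values are illustrative.

```python
import random
from statistics import pvariance


def unit2_contribution_variances(y2=100.0, a=1.0, n_draws=100_000, seed=12345):
    """Variance of unit 2's contribution to the difference between the
    perturbed total of the full cell and that of the cell without unit 1.

    Returns (variance with same noise and reversed sign,
             variance with independent noises).
    """
    rng = random.Random(seed)
    same_noise, independent_noise = [], []
    for _ in range(n_draws):
        e = rng.uniform(-1.0, 1.0)        # noise reused in both cells
        e_other = rng.uniform(-1.0, 1.0)  # fresh noise for the reduced cell
        full_cell = -a * e * y2           # rank 2: direction reversed
        # Reduced cell, unit 2 now has rank 1: direction not reversed.
        same_noise.append(full_cell - a * e * y2)
        independent_noise.append(full_cell - a * e_other * y2)
    return pvariance(same_noise), pvariance(independent_noise)
```

Under these assumptions the first variance is about twice the second, since the reused noise enters the difference with the same sign twice instead of averaging out.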
3.3 Comparison with the EZS and TCM approaches
With EZS the
perturbed cell total is simply
giving
For any unit
we
have
and
which is smaller than the equivalent variance
with the LPM for the same level of noise
even when we set
A
possible exception could be unit 5, if subsequent units are relatively quite
small. This can be seen by examining
above.
The TCM applies three multiplicative perturbation factors to the largest units, say four, in each cell. A magnitude component
determines
the relative size of the perturbation for the
ranked
unit. The
are
fixed; typically
e.g.,
A
permanent random factor
fixes
the direction of the noise for each unit
A
pseudo-random factor
determines
unit-cell specific noises. This gives
The method can be represented in a form
comparable to LPM, with
and
The
way the
are
fixed is a major difference from the LPM that greatly diminishes the protection
offered to
To
illustrate this, consider two adaptations of these methods that yield identical
variances for
where the
same notational conventions as before are used, with fixed
This
yields
Not only are factors such as
larger
than
but the
variance for the noise,
is often
replaced with that of the absolute noise,
which is
much smaller. For the split triangular distribution it goes from
When
this
means dropping from
This is not a legitimate comparison of the two
methods. We are not using the actual LPM, and method parameters need not be
identical. But it shows the impact of the different approaches taken for the
ISSN : 1492-0921
Copyright
Published by authority of the Minister responsible for Statistics Canada.
© Minister of Industry, 2017
Use of this publication is governed by the Statistics Canada Open Licence Agreement.
Catalogue No. 12-001-X
Frequency: semi-annual
Ottawa