A layered perturbation method for the protection of tabular outputs
Section 3. The Layered Perturbation Method (LPM)

3.1 Description

The LPM is a perturbative method for totals that focuses on disclosure from differencing. When used in tables of magnitude it allows cell suppression to be restricted to sensitive cells. Three basic ideas underlie the LPM. The first two are similar to the TCM approach.

The first basic idea is the attachment of pseudo-random hash numbers (PRNs) to units to produce consistent perturbation outcomes when needed. This discourages the use of repeated queries to improve the estimation of unperturbed totals. The EZS method is used to multiply the value of a unit $i$ by a weight $w_{i} = 1 + ε_{i},$ with $ε_{i} \sim (0, σ_{ε}^{2})$ as above. To obtain consistent results, the $ε_{i}$ are generated from a unit-specific PRN that is uniformly distributed over $[0, 1).$ For example, use $h_{i} / 1000,$ where the $h_{i}$ are generated from the Social Insurance Number (e.g., $h_{i} = \mathrm{Mod}(\mathrm{SIN}_{i} \cdot P, 1000)$ for $P$ a large prime). Using $h_{i}$ will always perturb unit $i$ the same way. To perturb unit $i$ the same way only when it appears in the same cell total, generate cell-unit level noise $w_{i'} = 1 + ε_{i'}$ from $h_{i'} = \mathrm{Mod}(h_{i} + h_{\mathrm{tot}}, 1000) / 1000,$ where $h_{\mathrm{tot}} = \sum_{i \in \mathrm{cell}} h_{i}.$ Primes are used to designate cell-unit specific noises and perturbations. All noise values are derived from $h_{i}$ or $h_{i'}.$
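The hashing step can be sketched as follows. This is only an illustration of the two formulas above: the multiplier `P`, the identifier values and the function names are assumptions made for the example, not values taken from the method itself.

```python
# Sketch of the consistent PRN scheme; P and the SIN values are made-up examples.
P = 999_983  # assumed "large prime" multiplier (illustrative choice)


def unit_hash(sin: int, p: int = P) -> int:
    """h_i = Mod(SIN_i * P, 1000): depends on the unit only."""
    return (sin * p) % 1000


def cell_unit_prn(h_i: int, cell_hashes: list[int]) -> float:
    """h_i' = Mod(h_i + h_tot, 1000) / 1000, with h_tot summed over the cell."""
    h_tot = sum(cell_hashes)
    return ((h_i + h_tot) % 1000) / 1000


sins = [123456789, 987654321, 555555555]   # hypothetical identifiers
hashes = [unit_hash(s) for s in sins]      # [587, 543, 565]

# Same unit, same cell composition: the cell-unit PRN is reproduced exactly.
same_cell = cell_unit_prn(hashes[0], hashes)
# Same unit, one more unit in the cell: h_tot changes, and so does the PRN.
bigger_cell = cell_unit_prn(hashes[0], hashes + [unit_hash(111111111)])
print(hashes, same_cell, bigger_cell)
```

The unit-level hash never changes for a given unit, whereas the cell-unit PRN moves whenever the added unit changes $h_{\mathrm{tot}}$ modulo 1000. How a PRN is then mapped to a noise value $ε$ is left open here.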

The second idea is the application of perturbation to units in each cell by layers. The largest four units are perturbed in a random but consistent manner using perturbation weights $w_{i}$ generated from $h_{i}.$ The next largest units, say units 5 to 9, are perturbed in a semi-consistent manner. Their perturbation is a mixture of unit specific weights $w_{i}$ and unit-cell specific weights $w_{i'}.$ The smallest units are not perturbed. Their values are protected from differencing by the unit-cell perturbations of units 5 to 9, since adding or removing a unit in a cell, no matter how small, will affect the $w_{i'}$ for those units. The number of units per layer is flexible; we have found that four and five units, respectively, gave satisfactory results.

A third set of measures mostly targets the issue of differencing. The direction of noise for even-ranked units is reversed (the $w_{i}$ are set from ${(- 1)}^{i + 1} ε_{i}$) to increase variances of differences when a top-ranked unit is changed. For units 5 to 9 a random mixture of $w_{i}$ and $w_{i'}$ is applied to lessen the risk when a small unit is added or removed. Finally, the noise for the top three units is amplified in nonsensitive cells with greater dominance. This allows lower levels of noise to be used generally, reducing the overall impact of the perturbation on data quality.

A suggested application of the LPM would consist of suppressing all sensitive and small cells (e.g., $n < 10$) and perturbing remaining cells. Because of the protection offered by perturbation, cells that are slightly sensitive may also be publishable. For other cells with cell total $X = \sum_{i \in \mathrm{cell}} x_{i},$ set perturbed value $Z$ as

$Z = X + K ε_{1} x_{1} - L ε_{2} x_{2} + M ε_{3} x_{3} - ε_{4} x_{4} - \sum_{i = 5}^{9} \left\{ {(- 1)}^{i} α_{i} ε_{i} - (1 - α_{i}) ε_{i'} \right\} x_{i} .$

$K, L$ and $M$ are set to increase the noise of $Z$ when needed (set $K, L$ and $M \geq 1$). The $α_{i}$ are random variables that are independent of $ε_{i},$ e.g., $α_{i} \sim \mathrm{Uniform}(0, 1)$ or $α_{i} = \mathrm{Mod}(h_{i}, 8) / 7 .$
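As a minimal sketch, the perturbed total $Z$ can be computed directly from the expression above. The function name `lpm_total` and its argument layout are assumptions for illustration; the noises `eps` (unit level), `eps_cell` (unit-cell level) and the mixing weights `alpha` are taken as given rather than derived from PRNs.

```python
def lpm_total(x, eps, eps_cell, alpha, K=1.0, L=1.0, M=1.0):
    """Perturbed cell total Z for a cell with at least 9 units.

    x is sorted in decreasing order; position r holds the unit of rank r + 1.
    """
    amplifiers = [K, L, M, 1.0]
    noise = 0.0
    # Layer 1 (ranks 1 to 4): consistent noise, direction reversed on even ranks.
    for r in range(4):
        sign = 1.0 if r % 2 == 0 else -1.0
        noise += sign * amplifiers[r] * eps[r] * x[r]
    # Layer 2 (ranks 5 to 9): mixture of unit and unit-cell noise.
    for r in range(4, 9):
        i = r + 1  # rank, as used in the exponent of (-1)^i
        noise -= ((-1) ** i * alpha[r] * eps[r]
                  - (1.0 - alpha[r]) * eps_cell[r]) * x[r]
    # Remaining units (rank 10 and beyond) are left unperturbed.
    return sum(x) + noise


# Made-up cell: constant noise of 0.1 and alpha = 1 make the arithmetic easy to follow.
x = [100, 80, 60, 40, 20, 10, 8, 6, 4, 2]
z = lpm_total(x, eps=[0.1] * 10, eps_cell=[0.0] * 10, alpha=[1.0] * 10)
print(sum(x), z)
```

With these toy inputs the first layer contributes $0.1 (100 - 80 + 60 - 40) = 4$ and the second layer $0.1 (20 - 10 + 8 - 6 + 4) = 1.6,$ so the total of 330 is published as 335.6.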

3.2 Some results

Let $ε_{i}, ε_{i'} \sim (0, σ_{ε}^{2}),$ $α_{i} \sim \mathrm{Uniform}(0, 1),$ i.i.d. and let $K, L$ and $M$ be fixed (for now). It follows that:

$E (Z) = X \quad \text{and} \quad V (Z) = \left\{ K^{2} x_{1}^{2} + L^{2} x_{2}^{2} + M^{2} x_{3}^{2} + x_{4}^{2} + \frac{2}{3} \sum_{i = 5}^{9} x_{i}^{2} \right\} σ_{ε}^{2} .$
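The factor $\frac{2}{3}$ for units 5 to 9 can be checked in one step, under the stated assumptions that $α_{i} \sim \mathrm{Uniform}(0, 1)$ is independent of the zero-mean noises $ε_{i}$ and $ε_{i'}$:

$E (α_{i}^{2}) = E \{ {(1 - α_{i})}^{2} \} = \int_{0}^{1} u^{2} \, du = \frac{1}{3},$

$V \left\{ {(- 1)}^{i} α_{i} ε_{i} - (1 - α_{i}) ε_{i'} \right\} = E (α_{i}^{2}) σ_{ε}^{2} + E \{ {(1 - α_{i})}^{2} \} σ_{ε}^{2} = \frac{2}{3} σ_{ε}^{2} .$

The cross term vanishes because $ε_{i}$ and $ε_{i'}$ are independent with mean zero.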

Let $X_{- 1}, X_{- 2}, X_{- 3}$ and $Z_{- 1}, Z_{- 2}, Z_{- 3}$ equal $X$ and $Z$ for the cell after removing units 1, 2 and 3, respectively. We keep subscripts from the original cell (i.e., subscript 2 refers to the unit that was second in $X$). Each unit ranked below the removed one moves up one rank, so the unit with original subscript $i \geq 6$ now has rank $i - 1$ and its noise direction is set by ${(- 1)}^{i - 1}.$ The unit-cell noises $ε_{i'}$ also take new values in each reduced cell, since $h_{\mathrm{tot}}$ changes. We have:

$\begin{array}{ll} Z_{- 1} & = X_{- 1} + K ε_{2} x_{2} - L ε_{3} x_{3} + M ε_{4} x_{4} - ε_{5} x_{5} - \sum_{i = 6}^{10} \left\{ {(- 1)}^{i - 1} α_{i} ε_{i} - (1 - α_{i}) ε_{i'} \right\} x_{i}, \\ Z_{- 2} & = X_{- 2} + K ε_{1} x_{1} - L ε_{3} x_{3} + M ε_{4} x_{4} - ε_{5} x_{5} - \sum_{i = 6}^{10} \left\{ {(- 1)}^{i - 1} α_{i} ε_{i} - (1 - α_{i}) ε_{i'} \right\} x_{i}, \text{ and} \\ Z_{- 3} & = X_{- 3} + K ε_{1} x_{1} - L ε_{2} x_{2} + M ε_{4} x_{4} - ε_{5} x_{5} - \sum_{i = 6}^{10} \left\{ {(- 1)}^{i - 1} α_{i} ε_{i} - (1 - α_{i}) ε_{i'} \right\} x_{i} . \end{array}$

We can obtain $Z_{- i}$ for other units similarly. If we estimate the dropped units as ${\hat{x}}_{i} = Z - Z_{- i}$ it can be shown that, with $G = \frac{8}{3} x_{5}^{2} + 2 \sum_{i = 6}^{9} x_{i}^{2} + \frac{2}{3} x_{10}^{2},$

$\begin{array}{ll} E ({\hat{x}}_{i}) & = x_{i}, \\ V ({\hat{x}}_{1}) & = \left\{ K^{2} x_{1}^{2} + {(K + L)}^{2} x_{2}^{2} + {(L + M)}^{2} x_{3}^{2} + {(M + 1)}^{2} x_{4}^{2} + G \right\} σ_{ε}^{2}, \\ V ({\hat{x}}_{2}) & = \left\{ L^{2} x_{2}^{2} + {(L + M)}^{2} x_{3}^{2} + {(M + 1)}^{2} x_{4}^{2} + G \right\} σ_{ε}^{2}, \\ V ({\hat{x}}_{3}) & = \left\{ M^{2} x_{3}^{2} + {(M + 1)}^{2} x_{4}^{2} + G \right\} σ_{ε}^{2}, \\ V ({\hat{x}}_{4}) & = \left\{ x_{4}^{2} + G \right\} σ_{ε}^{2}, \\ V ({\hat{x}}_{5}) & = \left\{ \frac{2}{3} x_{5}^{2} + 2 x_{6}^{2} + 2 x_{7}^{2} + 2 x_{8}^{2} + 2 x_{9}^{2} + \frac{2}{3} x_{10}^{2} \right\} σ_{ε}^{2}, \\ V ({\hat{x}}_{6}) & = \left\{ \frac{2}{3} x_{5}^{2} + \frac{2}{3} x_{6}^{2} + 2 x_{7}^{2} + 2 x_{8}^{2} + 2 x_{9}^{2} + \frac{2}{3} x_{10}^{2} \right\} σ_{ε}^{2}, \\ V ({\hat{x}}_{7}) & = \left\{ \frac{2}{3} x_{5}^{2} + \frac{2}{3} x_{6}^{2} + \frac{2}{3} x_{7}^{2} + 2 x_{8}^{2} + 2 x_{9}^{2} + \frac{2}{3} x_{10}^{2} \right\} σ_{ε}^{2}, \\ V ({\hat{x}}_{8}) & = \left\{ \frac{2}{3} x_{5}^{2} + \frac{2}{3} x_{6}^{2} + \frac{2}{3} x_{7}^{2} + \frac{2}{3} x_{8}^{2} + 2 x_{9}^{2} + \frac{2}{3} x_{10}^{2} \right\} σ_{ε}^{2}, \\ V ({\hat{x}}_{9}) & = \frac{2}{3} \left\{ x_{5}^{2} + x_{6}^{2} + x_{7}^{2} + x_{8}^{2} + x_{9}^{2} + x_{10}^{2} \right\} σ_{ε}^{2}, \text{ and} \\ V ({\hat{x}}_{i}) & = \frac{2}{3} \left\{ x_{5}^{2} + x_{6}^{2} + x_{7}^{2} + x_{8}^{2} + x_{9}^{2} \right\} σ_{ε}^{2}, \text{ for } i > 9. \end{array}$
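A small Monte Carlo experiment gives a rough check of $E ({\hat{x}}_{1})$ and $V ({\hat{x}}_{1}).$ The sketch below makes several assumptions that are not part of the results above: the noises are drawn as normal variates instead of being derived from PRNs, the $α_{i}$ are uniform, noise direction follows the rank of a unit within the cell being perturbed, and the cell values, seed and function names are invented for the example.

```python
import random


def lpm_noise(ids, x, eps, alpha, sigma, rng, K=1.0, L=1.0, M=1.0):
    """Total LPM noise for one cell; `ids` lists its units by decreasing value."""
    amplifiers = [K, L, M, 1.0]
    noise = 0.0
    for rank, u in enumerate(ids[:9], start=1):
        sign = 1.0 if rank % 2 == 1 else -1.0     # even ranks are reversed
        if rank <= 4:                             # consistent layer
            noise += sign * amplifiers[rank - 1] * eps[u] * x[u]
        else:                                     # semi-consistent layer
            eps_cell = rng.gauss(0.0, sigma)      # fresh unit-cell noise per cell
            noise += (sign * alpha[u] * eps[u]
                      + (1.0 - alpha[u]) * eps_cell) * x[u]
    return noise


def simulate_xhat1(x, sigma, n_rep, seed=12345):
    """Mean and variance of x_hat_1 = Z - Z_{-1} over n_rep replications."""
    rng = random.Random(seed)
    ids = list(range(len(x)))
    draws = []
    for _ in range(n_rep):
        eps = [rng.gauss(0.0, sigma) for _ in ids]   # unit noise, shared by both cells
        alpha = [rng.random() for _ in ids]
        z = sum(x) + lpm_noise(ids, x, eps, alpha, sigma, rng)
        z_minus_1 = sum(x[1:]) + lpm_noise(ids[1:], x, eps, alpha, sigma, rng)
        draws.append(z - z_minus_1)
    mean = sum(draws) / n_rep
    var = sum((d - mean) ** 2 for d in draws) / (n_rep - 1)
    return mean, var


def formula_var_xhat1(x, sigma, K=1.0, L=1.0, M=1.0):
    """V(x_hat_1) from the closed form, with G's leading coefficient equal to 8/3."""
    g = 8 / 3 * x[4] ** 2 + 2 * sum(v * v for v in x[5:9]) + 2 / 3 * x[9] ** 2
    return (K ** 2 * x[0] ** 2 + (K + L) ** 2 * x[1] ** 2
            + (L + M) ** 2 * x[2] ** 2 + (M + 1) ** 2 * x[3] ** 2 + g) * sigma ** 2


x = [100, 80, 60, 40, 20, 10, 8, 6, 4, 2]   # made-up cell of ten units
mean, var = simulate_xhat1(x, sigma=0.1, n_rep=30000)
print(mean, var, formula_var_xhat1(x, sigma=0.1))
```

Under these assumptions the simulated mean should sit close to $x_{1}$ and the simulated variance close to the closed form, up to Monte Carlo error.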

If we assume that $K, L$ and $M$ are fixed we can set them based on some requirement for $V ({\hat{x}}_{i}) .$ For example, we may want to have $V ({\hat{x}}_{i}) = x_{i}^{2} / 30$ since, for $z \sim N (0, 1),$ $\Pr (| z | > 0.44) = 0.66,$ which for ${\hat{x}}_{i} \sim N (x_{i}, x_{i}^{2} / 30)$ gives $\Pr \{ | {\hat{x}}_{i} - x_{i} | \geq 8\% \, x_{i} \} = 66\% .$
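The arithmetic behind this example can be reproduced with the standard normal tail probability. The helper name `two_sided_tail` is an assumption for this sketch; it relies on the identity $\Pr (| z | > c) = 1 - \mathrm{erf}(c / \sqrt{2}).$

```python
import math


def two_sided_tail(c: float) -> float:
    """Pr(|z| > c) for z ~ N(0, 1), computed from the error function."""
    return 1.0 - math.erf(c / math.sqrt(2.0))


# A deviation of 8% of x_i, expressed in standard deviations of x_i / sqrt(30).
c = 0.08 * math.sqrt(30)
print(round(c, 3), round(two_sided_tail(c), 3), round(two_sided_tail(0.44), 3))
```

The threshold works out to about 0.44 standard deviations, and both tail probabilities round to 0.66.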

To obtain $V ({\hat{x}}_{i}) = x_{i}^{2} / \mathit{NN}$ we can solve (fixed) $K, L$ and $M$ in reverse order. This gives

$\begin{array}{ll} M & = \frac{\sqrt{(x_{3}^{2} + x_{4}^{2}) \left( \frac{x_{3}^{2}}{\mathit{NN} \, σ_{ε}^{2}} - G \right) - x_{3}^{2} x_{4}^{2}} - x_{4}^{2}}{x_{3}^{2} + x_{4}^{2}}, \\ L & = \frac{\sqrt{(x_{2}^{2} + x_{3}^{2}) \left( \frac{x_{2}^{2}}{\mathit{NN} \, σ_{ε}^{2}} - G - x_{4}^{2} {(M + 1)}^{2} \right) - M^{2} x_{2}^{2} x_{3}^{2}} - M x_{3}^{2}}{x_{2}^{2} + x_{3}^{2}}, \\ K & = \frac{\sqrt{(x_{1}^{2} + x_{2}^{2}) \left( \frac{x_{1}^{2}}{\mathit{NN} \, σ_{ε}^{2}} - G - x_{3}^{2} {(L + M)}^{2} - x_{4}^{2} {(M + 1)}^{2} \right) - L^{2} x_{1}^{2} x_{2}^{2}} - L x_{2}^{2}}{x_{1}^{2} + x_{2}^{2}} . \end{array}$
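The reverse-order solution can be sketched in code and checked by substituting the results back into the variance formulas. The function names, the cell values and the choice $σ_{ε} = 0.1$ are assumptions for illustration, and no bounds are applied to the solved values.

```python
import math


def solve_amplifier(xa2, xb2, c, target):
    """Larger root A of A^2 * xa2 + (A + c)^2 * xb2 = target."""
    disc = (xa2 + xb2) * target - c * c * xa2 * xb2
    # math.sqrt raises ValueError when the target variance cannot be reached.
    return (math.sqrt(disc) - c * xb2) / (xa2 + xb2)


def solve_klm(x, sigma2, nn):
    """Solve M, then L, then K so that V(x_hat_i) = x_i^2 / nn for i = 3, 2, 1."""
    s = [v * v for v in x]
    g = 8 / 3 * s[4] + 2 * sum(s[5:9]) + 2 / 3 * s[9]
    m = solve_amplifier(s[2], s[3], 1.0, s[2] / (nn * sigma2) - g)
    l = solve_amplifier(s[1], s[2], m,
                        s[1] / (nn * sigma2) - g - s[3] * (m + 1) ** 2)
    k = solve_amplifier(s[0], s[1], l,
                        s[0] / (nn * sigma2) - g
                        - s[2] * (l + m) ** 2 - s[3] * (m + 1) ** 2)
    return k, l, m


x = [100, 80, 60, 40, 20, 10, 8, 6, 4, 2]   # made-up cell of ten units
sigma2, nn = 0.01, 30
k, l, m = solve_klm(x, sigma2, nn)

# Substitute back into the variance formulas for units 3, 2 and 1.
s = [v * v for v in x]
g = 8 / 3 * s[4] + 2 * sum(s[5:9]) + 2 / 3 * s[9]
v3 = (m ** 2 * s[2] + (m + 1) ** 2 * s[3] + g) * sigma2
v2 = (l ** 2 * s[1] + (l + m) ** 2 * s[2] + (m + 1) ** 2 * s[3] + g) * sigma2
v1 = (k ** 2 * s[0] + (k + l) ** 2 * s[1] + (l + m) ** 2 * s[2]
      + (m + 1) ** 2 * s[3] + g) * sigma2
print(k, l, m, v1, v2, v3)
```

For this invented cell the unconstrained $K$ and $L$ come out below 1 while $M$ is slightly above 1, so the variance targets would no longer be met exactly once lower bounds are imposed on the amplifiers.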

In practice, $L$ and $M$ are bounded below at 1 and above at some threshold value less than 2, and $K$ is bounded below at 1 and can taper off above the threshold. Also, the target values of $K, L$ and $M$ depend on the situation in each cell. Here, for simplicity of illustration, they were assumed not to change when we removed observations from the cell.

Using the same noise and changing its direction for even-ranked units means that we take advantage of the correlation between the $Z$ and $Z_{- i}$ to increase the variance of ${\hat{x}}_{i} = Z - Z_{- i} .$ For example, the contribution to $V ({\hat{x}}_{1})$ from unit 2 is ${(K + L)}^{2} x_{2}^{2} σ_{ε}^{2} .$ If we had used independent (or unit-cell specific) noises $ε_{i'}$ instead of $ε_{i}$ for units 1 to 4 the contribution from unit 2 would have been only $(K^{2} + L^{2}) x_{2}^{2} σ_{ε}^{2} .$
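To make the comparison explicit, consider only the terms in $ε_{2}.$ Unit 2 enters $Z$ with $- L ε_{2} x_{2}$ and $Z_{- 1}$ with $+ K ε_{2} x_{2},$ so that

${\hat{x}}_{1} = Z - Z_{- 1} \text{ contains } - (K + L) ε_{2} x_{2}, \text{ with variance } {(K + L)}^{2} x_{2}^{2} σ_{ε}^{2} .$

With independent noises in the two cells the two terms would not combine, and the gain from using the same noise is

${(K + L)}^{2} x_{2}^{2} σ_{ε}^{2} - (K^{2} + L^{2}) x_{2}^{2} σ_{ε}^{2} = 2 K L x_{2}^{2} σ_{ε}^{2} > 0 .$

For instance, with $K = L = 1$ the contribution doubles, from $2 x_{2}^{2} σ_{ε}^{2}$ to $4 x_{2}^{2} σ_{ε}^{2} .$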

3.3 Comparison with the EZS and TCM approaches

With EZS the perturbed cell total is simply $Z = X + \sum_{i \in \mathrm{cell}} ε_{i} x_{i},$ giving $V (Z) = \sum_{i \in \mathrm{cell}} x_{i}^{2} σ_{ε}^{2} .$ For any unit $i$ we have $E ({\hat{x}}_{i}) = x_{i}$ and $V ({\hat{x}}_{i}) = x_{i}^{2} σ_{ε}^{2},$ which is smaller than the equivalent variance with the LPM for the same level of noise $σ_{ε}^{2},$ even when we set $K = L = M = 1.$ A possible exception is unit 5, if the subsequent units are relatively small. This can be seen by examining $V ({\hat{x}}_{5})$ above.
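The possible exception for unit 5 can be illustrated numerically. The two cells below are invented for the purpose, and the function names are assumptions; the formulas are the EZS variance above and the LPM expression for $V ({\hat{x}}_{5})$ from Section 3.2.

```python
def var_xhat5_lpm(x, sigma2):
    """LPM variance of x_hat_5, using the units ranked 5 to 10."""
    s = [v * v for v in x]
    return (2 / 3 * s[4] + 2 * sum(s[5:9]) + 2 / 3 * s[9]) * sigma2


def var_xhat_ezs(x_i, sigma2):
    """EZS variance of x_hat_i: only the dropped unit's own noise matters."""
    return x_i ** 2 * sigma2


sigma2 = 0.01
gradual = [100, 80, 60, 40, 20, 10, 8, 6, 4, 2]   # units 6 to 10 still sizeable
steep = [100, 80, 60, 40, 20, 1, 1, 1, 1, 1]      # units 6 to 10 very small

for cell in (gradual, steep):
    print(var_xhat5_lpm(cell, sigma2), var_xhat_ezs(cell[4], sigma2))
```

In the first cell the LPM variance for unit 5 exceeds the EZS value; in the second, where units 6 to 10 are tiny compared with unit 5, it falls below it.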

The TCM applies three multiplicative perturbation factors to the largest units in each cell, say the largest four. A magnitude component $M_{i}$ determines the relative size of the perturbation for the $i^{th}$ ranked unit. The $M_{i}$ are fixed; typically $M_{1} > M_{2} > M_{3} > M_{4},$ e.g., $[0.6, 0.4, 0.3, 0.2] .$ A permanent random factor $d_{i} = \pm 1$ fixes the direction of the noise for each unit $i .$ A pseudo-random factor $s_{i} > 0$ determines unit-cell specific noises. This gives $Z = X + \sum_{i = 1}^{4} M_{i} d_{i} s_{i} x_{i} .$ The method can be represented in a form comparable to the LPM, with $[M_{1}, M_{2}, M_{3}, M_{4}] = [K, L, M, 1],$ $d_{i} = \mathrm{sign} (ε_{i})$ and $s_{i} = | ε_{i'} | .$ The way the $d_{i}$ are fixed is a major difference from the LPM, and it greatly diminishes the protection offered against ${\hat{x}}_{1} .$ To illustrate this, consider two adaptations of these methods that yield identical variances for $Z :$

$\begin{array}{ll} Z_{LPM} & = X + K ε_{1} x_{1} - L ε_{2} x_{2} + M ε_{3} x_{3} - ε_{4} x_{4}, \text{ and} \\ Z_{TCM} & = X + K \, \mathrm{sign} (ε_{1}) | ε_{1'} | x_{1} + L \, \mathrm{sign} (ε_{2}) | ε_{2'} | x_{2} + M \, \mathrm{sign} (ε_{3}) | ε_{3'} | x_{3} + \mathrm{sign} (ε_{4}) | ε_{4'} | x_{4}, \end{array}$

where the same notational conventions as before are used, with fixed $K, L, M > 0.$ This yields

$\begin{array}{ll} V_{LPM} ({\hat{x}}_{1}) & = \left\{ K^{2} x_{1}^{2} + {(K + L)}^{2} x_{2}^{2} + {(L + M)}^{2} x_{3}^{2} + {(M + 1)}^{2} x_{4}^{2} + x_{5}^{2} \right\} σ_{ε}^{2}, \text{ and} \\ V_{TCM} ({\hat{x}}_{1}) & = K^{2} x_{1}^{2} σ_{ε}^{2} + \left\{ (K^{2} + L^{2}) x_{2}^{2} + (L^{2} + M^{2}) x_{3}^{2} + (M^{2} + 1) x_{4}^{2} \right\} σ_{| ε |}^{2} + x_{5}^{2} σ_{ε}^{2} . \end{array}$

Not only are factors such as ${(K + L)}^{2}$ larger than $(K^{2} + L^{2}),$ but the variance for the noise, $σ_{ε}^{2},$ is often replaced with that of the absolute noise, $σ_{| ε |}^{2},$ which is much smaller. For the split triangular distribution it goes from $(3 a^{2} + 2 a b + b^{2}) / 6$ to ${(b - a)}^{2} / 18 .$ When $b = 2 a$ this means dropping from $11 a^{2} / 6$ to $a^{2} / 18 .$
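These two variances can be evaluated exactly with rational arithmetic. The sketch assumes that $| ε |$ follows a triangular distribution on $[a, b]$ with its mode at $a$ and that the sign of $ε$ is symmetric; this reading is consistent with both expressions quoted above, since such a distribution has mean $(2 a + b) / 3$ and variance ${(b - a)}^{2} / 18 .$ The function names are invented for the example.

```python
from fractions import Fraction


def var_noise(a, b):
    """Variance of the signed noise: E(eps^2), since E(eps) = 0 by symmetry."""
    return (3 * a * a + 2 * a * b + b * b) / Fraction(6)


def var_abs_noise(a, b):
    """Variance of |eps| for a triangular density on [a, b] with mode at a."""
    return (b - a) ** 2 / Fraction(18)


def mean_abs_noise(a, b):
    """Mean of |eps| under the same assumption."""
    return (2 * a + b) / Fraction(3)


a, b = 1, 2                       # the case b = 2a, in units where a = 1
print(var_noise(a, b), var_abs_noise(a, b), var_noise(a, b) / var_abs_noise(a, b))

# Consistency check: E(eps^2) = V(|eps|) + E(|eps|)^2.
assert var_noise(a, b) == var_abs_noise(a, b) + mean_abs_noise(a, b) ** 2
```

With $b = 2 a$ the two variances are $11 a^{2} / 6$ and $a^{2} / 18,$ a ratio of 33.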

This is not a legitimate comparison of the two methods. We are not using the actual LPM, and method parameters need not be identical. But it shows the impact of the different approaches taken for the $d_{i} .$

ISSN : 1492-0921

Editorial policy

Survey Methodology publishes articles dealing with various aspects of statistical development relevant to a statistical agency, such as design issues in the context of practical constraints, use of different data sources and collection techniques, total survey error, survey evaluation, research in survey methodology, time series analysis, seasonal adjustment, demographic studies, data integration, estimation and data analysis methods, and general survey systems development. The emphasis is placed on the development and evaluation of specific methodologies as applied to data collection or the data themselves. All papers will be refereed. However, the authors retain full responsibility for the contents of their papers and opinions expressed are not necessarily those of the Editorial Board or of Statistics Canada.

Submission of Manuscripts

Survey Methodology is published twice a year in electronic format. Authors are invited to submit their articles in English or French in electronic form, preferably in Word to the Editor, (statcan.smj-rte.statcan@canada.ca, Statistics Canada, 150 Tunney’s Pasture Driveway, Ottawa, Ontario, Canada, K1A 0T6). For formatting instructions, please see the guidelines provided in the journal and on the web site (www.statcan.gc.ca/SurveyMethodology).

Note of appreciation

Canada owes the success of its statistical system to a long-standing partnership between Statistics Canada, the citizens of Canada, its businesses, governments and other institutions. Accurate and timely statistical information could not be produced without their continued co-operation and goodwill.

Standards of service to the public

Statistics Canada is committed to serving its clients in a prompt, reliable and courteous manner. To this end, the Agency has developed standards of service which its employees observe in serving its clients.

Copyright

Published by authority of the Minister responsible for Statistics Canada.

Use of this publication is governed by the Statistics Canada Open Licence Agreement.

Catalogue No. 12-001-X

Frequency: semi-annual

Ottawa

Date modified: 2017-06-22
