2 Methodology

Stephen J. Kaputa and Katherine Jenny Thompson

2.1  Decile estimation

We consider two approaches to decile estimation for continuous data: the sample decile (SD) method and interpolation. The SD method uses ordered sample weights to locate the estimate (Rao and Shao 1996). The characteristic values are sorted in ascending order, and the corresponding sample weights are accumulated until they exceed the desired decile's share of the total weight; the value at which this occurs is the decile estimate.
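The SD method described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `sample_decile` is hypothetical, and the sketch stops at the first value whose cumulative weight reaches (is greater than or equal to) the target, which is an assumption about how ties at the boundary are handled.

```python
import numpy as np

def sample_decile(values, weights, d):
    """Weighted sample decile: sort the values, accumulate the weights,
    and return the first value whose cumulative weight reaches d * total."""
    values = np.asarray(values, dtype=float)
    weights = np.asarray(weights, dtype=float)
    order = np.argsort(values, kind="stable")   # ascending order of the characteristic
    v, w = values[order], weights[order]
    cum_w = np.cumsum(w)                        # accumulated sample weights
    target = d * cum_w[-1]                      # decile's share of the total weight
    # Assumption: ">=" rather than a strict ">" at the boundary.
    idx = int(np.searchsorted(cum_w, target, side="left"))
    return float(v[min(idx, len(v) - 1)])
```

For example, with values 5, 1, 3 and weights 2, 1, 1, the ordered cumulative weights are 1, 2, 4, so the median (d = 0.5, target 2) is the value 3.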

Interpolation methods group the continuous data in bins and interpolate over the bin containing the decile. To obtain the decile estimate ($\xi^{d}$), we use the Woodruff formula (Woodruff 1952) for interpolation provided below:

$$\xi^{d} = F^{-1}(d\hat{N}) \approx ll + \left(\frac{d\hat{N} - cf}{f_{i}}\right) * (i) \qquad (2.1)$$

where

$F =$ cumulative frequency of the characteristic using sample weights,

$ll =$ lower limit of the bin containing the decile,

$\hat{N} =$ estimated total number of elements in the population,

$cf =$ cumulative frequency in all intervals preceding the bin containing the sample decile,

$f_{i} =$ decile class frequency (estimated total number of elements in the population of the interval containing the sample decile),

$i =$ width of the bin containing the sample decile,

$d =$ desired decile (0.1, 0.2, 0.3, … , 0.9).
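Formula (2.1) and the definitions above can be sketched in code as follows. This is an illustrative sketch under stated assumptions: the function name `woodruff_decile` and the argument `edges` are hypothetical, the bin edges are assumed to cover every observation, and the bin containing the decile is taken to be the first bin whose cumulative weighted frequency reaches $d\hat{N}$.

```python
import numpy as np

def woodruff_decile(values, weights, edges, d):
    """Interpolated decile: ll + ((d * N_hat - cf) / f_i) * i, as in (2.1)."""
    values = np.asarray(values, dtype=float)
    weights = np.asarray(weights, dtype=float)
    edges = np.asarray(edges, dtype=float)
    # Weighted frequency of each bin (bins need not have equal width).
    freq, _ = np.histogram(values, bins=edges, weights=weights)
    cum = np.cumsum(freq)
    n_hat = cum[-1]                    # estimated population total
    target = d * n_hat
    # First bin whose cumulative frequency reaches the target (assumption).
    k = int(np.searchsorted(cum, target, side="left"))
    ll = edges[k]                      # lower limit of the decile bin
    cf = cum[k - 1] if k > 0 else 0.0  # cumulative frequency before the bin
    f_i = freq[k]                      # weighted frequency of the decile bin
    width = edges[k + 1] - edges[k]    # width of the decile bin
    return float(ll + (target - cf) / f_i * width)
```

With ten equally weighted values 0.5, 1.5, …, 9.5 and two bins [0, 5) and [5, 10], the 0.8 decile falls in the second bin and the interpolation gives 5 + ((8 − 5)/5) × 5 = 8.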

Notice that this formula does not require the bins to be of equal width. However, it does require that the data within each bin be uniformly distributed. This latter requirement poses the true challenge with a highly skewed population, especially in the upper tail.

Figure 2.1 below illustrates how to use the Woodruff method to estimate the 80th percentile (the eighth decile). The sample data have been grouped into twelve separate bins. The empirical CDF is produced from the complete set of weighted sample data. (As one referee noted, the empirical CDF shown is extremely smooth for sample survey data; in practice, the curve would include discrete steps. The Woodruff procedure would be the same, however.) The decile estimate is located at the intersection of the empirical CDF curve and the red horizontal line at $Y = 0.80$. The 80th percentile is $F^{-1}(0.80)$, contained in the fifth bin; the interpolated estimate of the 80th percentile would therefore be obtained by using (2.1) over the fifth bin.

Figure 2.1 Illustration of the Woodruff method

Determining the optimal bin size for both estimation and variance estimation can be difficult. As the bins narrow (approaching width 1), the variance estimates become more unstable. Smoothing the estimates via interpolation reduces the instability of the variance estimates but increases the bias in the estimate, and this bias component grows as the bin widths increase.

Economic data generally have a positively skewed distribution. Moreover, the subdomains' characteristic distributions will vary, and their respective moments change over time as the economy changes. Consequently, developing a standard set of fixed bins for interpolation that works consistently over time is nearly impossible. Instead, Thompson and Sigman (2000) developed a "data-dependent" binning procedure, where the width of each bin is determined separately by the estimation cell. Their recommended method linearly transforms each characteristic to a standard scale and then uses a standard set of bins for every characteristic. The authors use the following linear transformation

$$X'_{i} = X_{i} \times \frac{1{,}000}{Q_{75}}$$

where $Q_{75}$ is the 75th percentile (3rd quartile) of the sample distribution, obtained using the SD method. The interpolated median estimate of $X'$ is multiplied by $(Q_{75}/1{,}000)$ to obtain a value on the original scale.
This procedure is equivalent to simply dividing the original sample in each estimation cell from 0 to $Q_{75}$ into $Z$ bins of equal width and placing the remainder of the sample into one bin, which, by design, is much larger than the others. With the highly positively skewed housing data, this transformation works well for estimating the median because the median is far from the $Q_{75}$ scaling parameter. However, it does not permit estimation of either the 80th or the 90th percentile, since both lie in the single large final bin. Thus, if we wanted to continue using an interpolation method, we needed to consider alternative transformations.
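The data-dependent binning just described can be sketched as follows. The function names, the argument `q_scale` (the scaling percentile, $Q_{75}$ in the original procedure) and the argument `n_bins` (playing the role of $Z$, whose value is not given here) are hypothetical, and the data are assumed to be nonnegative.

```python
import numpy as np

def scaled_bin_edges(values, q_scale, n_bins):
    """Rescale values so that q_scale maps to 1,000, then build n_bins
    equal-width bins on [0, 1000] plus one catch-all bin above 1,000."""
    x_scaled = np.asarray(values, dtype=float) * 1000.0 / q_scale
    inner = np.linspace(0.0, 1000.0, n_bins + 1)   # equal-width bins up to q_scale
    top = max(float(x_scaled.max()), 1000.0)
    # The final bin holds everything above the scaling percentile.
    edges = inner if top == 1000.0 else np.append(inner, top)
    return x_scaled, edges

def to_original_scale(estimate_scaled, q_scale):
    """Invert the linear transformation for an interpolated estimate."""
    return estimate_scaled * q_scale / 1000.0
```

Because only the scaling value changes between estimation cells, the same set of edges on the transformed scale can be reused for every characteristic.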

The simplest approach is to use the original data-dependent bin method with a higher scaling parameter, i.e., any percentile above the 90th. We use the 95th percentile as the scaling factor and hereafter refer to this method as the "P95 method".

The P95 method does create uniform distributions within the majority of the bins but is still problematic at the upper end of the distribution for two reasons. First, the final bin contains only five percent of the sample distribution, and the values within this bin are generally very different. Second, the data-dependent binning procedure requires that each decile be "far from" the large final bin; if not, then the decile estimates exhibit the same instability as those obtained using the SD method. Unfortunately, the bin containing the 90th percentile is often close to the final bin when using a scaling parameter of 95%.

To address the second issue, we considered another data-binning approach, denoted as the "P75 method". For this, we create two sets of bins per estimation cell, each with different widths above and below the cell's $Q_{75}$ value. This requires two separate linear transformations per estimation cell, given by

$$X'_{i} = X_{i} \times \frac{1{,}000}{X_{75}} \quad \text{when } X_{i} < X_{75}$$

$$X''_{i} = (X_{i} - X_{75}) \times \frac{1{,}000}{(X_{100} - X_{75})} \quad \text{when } X_{i} \ge X_{75} \text{ and } X_{100} = \text{maximum value in sample.}$$

The $X'_{i}$ is then placed into $Z$ equal-length bins, and $X''_{i}$ into $K$ equal-length bins, where $Z \ne K$. The interpolation is performed independently for each decile, with the appropriate inverse transformation being applied to each interpolated decile. This procedure ensures that median estimates exactly match those obtained with the current procedure.
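A minimal sketch of the two P75 transformations and their inverses follows. The function names are hypothetical, the threshold `x75` is assumed to have been computed beforehand (for example with the SD method), and the sample maximum is assumed to be strictly larger than `x75`.

```python
import numpy as np

def p75_transform(values, x75):
    """Split the sample at x75 and rescale each part to [0, 1000]."""
    x = np.asarray(values, dtype=float)
    x_max = float(x.max())                  # X_100, the sample maximum
    below = x < x75
    x_lower = x[below] * 1000.0 / x75                        # X' for X_i < X_75
    x_upper = (x[~below] - x75) * 1000.0 / (x_max - x75)     # X'' for X_i >= X_75
    return x_lower, x_upper, x_max

def p75_inverse(estimate, x75, x_max, upper):
    """Map an interpolated estimate back to the original scale."""
    if upper:
        return x75 + estimate * (x_max - x75) / 1000.0
    return estimate * x75 / 1000.0
```

Deciles below the threshold would be interpolated over the bins built on the first output and inverted with `upper=False`; deciles above it would use the second output and `upper=True`.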

Our third interpolation approach makes parametric assumptions about the characteristics. Often, economic data are approximately log-normally distributed (e.g., Steel and Fay 1995). The Normal Binning method (denoted "NB") uses the properties of the normal distribution applied to the log-transformed data to obtain data-dependent bins. The binning technique ensures that areas of high probability have smaller bin widths to limit the number of observations per bin and areas of low probability have larger bin widths to increase the number of observations per bin.

The NB method centers the log-transformed data around the weighted sample median, then scales the centered data by an estimate of the population standard deviation. We use the sample median because it is more outlier resistant than the sample mean. Of course, the mean and median are equal for normally distributed data. Given a standard normal distribution where $\mu = 0$ and $\sigma = 1$, then

$$IQR = Q_{3} - Q_{1}$$

$$IQR = (0.67449 * \sigma) - (-0.67449 * \sigma) = \sigma(0.67449 + 0.67449) = \sigma * 1.34898.$$

We estimated the standard deviation (sigma) as the ratio $\sigma \approx IQR / 1.34898$, where the IQR is obtained from the empirical CDF in the estimation cell. To normalize the data, we applied the following transformation

$$Y_{i} = \log(X_{i})$$

$$Y'_{i} = \frac{Y_{i} - Y_{\text{med}}}{\sigma_{y}} = \frac{Y_{i} - Y_{\text{med}}}{IQR_{y} / 1.34898}$$

where

$Y_{\text{med}} =$ log-transformed sample median over domain $i$,

$IQR_{y} =$ log-transformed sample interquartile range over domain $i$.
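The normalizing transformation can be sketched as follows. This is an illustration only: the function names are hypothetical, the quartiles and median are computed with a simple weighted sample quantile (first value whose cumulative weight reaches the target, an assumption about boundary handling), and all values are assumed to be strictly positive so that the logarithm is defined.

```python
import numpy as np
from statistics import NormalDist

# IQR of a normal distribution in units of sigma; approximately 1.34898.
IQR_TO_SIGMA = 2 * NormalDist().inv_cdf(0.75)

def weighted_quantile(values, weights, p):
    """First ordered value whose cumulative weight reaches p * total weight."""
    order = np.argsort(values, kind="stable")
    v = np.asarray(values, dtype=float)[order]
    cum_w = np.cumsum(np.asarray(weights, dtype=float)[order])
    idx = int(np.searchsorted(cum_w, p * cum_w[-1], side="left"))
    return float(v[min(idx, len(v) - 1)])

def nb_standardize(values, weights):
    """Log-transform, center at the weighted median, scale by IQR / 1.34898."""
    y = np.log(np.asarray(values, dtype=float))
    q1 = weighted_quantile(y, weights, 0.25)
    y_med = weighted_quantile(y, weights, 0.50)
    q3 = weighted_quantile(y, weights, 0.75)
    sigma_y = (q3 - q1) / IQR_TO_SIGMA
    return (y - y_med) / sigma_y, y_med, sigma_y
```

The returned `y_med` and `sigma_y` are kept so that an estimate on the standardized scale can later be mapped back to the log scale.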

Again, the sample deciles and interquartile ranges are obtained via the SD method. If the data are log-normally distributed, $Y'_{i}$ should have a standard normal distribution, so that roughly 68.3% of the data are within one standard deviation of the mean and 95.4% of the data are within two standard deviations of the mean. Using those properties, we split the transformed $Y'_{i}$ into five different zones and created the 45 bins shown in Table 2.1.

Table 2.1
Bins for the log-normal transformation (Normal method)

| Zone | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|
| Range | [Low, -2) | [-2, -1) | [-1, 1) | [1, 2) | [2, High] |
| Percent in Zone | 2.3 | 13.6 | 68.2 | 13.6 | 2.3 |
| Bins | 1 | 6 | 31 | 6 | 1 |
| Average Percent of Sample Units per Bin | 2.3 | 2.3 | 2.2 | 2.3 | 2.3 |

There are four different bin widths with roughly the same average percentage of sampled units per bin. Woodruff's method is applied to the transformed data to obtain the deciles, and, after reversing the centering and scaling, we exponentiate these decile estimates to obtain values on the original scale. Unlike the linear rescaling methods presented above, there is an additional induced estimation bias caused by the nonlinear log transformation. It may have been possible to make a bias adjustment for the transformation via a Taylor expansion, as suggested by a referee, but we did not consider this approach.
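The bin layout of Table 2.1 and the back-transformation can be sketched as follows. The function names are hypothetical, and the sketch assumes that the bins within each of zones 2, 3 and 4 have equal width, which the table implies but does not state; `low` and `high` stand for the smallest and largest standardized values.

```python
import numpy as np

def nb_bin_edges(low, high):
    """Return the 46 edges of the 45 bins on the standardized log scale."""
    if not (low < -2.0 and high > 2.0):
        raise ValueError("low must be below -2 and high above 2")
    zone2 = np.linspace(-2.0, -1.0, 7)    # 6 bins in [-2, -1)
    zone3 = np.linspace(-1.0, 1.0, 32)    # 31 bins in [-1, 1)
    zone4 = np.linspace(1.0, 2.0, 7)      # 6 bins in [1, 2)
    # Zones 1 and 5 are single bins: [low, -2) and [2, high].
    return np.concatenate(([low], zone2, zone3[1:], zone4[1:], [high]))

def nb_back_transform(decile_std, y_med, sigma_y):
    """Undo the scaling and centering, then exponentiate."""
    return float(np.exp(y_med + sigma_y * decile_std))
```

An interpolated decile computed over these edges is on the standardized scale, so it is passed through `nb_back_transform` with the median and scale used in the forward transformation.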

2.2  Variance estimation

The MHS replication method (aka "Fay's method") is a "compromise" between the stratified jackknife and the BRR method (Fay 1989). Rao and Shao (1999) demonstrate that the MHS variance estimator is asymptotically consistent for both smooth statistics such as ratio estimators and for non-smooth statistics such as sample quantiles estimated using the SD method outlined in Section 2.1. Their paper does not extend this property to interpolated decile estimates, although it does follow that these variance estimates should be consistent as the bin width approaches 1. Like BRR, MHS replication uses a Hadamard matrix to form replicates, but uses replicate weights of 1.5 and 0.5 instead of the values of 2 and 0 used in BRR. The MHS formula for standard error estimation of any estimate $\hat{\theta}$ is

$$\hat{S}(\hat{\theta}) = \sqrt{\frac{4}{R} * \sum_{r=1}^{R} (\hat{\theta}_{r} - \hat{\theta}_{0})^{2}} \qquad (2.2)$$

where $\hat{\theta}_{r}$ is the $r^{\text{th}}$ replicate estimate $(r = 1, 2, \ldots, R)$ and $\hat{\theta}_{0}$ is the full sample estimate.
The sum of squared error term is adjusted by a factor of $4 = 1/(1 - 0.5)^{2}$ to prevent negative bias in the variance estimate (Judkins 1990).
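Formula (2.2) can be sketched in a few lines. The function name and the `fay_coef` argument are hypothetical; the default of 0.5 corresponds to the replicate weights of 1.5 and 0.5 and reproduces the factor of 4 in (2.2). The replicate estimates are assumed to have been computed already, one per row of the Hadamard matrix.

```python
import numpy as np

def mhs_standard_error(replicate_estimates, full_estimate, fay_coef=0.5):
    """Standard error from MHS (Fay) replicates, following (2.2)."""
    reps = np.asarray(replicate_estimates, dtype=float)
    n_reps = len(reps)
    # 1 / (R * (1 - 0.5)^2) = 4 / R for the default coefficient.
    scale = 1.0 / (n_reps * (1.0 - fay_coef) ** 2)
    return float(np.sqrt(scale * np.sum((reps - full_estimate) ** 2)))
```

For example, with four replicate estimates 9, 11, 9, 11 and a full sample estimate of 10, the sum of squared deviations is 4, so the standard error is the square root of (4/4) × 4, which is 2.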
