
- Raw data
- Ungrouped frequency distribution
- Grouped frequency distribution
- Stem and leaf plots
- Comparing the mean and median

If observations of a variable are ordered by value, the median value corresponds to the middle observation in that ordered list. The median value corresponds to a cumulative percentage of 50% (i.e., 50% of the values are below the median and 50% of the values are above the median). The position of the median is

**{(n + 1) ÷ 2}^{th} value**, where n is the number of values in the data set.

In order to calculate the median, the data must first be ranked (sorted in ascending order). The median is the number in the middle.

**Median = the middle value of a set of ordered data.**

The median is usually calculated for numeric variables, but may also be calculated for categorical variables that are sequenced, such as the categories in a satisfaction survey: excellent, good, satisfactory and poor. These qualitative categories can be ranked in order, and thus, are considered ordinal.

In raw data, the median is the point at which exactly half of the data are above and half below. These halves meet at the median position. If the number of observations is odd, the median fits perfectly and the depth of the median position will be a whole number. If the number of observations is even, the depth of the median position will include a decimal. You need to find the midpoint between the numbers on either side of the median position.

Imagine that a top running athlete in a typical 200-metre training session runs in the following times:

26.1, 25.6, 25.7, 25.2 and 25.0 seconds.

How would you calculate his median time?

First, the values are put in ascending order: 25.0, 25.2, 25.6, 25.7, 26.1. Then, using the following formula, figure out which value is the middle value. Remember that n represents the number of values in the data set.

**Median = {(n + 1) ÷ 2} ^{th} value**

= (5 + 1) ÷ 2

= 3

The third value in the data set will be the median. Since 25.6 is the third value, 25.6 seconds would be the median time.

**= 25.6 seconds**

Now, if the runner sprints the sixth 200-metre race in 24.7 seconds, what is the median value now?

Again, you first put the data in ascending order: 24.7, 25.0, 25.2, 25.6, 25.7, 26.1. Then, you use the same formula to calculate the median time.

**Median = {(n + 1) ÷ 2} ^{th} value**

= (6 + 1) ÷ 2

= 7 ÷ 2

= 3.5

Since there is an even number of observations in this data set, there is no longer a distinct middle value. The median is the 3.5^{th} value in the data set, meaning that it lies between the third and fourth values. Thus, the median is calculated by averaging the two middle values, 25.2 and 25.6. Use the formula below to get the average value.

**Average = (value below median + value above median) ÷ 2**

= (third value + fourth value) ÷ 2

= (25.2 + 25.6) ÷ 2

= 50.8 ÷ 2

= 25.4

The value 25.4 falls directly between the third and fourth values in this data set, so 25.4 seconds would be the median time.
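The two worked examples above can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `median_raw` and the variable names are assumptions chosen for this example, not part of the original text.

```python
import math

def median_raw(values):
    """Median of raw data using the {(n + 1) / 2}th-value rule."""
    ordered = sorted(values)              # rank the data in ascending order
    n = len(ordered)
    if n == 0:
        raise ValueError("the median of an empty data set is undefined")
    position = (n + 1) / 2                # 1-based position of the median
    # For odd n the position is whole, so both indices pick the same value.
    # For even n the position ends in .5, so they pick the two neighbours.
    lower = ordered[math.floor(position) - 1]
    upper = ordered[math.ceil(position) - 1]
    return (lower + upper) / 2

five_runs = [26.1, 25.6, 25.7, 25.2, 25.0]
six_runs = five_runs + [24.7]

print(median_raw(five_runs))              # third of five ordered values
print(round(median_raw(six_runs), 1))     # midpoint of third and fourth values
```

Averaging the two neighbouring values covers both cases with one expression, which is why no separate branch for odd and even n is needed.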

In order to find the median using cumulative frequencies (the number of observations that lie at or below a particular value in a data set), you must find the first value with a cumulative frequency greater than or equal to the median position. If the median position is exactly 0.5 more than the cumulative frequency of the previous value, then the median is the midpoint between the two values.

Imagine that your school baseball team scores the following number of home runs in 10 games:

4, 5, 8, 5, 7, 8, 9, 8, 8, 7

If you were to place the total home runs in a frequency table, what would the median be?

First, put the scores in ascending order:

4, 5, 5, 7, 7, 8, 8, 8, 8, 9

Then, make a table with two columns. Label the first column "Number of home runs" and then list the possible number of home runs the team could get. You can start from 0 and list up until the number 10, but since the team never scored less than 4 home runs, you may wish to start listing at the number 4.

Label the second column "Frequency." In this column, record the number of times 4 home runs were scored, 5 home runs were scored and so on. In this case, there was only one time that 4 home runs were scored, but two times that 5 home runs were scored. If you add all of the numbers in the Frequency column, they should equal 10 (for the 10 games played).

Number of home runs (x) | Frequency (f) |
---|---|
4 | 1 |
5 | 2 |
6 | 0 |
7 | 2 |
8 | 4 |
9 | 1 |

To find the median, again use the same formula:

**Median = {(n + 1) ÷ 2} ^{th} value**

= (10 + 1) ÷ 2

= 11 ÷ 2

= 5.5

The median is therefore the 5.5^{th} value.

To get the median, add up the numbers in the Frequency column until you get to 5 (and since the total number of games is 10, the remaining numbers in that column should also equal 5). You will reach 5 after adding all of the frequencies up to and including those for the 7 home runs. The next set of five will begin with the frequencies for 8 home runs. The median (the 5.5^{th} value) lies between the fifth value and the sixth value. Thus, the median lies between 7 home runs and 8 home runs.

If you calculate the average of these values (using the same formula used in Example 2), the result is 7.5.

**Average = (middle value before + middle value after) ÷ 2**

= (fifth value + sixth value) ÷ 2

= (7 + 8) ÷ 2

= 15 ÷ 2

= 7.5
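The same steps can be sketched in Python by walking through the cumulative frequencies. The function name `median_from_frequencies` and the dictionary layout (value mapped to frequency) are assumptions made for this illustration.

```python
import math

def median_from_frequencies(freq):
    """Median of an ungrouped frequency table given as {value: frequency}."""
    n = sum(freq.values())
    if n == 0:
        raise ValueError("the frequency table contains no observations")
    position = (n + 1) / 2                # e.g. 5.5 when n = 10

    def value_at(rank):
        # Return the first value whose cumulative frequency reaches the rank.
        cumulative = 0
        for value in sorted(freq):
            cumulative += freq[value]
            if cumulative >= rank:
                return value

    lower = value_at(math.floor(position))   # fifth value in the example
    upper = value_at(math.ceil(position))    # sixth value in the example
    return (lower + upper) / 2

home_runs = {4: 1, 5: 2, 6: 0, 7: 2, 8: 4, 9: 1}
print(median_from_frequencies(home_runs))    # 7.5
```

A value with a frequency of 0 (here, 6 home runs) is never returned, because the cumulative frequency does not increase when it is reached.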

Technically, the median should be a possible value of the variable. In the above example, the variable is discrete and always a whole number. Therefore, 7.5 is not a possible value: no one can hit 7 and a half home runs. Thus, this number only makes sense statistically. Some mathematicians may argue that 8 is a more appropriate median.

Sometimes it does not make sense to list each individual value, because the frequency distribution table would be long and cumbersome to work with. To simplify this, divide the range of data into intervals and then list the intervals in a frequency distribution table, including a column for the cumulative percentage. (For more information, refer to the Cumulative frequency section.)

The calculation to find the median is a little longer because the data have been grouped into intervals and, therefore, the individual values within each interval are no longer known. Some textbooks simply take the midpoint of the interval as the median. However, that method is an over-simplification of the true value. Use the following calculations to find the median for a grouped frequency distribution.

- Figure out which interval contains the median by using the **(n + 1) ÷ 2** formula. Take whatever value the calculation gives you and then add up the numbers in the frequency column until you come to that value (just like Example 3). For example, if your median is the 13.5^{th} value, add up the frequencies until you come to the 13^{th} and 14^{th} values. Whichever interval contains these values is called the median group.
- Find the cumulative percentage of the interval preceding the median group. Label this value **A**.
- Using this cumulative percentage, calculate how many numbers are needed in order to add up to 50% of the total cumulative percentage. This value will be labeled **B**. Use the following formula to calculate **B**:

  **B = 50 - A**

- Figure out the range (how many numbers the interval covers). Call this value **C**. Then, find the percentage for the median interval. Call this value **D**.
- Calculate how many data values you have to count in the median group to get 50% of the total data set by using the following formula. Call this value **E**.

  **E = (B ÷ D) x C**

- Find out what the median value is by adding the value for **E** to the lower value of the median interval:

  **Median = lower value + E**

  Since **E = (B ÷ D) x C**, this formula can also be written as:

  **Median = lower value + (B ÷ D) x C**

If the cumulative frequency for an interval is exactly 50%, then the median value would be the endpoint of this interval.

Let's make this clear with an example!

Using the same information from Example 4 in the Mean section, imagine that you surveyed 50 Grade 10 girls to find out how tall each one is in centimetres. After gathering all of your data, you created a frequency distribution table that looked like this:

Height (cm) | Frequency (f) | Endpoint (x) | Cumulative frequency | Percentage | Cumulative percentage |
---|---|---|---|---|---|
150 to < 155 | 4 | 155 | 4 | 8 | 8 |
155 to < 160 | 7 | 160 | 11 | 14 | 22 |
160 to < 165 | 18 | 165 | 29 | 36 | 58 |
165 to < 170 | 11 | 170 | 40 | 22 | 80 |
170 to < 175 | 6 | 175 | 46 | 12 | 92 |
175 to < 180 | 4 | 180 | 50 | 8 | 100 |

Using the grouped data, you created a cumulative frequency graph to accompany your table. The endpoints of the height intervals, the numbers for cumulative frequency and the numbers for cumulative percentage have been plotted on the graph.

By just looking at the graph, you can try to find the median value. The median is the height on the x-axis at which the curve reaches the midpoint (25) of the y-axis (Cumulative frequency). You will see that the median value is approximately 164 cm. Using mathematical calculations, you can find a more precise value of 163.9 cm. Here's how:

- According to the information provided in Table 2:

  **Median = {(n + 1) ÷ 2}^{th} value**

  = (50 + 1) ÷ 2

  = 51 ÷ 2

  = 25.5

  By adding up the frequencies, we find that the median (25.5) lies in the median group of the 160 to < 165 cm interval.
- The cumulative percentage of the preceding interval (**A**) is 22.
- The percentage needed in order to get 50% of the total cumulative percentage (**B**) is 28.

  **B = 50 - A**

  = 50 - 22

  = 28

- The range of the median interval (**C**) is 5 and the percentage for the median interval (**D**) is 36.
- The number of values to count down within the interval in order to get to 50% of the total data set is 3.9.

  **E = (B ÷ D) x C**

  = (28 ÷ 36) x 5

  **= 3.9**

- Since the lower value of the median interval is 160, when you add the value of **E** to that you get a median of 163.9 cm.

  **Median = lower value of median interval + (B ÷ D) x C**

  = 160 + (28 ÷ 36) x 5

  = 160 + 3.9

  **= 163.9 cm**
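The interpolation above can be sketched in Python. This is a sketch under stated assumptions: the function name `grouped_median` and the `(lower, upper, frequency)` tuple layout are hypothetical, and the median group is located as the first interval whose cumulative percentage reaches 50% rather than by counting to the {(n + 1) ÷ 2}^{th} value. The variables `a` to `e` mirror the labels A to E used in the steps.

```python
def grouped_median(intervals):
    """Estimate the median of grouped data by linear interpolation.

    intervals: list of (lower, upper, frequency) tuples in ascending order.
    Assumes values are spread uniformly within the median group.
    """
    n = sum(frequency for _, _, frequency in intervals)
    if n == 0:
        raise ValueError("the table contains no observations")
    cumulative_pct = 0.0
    for lower, upper, frequency in intervals:
        d = 100 * frequency / n           # D: percentage in this interval
        if d > 0 and cumulative_pct + d >= 50:
            a = cumulative_pct            # A: cumulative % before the group
            b = 50 - a                    # B: percentage still needed
            c = upper - lower             # C: range of the median interval
            e = (b / d) * c               # E: distance into the interval
            return lower + e
        cumulative_pct += d
    raise ValueError("no median group found")

heights = [
    (150, 155, 4), (155, 160, 7), (160, 165, 18),
    (165, 170, 11), (170, 175, 6), (175, 180, 4),
]
print(round(grouped_median(heights), 1))
```

If an interval's cumulative percentage is exactly 50%, then B equals D and the function returns the upper endpoint of that interval, which agrees with the rule stated before the example.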

Ordered stem and leaf plots make it simple to calculate the median, particularly if the cumulative frequencies have already been calculated. Consider the heights of 50 Grade 10 girls using a stem and leaf plot. (See the Organizing data chapter for more information on how to construct these tables.)

Stem* (cm) | Leaf | Cumulative frequency |
---|---|---|
15^{(0)} | 0 1 1 4 | 4 |
15^{(5)} | 5 6 7 7 8 8 8 | 11 |
16^{(0)} | 0 1 1 1 1 2 2 2 2 2 2 3 3 3 4 4 4 4 | 29 |
16^{(5)} | 5 5 5 5 6 6 6 7 7 8 9 | 40 |
17^{(0)} | 0 0 1 2 3 3 | 46 |
17^{(5)} | 6 6 7 8 | 50 |

***Note:** The stems have been split into smaller intervals. Stem 15^{(0)} means that all the data fall within the interval 150 to 154. Stem 15^{(5)} means that the data are in the interval of 155 to 159.

There are 50 pieces of data, so the median is the value of the 25.5^{th} observation.

**Median = {(n + 1) ÷ 2} ^{th} value**

= (50 + 1) ÷ 2

= 51 ÷ 2

= 25.5

Therefore, the median lies between the 25^{th} and 26^{th} values. To find out what these values are, count each value in the Leaf column until you have reached the 25^{th} and 26^{th} values. These values lie in the 16^{(0)} interval, meaning the 160 to 164 interval. The numbers in the leaf column represent the numbers in the interval (e.g., 3 represents 163). Thus, the median lies between 163 cm (25^{th} value) and 164 cm (26^{th} value). The median is found by averaging these two values.

**Average = (value before median + value after median) ÷ 2**

= (25^{th} value + 26^{th} value) ÷ 2

= (163 + 164) ÷ 2

= 327 ÷ 2

= 163.5

Since height is a continuous variable, 163.5 cm is an acceptable median value.
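As a quick check, the stem and leaf plot can be expanded back into the 50 individual heights and passed to Python's standard `statistics.median`. The list layout and variable names below are assumptions made for this sketch; each split stem is written as the tens value (15, 16 or 17) followed by its leaves.

```python
from statistics import median

# (stem, leaves) pairs copied from the plot; a stem of 15 with a leaf of 4
# stands for 154 cm.
stem_and_leaf = [
    (15, "0 1 1 4"),
    (15, "5 6 7 7 8 8 8"),
    (16, "0 1 1 1 1 2 2 2 2 2 2 3 3 3 4 4 4 4"),
    (16, "5 5 5 5 6 6 6 7 7 8 9"),
    (17, "0 0 1 2 3 3"),
    (17, "6 6 7 8"),
]

heights = [
    stem * 10 + int(leaf)
    for stem, leaves in stem_and_leaf
    for leaf in leaves.split()
]

print(len(heights))        # 50 observations
print(heights[24:26])      # 25th and 26th values (the list is 0-indexed)
print(median(heights))     # 163.5
```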

The median obtained from the cumulative frequency graph (164 cm) is not the same value as the median obtained from the calculation in Example 4 (163.9 cm) or from the stem and leaf plot (163.5 cm). This is because reading a value off a graph gives only an approximation of the median, unless the graph is drawn precisely with all of the information used.

The calculations in Example 4 are only approximations, since grouped data do not tell you how the 36% of the 50 girls found in the median interval are distributed within the interval. As a result, we make the assumption that they are uniformly distributed, and this may lead to a slightly different median. However, the stem and leaf plot is the most accurate means of obtaining a median because it uses all of the actual values.

It is possible for the mean and median of a distribution to have the same value. This is always the case if the distribution is symmetrical, as in a normal distribution. If the distribution is roughly symmetrical, then the two values will be close together.

In the example of the heights of the 50 Grade 10 girls, the mean (164.5 cm) is very close to the value of the median (163.5 cm). This is because the distribution is roughly symmetrical (see the stem and leaf plot in the above example).

However, one number can alter the mean without affecting the median.

Consider the following sets of data that represent the number of points scored by 3 players in 11 lacrosse games.

Eileen: 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3

Mean = 22 ÷ 11 = **2**

Median = **2**

Jeremy: 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 4

Mean = 23 ÷ 11 = **2.1**

Median = **2**

Randy: 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 14

Mean = 33 ÷ 11 = **3**

Median = **2**

The three sets of data above are identical except for the last observation values (3, 4 and 14).

The median does not change because it depends only on the value of the middle observation. The mean does change, however, because it depends on the value of every observation. So, in the above example, as the value of the last observation increases, so too does the mean.
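The comparison can be reproduced with Python's standard `statistics` module. The dictionary name `scores` is an assumption for this sketch; the data are the three lacrosse data sets listed above.

```python
from statistics import mean, median

scores = {
    "Eileen": [1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3],
    "Jeremy": [1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 4],
    "Randy":  [1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 14],
}

for player, points in scores.items():
    # Only the last observation differs, so only the mean moves.
    print(player, "mean:", round(mean(points), 1), "median:", median(points))
```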

In the third data set, the value of 14 is very different from any other values. When an observation is very different from all other observations in a data set, it is called an outlier. (For more information on outliers, see the Stem and leaf plots section.) The mean is the measure of central tendency most affected by outliers.

Outliers can sometimes occur as a result of error or deliberate misinformation. In these cases, the outliers should be excluded from the measure of central tendency. Other times, outliers just show how different one value is, and this can be a very useful piece of data.

When house prices are referred to in newspapers, generally the median price is quoted. Why is this measure used instead of the mean?

There are many moderately priced houses, but there are also some expensive ones and a few very expensive ones. The mean figure could be quite high as it includes the prices of the more expensive houses. But the median gives a more accurate and realistic value of the prices faced by most people.

In summary, the median is the central number and is good to use in skewed (or unbalanced) distributions because it is not affected by outliers.

Suppose you want to know how much money a family could afford to spend on housing. This would depend on the total family income.

For a family of five (two parents who work and three children with no income) the mean income of each family member is the total income divided by five (e.g., 60,000 ÷ 5 = 12,000). However, the median income would be zero, because more than half of the members of the family make nothing. In some situations, the mean can be much more informative than the median.
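A small sketch of this case follows. The text gives only the total income of 60,000, so the even split of 30,000 per parent is an assumption made purely for illustration.

```python
from statistics import mean, median

# Assumed split: two working parents at 30,000 each, three children at 0.
family_incomes = [30_000, 30_000, 0, 0, 0]

print(mean(family_incomes))     # total income divided by five
print(median(family_incomes))   # middle of the five ordered incomes
```

Any other split of the 60,000 between the two parents gives the same mean and median, since three of the five incomes are zero.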

If you want to find out whether a country is wealthy or not, you might consider using the median as your measure of central tendency instead of the mean.

The mean family income could be quite high if income is highly concentrated in a few very wealthy families (despite the fact that most families might earn essentially nothing). Thus, the median family income would be a more meaningful measure: at least half the families would earn the median income or less, and at least half would earn the median income or more.

Suppose you are applying for a job as an accountant at several large firms, and you want to get an idea of how much money you could expect to be making in five years if you join a particular firm. You may want to consider the salaries of accountants in each firm five years after they are hired.

One very high salary could make the mean salary higher; that might not reflect a typical salary within these firms. However, half the accountants make the median salary or less, and half make the median salary or more. So, the measure of central tendency that would give you a better idea of a typical salary would be the median.

By choosing a measure of central tendency favourable to your point of view, you can mislead people with statistics. In fact, this is commonly done.

Imagine you are the owner of a bakery that makes and sells individual birthday cakes and huge wedding cakes.

It might be in your interest to claim to your customers that the prices have been lowered, and to claim to your shareholders that you have raised the prices. Suppose that last year you sold 100,000 birthday cakes at $10 each, and 1,000 wedding cakes at $1,000 each. This year, you sold 100,000 birthday cakes at $8 each and 1,000 wedding cakes at $1,200 each.

- The median price of the 101,000 cakes sold last year is $10, because more than half of the items sold were birthday cakes. The median price of the 101,000 cakes sold this year is $8.
- The mean price of the 101,000 cakes sold last year is $19.80.

  **(100,000 x $10 + 1,000 x $1,000) ÷ 101,000 = $19.80**

- The mean price of the 101,000 cakes sold this year is also $19.80.

  **(100,000 x $8 + 1,000 x $1,200) ÷ 101,000 = $19.80**

The average price per cake sold is the same in both years, yet the median price fell from $10 to $8. Also, the total revenue and the number of cakes sold were the same. The idea is that you can make data appear to tell conflicting stories by choosing the appropriate measure of central tendency.
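The bakery figures can be checked with a short Python sketch. The helper `price_list` and the `{price: quantity}` layout are assumptions for this illustration; the helper simply expands each year's sales into one price per cake sold.

```python
from statistics import mean, median

def price_list(sales):
    """Expand {price: quantity sold} into a list with one price per cake."""
    return [price for price, quantity in sales.items() for _ in range(quantity)]

last_year = price_list({10: 100_000, 1_000: 1_000})
this_year = price_list({8: 100_000, 1_200: 1_000})

for label, prices in (("last year", last_year), ("this year", this_year)):
    print(label, "median:", median(prices), "mean:", round(mean(prices), 2))
```

Expanding to 101,000 prices is wasteful but keeps the sketch close to the definition of each measure; a weighted calculation would give the same results.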

It is important to note that you do not have to use only one measure of central tendency. The mean and median can both be used, thus providing more information about the data.