Predict and investigate
Types of data
Types of graphs
Measures of central tendency: Mean, median and mode
As discussed in the introduction, a statistical investigation sometimes starts with a student asking a question that leads to the gathering of the necessary primary data. At other times, a secondary set of data is available that can lead the student to pose interesting questions.
In either case, students make better sense of the data if they are led to think about possible answers before, during and after the data collection. Students must learn to predict and investigate.
Once a question has been posed, students must begin to think about how they can find the answer to it. A good starting place is to ask students to predict or hypothesize what they think the answer will be, and then see how that leads to the investigation of the problem.
Let's consider the following research question: Does the amount of time spent playing computer games affect school grades?
Students may initially respond 'no' or 'yes' to this question. By asking them to explain their answers, you can encourage them to think about the conditions they would attach to their answer.
For example, they may suggest that the answer depends on the age or gender of students. Perhaps they'll think the number of hours spent on homework has an effect, as well as the time spent playing computer games. These ideas, in turn, may suggest the need to include questions about homework as well as computer games in their survey. Examining their initial predictions will lead them to the steps of investigation.
Additionally, when you ask students to predict an answer before they investigate the question, you help them to notice and correct any misconceptions as they are collecting their data.
Particular questions produce particular types of data, which in turn lend themselves to particular types of graphs.
There are two main types of data: categorical and numeric.
The question 'What colour is your hair?' produces categorical data, which fit into categories 'brown,' 'blonde,' 'black,' 'red' or 'other.' Categorical data can be broken down into nominal and ordinal sub-types.
See the table below for each categorical sub-type and its associated graph types.
| Types of data | Sub-types | Examples from Census at School database | Appropriate graphs |
|---|---|---|---|
| Categorical: Data fit into various categories of responses to a question. | Nominal: These data are identified by particular names or categories. These data cannot be organized according to any 'natural' order. | Gender: male, female | Bar graph, circle graph, pictograph |
| | | Favourite subject: math, history, gym, music, etc. | |
| | | Eye colour: brown, blue, green, other | |
| | | Pets: cats, dogs, birds, fish, etc. | |
| | Ordinal: These data are identified by categories that can be placed in a specific order or ordered in some 'natural way.' | Schoolwork pressure: none, very little, some, a lot | Bar graph, circle graph, pictograph |
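
Students who have access to a computer can tally categorical responses in a few lines of code. The sketch below is only an illustration: the responses are made up, not drawn from the Census at School database, and the names `tally_nominal`, `tally_ordinal` and `PRESSURE_ORDER` are hypothetical. It shows the practical difference between the two sub-types: nominal categories can be listed in any order (here, most frequent first), while ordinal categories keep their natural order.

```python
from collections import Counter

# Hypothetical responses, invented for illustration only.
eye_colour = ["brown", "blue", "brown", "green", "brown", "blue", "other", "brown"]
pressure = ["some", "a lot", "none", "some", "very little", "some", "a lot"]

# The natural order of the ordinal categories.
PRESSURE_ORDER = ["none", "very little", "some", "a lot"]

def tally_nominal(responses):
    """Return (category, count) pairs, most frequent first."""
    return Counter(responses).most_common()

def tally_ordinal(responses, order):
    """Return (category, count) pairs in the categories' natural order."""
    counts = Counter(responses)
    return [(category, counts[category]) for category in order]

# Print a simple text bar for each category.
for category, count in tally_ordinal(pressure, PRESSURE_ORDER):
    print(f"{category:<12} {'#' * count} ({count})")
```

Each `#` stands for one response, so the printed rows already form a rough horizontal bar graph.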
The question, 'How many people live in your home?' produces numeric data, which can be broken down into discrete and continuous sub-types.
See the table below for each numeric sub-type and its associated graph types.
| Types of data | Sub-types | Examples from Census at School database | Appropriate graphs |
|---|---|---|---|
| Numeric: Data are represented by real numbers. Also known as quantitative data. | Discrete: Data that can only assume a finite number of different responses. For example, the numbers of people in a household are discrete data because you can only answer using whole numbers from 1 to 10 or more. You cannot include all the decimals or fractions in between as possible answers. For example, it's impossible to have 2.5 or 3.75 people. | Age in years: 7, 8, 9, 10, 11, etc. | Bar graph, line graph, circle graph, histogram |
| | | Number of people in the household: 1, 2, 3, 4, 5, etc. | |
| | | Number of days during which you did an intense physical activity last week: 0, 1, 2, 3, 4, 5, 6, 7 | |
| | | Note: Sometimes numbers can represent scales of response (e.g., 0=none, 1=very little, 2=some, etc.). In this case, the responses are considered ordinal categorical data, not numeric data, even though they are represented by a number. | |
| | Continuous: Data that can assume an infinite number of different responses. The answers have infinite possibilities since they can include decimal responses. For example, a student's height may be 1.57923 metres. | Height, arm span, wrist circumference: It's impossible to list all the possibilities. Note: In the Census at School survey, students are required to round their answers to the nearest centimetre or millimetre, so in effect their responses are discrete data. | Line graph, histogram |

Notes: To make continuous data easier to handle, they are often grouped into class intervals. Grouping data is part of the process of organizing data so that the information becomes useful. For example, instead of displaying every height measured in a class of students, it is more effective to display grouped categories such as 120 to 129 cm, 130 to 139 cm, 140 to 149 cm, etc. Discrete data may be grouped or ungrouped. Grouping data makes them easier to handle, but with a small number of responses, it can be just as clear to leave them ungrouped.
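
The grouping into class intervals described in the notes above can be sketched in code. This is a minimal example under stated assumptions: the heights are invented, they are already rounded to the nearest centimetre, and the function name `class_interval` and the 10 cm interval width are illustrative choices rather than anything prescribed by the survey.

```python
from collections import Counter

# Hypothetical heights in centimetres, rounded to whole numbers.
heights_cm = [123, 128, 131, 135, 136, 139, 140, 142, 147, 149, 151]

def class_interval(value, width=10):
    """Return the (lower, upper) bounds of the class interval containing value."""
    lower = (value // width) * width
    return lower, lower + width - 1

# Count how many heights fall into each interval.
grouped = Counter(class_interval(h) for h in heights_cm)

for (lower, upper), count in sorted(grouped.items()):
    print(f"{lower} to {upper} cm: {count}")
```

With only eleven values the ungrouped list is still readable, but the grouped counts make the shape of the data visible at a glance.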
Bar graphs can present either categorical or numeric data. Numeric data are either ungrouped (if they include few numbers) or grouped into class intervals.
Bar graphs consist of an axis with labelled horizontal or vertical bars. Those with vertical bars are also called column graphs. The bars depict the frequencies of different responses. The numbers on the x-axis of a horizontal bar graph or the y-axis of a vertical bar graph are called the scale.
In a bar graph, each category or value is represented by a vertical or horizontal bar. The height or length of the bar represents the number of units or observations in that category (i.e., its frequency).
Three-dimensional bar graphs should be avoided because the added depth dimension makes it more difficult to read the data accurately.


Line graphs compare two variables: one is plotted along the x-axis (horizontal) and the other along the y-axis (vertical). The graph shows how the variables are related or vary with each other by drawing a continuous line between all the points.

Line graphs are also used to reveal trends over time. While bar graphs reveal a change in magnitude, line graphs show a change in direction. Line graphs are popular for showing data over time because they reveal data trends clearly and are easy to create.
When a line graph is showing a trend over time, the y-axis usually indicates quantity (e.g., dollars, litres) or percentage, while the horizontal x-axis measures units of time.

A circle graph or pie chart is a way of summarizing a set of categorical data or displaying the different values of a given variable (e.g., percentage distribution). This type of graph is a circle divided into segments, with each segment representing a particular category and its proportion of the total. The area of each segment is the same proportion of a circle as the category is of the total data set.
Circle graphs are best used when there are few categories – ideally no more than six – otherwise, the resulting picture will be too complex to understand. Never use a three-dimensional pie chart, even when it's available as a graph option in spreadsheet software. The 3-D image is misleading because the surface area of some segments can appear larger than the actual proportions they represent.
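Because each segment takes up the same share of the circle as its category does of the data set, the central angle of a segment is its frequency divided by the total, multiplied by 360 degrees. The following sketch works this out for a made-up set of pet counts; the function name `segment_angles` and the numbers are assumptions for illustration.

```python
def segment_angles(frequencies):
    """Map each category to the central angle (in degrees) of its segment."""
    total = sum(frequencies.values())
    return {category: 360 * count / total for category, count in frequencies.items()}

# Hypothetical counts for 30 students.
pets = {"cats": 9, "dogs": 12, "birds": 3, "fish": 6}

for category, angle in segment_angles(pets).items():
    print(f"{category}: {angle:.0f} degrees")
```

The angles always add up to 360 degrees, which gives students a quick check on their arithmetic before they draw the graph.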

A pictograph uses picture symbols to convey the meaning of categorical data. It is similar to a bar graph in that each horizontal or vertical row represents the frequency or number of responses in each category. Pictographs should be used carefully because the pictures may, either accidentally or deliberately, misrepresent the data.
For example, the cookie image in the pictograph below represents two students and the half-cookie image represents one student. Other types of pictographs may use an image that grows larger or smaller to represent changes in data. In such cases, care must be taken to ensure the size or area (total surface) of the picture is proportional to the change it is representing.
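When a single picture is resized to show a change, a common mistake is to scale its width and height by the change in the data, which makes the area grow far too quickly. To keep the area proportional, the width and height should each be multiplied by the square root of the ratio. The sketch below illustrates this; the function name `linear_scale` is hypothetical.

```python
import math

def linear_scale(old_value, new_value):
    """Factor to apply to a picture's width and height so that its
    area changes in proportion to the data."""
    return math.sqrt(new_value / old_value)

# If the data double, each side should grow by about 1.41, not by 2.
factor = linear_scale(10, 20)
print(f"Scale each side by {factor:.2f}; the area is then {factor ** 2:.2f} times larger.")
```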

A histogram is used to summarize either discrete or continuous numerical data that are measured on an interval scale. It is often used to illustrate major features of the distribution of the data. A histogram divides the range of possible values into classes or groups. For each group, a rectangle is constructed with a base length equal to the range of values in that specific group and an area proportional to the number of observations or frequency of that group. This means that the rectangles will be drawn of non-uniform height. A histogram has an appearance similar to a vertical bar graph, but when the variables are continuous, there are no gaps between the bars. When the variables are discrete, however, gaps should be left between the bars.
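Since the area of each rectangle, not its height, is proportional to the frequency, the height of a rectangle is its frequency divided by its class width. This matters only when the classes have unequal widths. A minimal sketch with invented class intervals follows; `bar_heights` is an illustrative name, and each class is assumed to be given as (lower boundary, upper boundary, frequency).

```python
# Hypothetical classes with unequal widths: (lower, upper, frequency).
classes = [(120, 130, 4), (130, 140, 10), (140, 160, 8)]

def bar_heights(classes):
    """Return the height of each rectangle so that height x width = frequency."""
    return [frequency / (upper - lower) for lower, upper, frequency in classes]

for (lower, upper, frequency), height in zip(classes, bar_heights(classes)):
    print(f"{lower} to {upper}: frequency {frequency}, height {height:.1f}")
```

In this example the last class holds twice as many observations as the first, but it is also twice as wide, so the two rectangles have the same height.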


Scatter graphs are used to show a relationship between two variables by means of ordered pairs plotted on a coordinate grid. The data points are not joined; the resulting pattern indicates the type and strength of the relationship between the variables. A line of best fit can be drawn between the points when a relationship exists. Scatter graphs can illustrate data correlation, positive or negative relationships between variables, non-linear patterns, spread of data and outliers.
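One common way to place a line of best fit is the least-squares method, and the strength of a linear relationship is often summarized by the correlation coefficient r. The text above does not prescribe either technique, so treat the following as one possible sketch. The function name `best_fit_and_correlation` and the data points are assumptions made for illustration.

```python
import math

def best_fit_and_correlation(xs, ys):
    """Return (slope, intercept, r) for the least-squares line y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of squared deviations and of cross-products.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / math.sqrt(sxx * syy)
    return slope, intercept, r

# Hypothetical ordered pairs.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
slope, intercept, r = best_fit_and_correlation(xs, ys)
print(f"y = {slope:.1f}x + {intercept:.1f}, r = {r:.2f}")
```

A value of r near 1 or -1 indicates a strong positive or negative linear relationship, while a value near 0 indicates little or no linear relationship. The function assumes the x-values and the y-values are not all identical.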

The mean, median and mode can help you capture, with a single number, what is typical of a set of data. For example, a typical Grade 8 class can be composed of 12- to 15-year-olds. However, if we find more 13-year-olds than any other age group, we can use the modal age, 13, to represent the age of Grade 8 students in that particular class. Depending on the situation, the mean, the median or the mode may give the best description of a particular set of data.
The mean is the average value in a data set. It is calculated by adding all the data and dividing the sum by the total number of data items in the set.
The median is the middle value in a data set that has been arranged in numerical order – half the data are above the median and half are below it. You must first arrange the data in ascending or descending order to determine the middle value. If there is an even number of values, you must average the two middle values to find the median.
The mode is the value that occurs most frequently in the set. When two values tie for the highest frequency, the data are bimodal.
In a symmetric distribution, such as the normal distribution, the mean, median and mode are identical in value. For example, the following small data set is symmetric about 14:
Data set: 14, 14, 13, 15, 15, 14, 13, 14, 13, 15
mean: (14 + 14 + 13 + 15 + 15 + 14 + 13 + 14 + 13 + 15) / 10 = 140 / 10 = 14
median: the two middle values of the ordered data (13, 13, 13, 14, 14, 14, 14, 15, 15, 15) are both 14, so the median is 14
mode: the most frequent number is 14 (it occurs four times)
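Students can check these three values with Python's built-in `statistics` module. The sketch below uses the data set above; `statistics.multimode` (available from Python 3.8) returns every value tied for most frequent, so it also shows when a data set is bimodal. The second, bimodal list is made up for illustration.

```python
import statistics

data = [14, 14, 13, 15, 15, 14, 13, 14, 13, 15]

mean = statistics.mean(data)        # sum of the values divided by how many there are
median = statistics.median(data)    # averages the two middle values when the count is even
modes = statistics.multimode(data)  # list of every value tied for most frequent

print(f"mean = {mean}, median = {median}, mode(s) = {modes}")

# A hypothetical bimodal data set: 12 and 13 each occur twice.
bimodal = statistics.multimode([12, 12, 13, 13, 14])
print(f"modes of the second set = {bimodal}")
```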