Presentation of raw data
Excursion analysis
Correlation coefficient analysis
The principal component analysis
In this chapter, we begin by presenting the raw data used for the study: a definition of the four sets of data, some descriptive statistics, an analysis of compliance with guidelines for each parameter, and statistical charts. Next, we analyse the excursions for each parameter through descriptive statistics and graphs. We then present an analysis of the correlation coefficients between parameters considered in pairs, for both the raw values and the excursions, in order to determine whether any of them are strongly correlated. Finally, we perform a Principal Components Analysis on the raw data to determine whether each parameter is needed in the index calculation.
We will start by defining the sets of data we used to perform this study. This is followed by a descriptive analysis, an analysis of compliance with guidelines and graphic representation of certain parameters.
Below are the datasets made available to us by certain provinces. Please note that they contain different or additional parameters than those used for the WQI calculation in the Canadian Environmental Sustainability Indicators, 2005.
For each set of data, a table presents descriptive statistics on the raw data for each of the parameters.
Discussion: A closer look at these tables shows that the mean is higher than the median for many parameters. Most of the parameters follow a log-normal distribution, so the mean is pulled upwards by the extreme values and the median is a better indication of the central tendency. Moreover, certain parameters have very high variability, which may be attributable to relatively high values; such values should increase the value of the third term of the index.
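The effect described above can be illustrated with a minimal sketch. The sample below is simulated, not taken from the provincial datasets, and the function name `mean_and_median` is ours; the sketch only shows that for right-skewed (log-normal) values the mean sits above the median.

```python
import math
import random
import statistics


def mean_and_median(values):
    """Return (mean, median) of a sequence of numbers."""
    return statistics.fmean(values), statistics.median(values)


# Simulated right-skewed sample (an assumption for illustration only):
# exponentiating normal draws yields log-normal values.
rng = random.Random(0)
sample = [math.exp(rng.gauss(0.0, 1.0)) for _ in range(10_000)]

mean, median = mean_and_median(sample)
print(f"mean={mean:.2f} median={median:.2f}")  # the mean is the larger of the two
```

A few extreme values are enough to produce the same pattern in a small sample, for example `mean_and_median([1, 2, 3, 100])`.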
This subsection presents the percentage of values that are compliant with the guidelines for each dataset. This analysis is meant to determine the parameters for which it is difficult to comply with the guidelines. The appendix includes guidelines for the protection of aquatic life used for each set of data to evaluate water quality.
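As a rough sketch of how such a compliance percentage can be computed, assume each guideline is expressed as an optional lower bound, an optional upper bound, or both. The function name, the readings and the 6.5 to 9.0 range below are hypothetical and are not the guidelines listed in the appendix.

```python
def percent_compliant(values, lower=None, upper=None):
    """Percentage of values inside [lower, upper]; a bound of None is not checked."""
    if not values:
        raise ValueError("values must not be empty")
    compliant = sum(
        1
        for v in values
        if (lower is None or v >= lower) and (upper is None or v <= upper)
    )
    return 100.0 * compliant / len(values)


# Hypothetical pH readings checked against an assumed 6.5 to 9.0 range.
ph_readings = [6.4, 7.1, 7.8, 8.2, 9.3]
print(percent_compliant(ph_readings, lower=6.5, upper=9.0))  # 60.0
```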
Observations: 73.6% of samples comply with the guidelines for the Newfoundland and Labrador dataset. The aluminum (10.7%) and lead (27.5%) parameters rarely comply with guidelines, while the nickel parameter is always compliant.
Observations: 86.7% of samples comply with guidelines for the Ontario dataset. The phosphorus (49.6%) and cadmium (50.8%) parameters are rarely compliant, while the nickel (99.9%) and pH (99.9%) parameters are almost always compliant.
Observations: 87% of samples comply with guidelines for the British Columbia dataset. Only the cadmium parameter (8%) has a very low compliance rate.
Observations: 87.1% of samples comply with guidelines for the Quebec dataset. The phosphorus parameter (60.4%) is more often non-compliant than other parameters.
Discussion: It is interesting to note that the compliance percentage varies from one parameter to another within the same set of data and also between sets of data. Also, for some parameters it is more difficult to respect the guidelines used by water quality experts than for other parameters, either because very high values were recorded during the sampling, or because the guidelines are difficult to comply with. Thus, the lower the parameter compliance percentage, the lower the index value.
Graphic representation in statistical analysis is very important because it gives us a direct, visual idea of the distribution of a variable. In our study, we present only the parameters that have a low compliance rate for each of the sets of data.
Each graph has its own scale. It ranges from the minimal value to the maximal value.
This section discusses excursions. These excursions enable us to measure the significance of variations in our observations as compared with their guidelines. By studying these excursions, we can determine the parameters that significantly exceed the guidelines. Please note that a zero value for the excursion means that the result complies with guidelines.
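The text does not spell out the excursion formula. The sketch below assumes the definition commonly used with the CCME Water Quality Index: the relative amount by which a value fails its guideline, and zero when the value complies. The function name and the `kind` argument are ours.

```python
def excursion(value, guideline, kind="max"):
    """Relative amount by which a value fails its guideline; 0.0 if it complies.

    kind="max": the value must not exceed the guideline.
    kind="min": the value must not fall below the guideline.
    The guideline is assumed to be strictly positive.
    """
    if kind == "max":
        return max(value / guideline - 1.0, 0.0)
    if kind == "min":
        if value <= 0:
            raise ValueError("value must be positive for a minimum guideline")
        return max(guideline / value - 1.0, 0.0)
    raise ValueError("kind must be 'max' or 'min'")


print(excursion(15.0, 5.0))             # 2.0: three times the guideline
print(excursion(4.0, 5.0))              # 0.0: compliant
print(excursion(4.0, 6.0, kind="min"))  # 0.5: below a minimum guideline
```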
We will begin by presenting descriptive statistics of the excursion for each parameter, followed by a few charts showing the distribution of the excursion for parameters that have a low compliance rate.
Discussion: These tables and graphs show that certain parameters have a high variability caused by a large fluctuation of the observed excursions. This high variability can be the result of two factors: a very high value recorded during the water sampling, or a very strict guideline that is difficult to comply with. The graphs also indicate that small excursions are more frequent than large ones, which means that the value of the third term of the index will not be affected by large excursions alone.
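To see why small excursions also matter, here is a minimal sketch of the third term. It assumes the CCME Water Quality Index form, in which the excursions are summed, divided by the total number of tests, and rescaled to a 0 to 100 range; the function name is ours and the excursion values are invented.

```python
def amplitude_term(excursions):
    """Third term of the index from one excursion per test (0.0 when compliant)."""
    if not excursions:
        raise ValueError("excursions must not be empty")
    # Normalized sum of excursions: total excursion divided by number of tests.
    nse = sum(excursions) / len(excursions)
    return nse / (0.01 * nse + 0.01)


one_large = [2.0] + [0.0] * 9   # a single large excursion in ten tests
many_small = [0.2] * 10         # ten small excursions with the same total
print(round(amplitude_term(one_large), 2), round(amplitude_term(many_small), 2))
```

Under this assumption the term depends only on the total of the excursions, so frequent small excursions can weigh as much as one large excursion.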
The purpose of this section is to look for strong correlations between pairs of parameters, since it may be useful to keep only one parameter of a strongly correlated pair in order to reduce collinearity between the parameters that enter into the index calculation. For example, if we assume that the two parameters in question always comply with the guidelines, the value of the first term of the index would be smaller since its denominator would be larger; given the strong correlation between these two parameters, keeping both would therefore give a higher index value.
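The denominator effect in this example can be sketched as follows, assuming the first term takes the CCME Water Quality Index form: the percentage of parameters that fail their guideline at least once. The function name and the counts are hypothetical.

```python
def scope_term(n_failed_parameters, n_parameters):
    """First term of the index: percentage of parameters that fail at least once."""
    if n_parameters <= 0:
        raise ValueError("n_parameters must be positive")
    return 100.0 * n_failed_parameters / n_parameters


# Two failing parameters; the second call counts one extra parameter that
# always complies (for example, the second member of a correlated pair).
print(scope_term(2, 9))
print(scope_term(2, 10))  # 20.0: smaller first term, hence a higher index
```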
We will present two types of correlation coefficients: the Pearson coefficient and the Spearman coefficient. The Pearson coefficient is calculated using raw data, while the observation rank is used for the calculation of Spearman's coefficient. Normally, the Pearson coefficient is used when observations follow a normal distribution, while Spearman's coefficient is used when this condition is not met. Both coefficients take values from -1 to 1. If two parameters increase together, we consider the correlation positive, whereas if one parameter decreases when another one increases, the correlation between them is negative. The closer the correlation value is to 1 or -1, the stronger the relationship between the two parameters. These correlation coefficients will be analyzed using raw data and excursions.
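The two coefficients can be sketched in plain Python as follows. This is an illustration of the definitions above, not the software used for the study; the function names and sample values are ours. Spearman's coefficient is obtained by applying the Pearson formula to the ranks, with tied values sharing their average rank.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def ranks(values):
    """1-based ranks; tied values receive the mean of the positions they occupy."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman coefficient: Pearson correlation computed on the ranks."""
    return pearson(ranks(x), ranks(y))


# Invented values: y increases with x, but not linearly.
x = [1, 2, 3, 4, 5]
y = [1, 4, 9, 16, 25]
print(round(pearson(x, y), 3), round(spearman(x, y), 3))
```

Because the relationship in this example is monotonic but not linear, the Spearman coefficient is higher than the Pearson coefficient. The same pattern may partly explain why Spearman's method often finds more correlations above 0.50 in skewed data.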
Observations: Correlations between the parameters are not very high. Only five correlations (two Pearson and three Spearman) have a value higher than 0.50. Only one correlation recurs in both methods – the one between the aluminum and iron parameters (0.66 and 0.64). Therefore, a relation seems to exist between these parameters. Moreover, a relatively strong relation seems to exist between chloride and zinc (0.76).
Observations: The Spearman method generates ten correlations above 0.50, compared to none when we used Pearson's method. There seems to be a relatively strong relation between nitrite and nitrate (0.78), and between suspended solids (SS) and phosphorus (0.70). Moreover, copper seems to be the most strongly correlated parameter, with four correlations above 0.50.
Observations: In all, this set of data contains 13 correlations above 0.50, including 8 generated by Spearman's method. The 5 Pearson's correlations with a value higher than 0.50 are also above 0.50 when the Spearman's method is used. Therefore, a relation seems to exist between the following parameters: cadmium / lead, copper/ phosphorus, phosphorus/ lead, phosphorus/ zinc and lead / zinc.
Observations: There are three correlations above 0.50 using Spearman's method and none with Pearson's method. A relatively strong correlation seems to exist between turbidity and phosphorus (0.75).
Observations: Spearman's method produces five correlations above 0.50 and the Pearson method, none. A relatively strong relation seems to exist between the pH and aluminum (0.77).
Observation: Only one correlation is higher than 0.50: the one between suspended solids (SS) and phosphorus (0.53).
Observation: There is no correlation higher than 0.50 in this dataset.
Observation: Only one correlation is higher than 0.50: the one between turbidity and phosphorus (0.57).
Discussion: Generally, this analysis indicates that there appears to be a relationship between certain parameters. However, these correlations are not very strong (there is no correlation higher than 0.80). In fact, only five correlations have values between 0.70 and 0.78. Among these five correlations, two come from the Ontario dataset (phosphorus - SS and nitrate - nitrite). This province's water quality expert uses seven parameters (the nitrite and SS parameters are not used) for the index calculation, as opposed to this study, which uses the maximum number of parameters available to study index behaviour. It is also interesting to note that water quality experts are careful not to include strongly correlated parameters in the index calculation, thus reducing the potential for collinearity.
The Principal Components Analysis (PCA) is a descriptive technique enabling the study of relationships between the parameters. It seeks to reduce the number of parameters in our dataset to several large dimensions (factors), by looking for a solution of the overall variance of the measured parameters. The PCA will enable us to determine whether each parameter is important in the index calculation by checking whether all parameters are represented by at least one factor. Alternatively, if each factor is represented by only one parameter, these parameters together can represent all data.
Before performing the PCA, we must make sure that the parameters used for the analysis follow a normal distribution. Given that water quality experts have estimated that the vast majority of variables follow a log-normal distribution, we will use the transformation ln(parameter + 1) to obtain parameters that approximately follow a normal distribution. The term + 1 is used in the transformation because the natural logarithm of 0 is undefined.
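A minimal sketch of this transformation is shown below. The function name and the sample values are ours; `math.log1p(x)` computes ln(x + 1) and stays accurate for values close to zero.

```python
import math


def log_transform(values):
    """Apply ln(x + 1) to each value; values are assumed to be non-negative."""
    if any(v < 0 for v in values):
        raise ValueError("values must be non-negative")
    return [math.log1p(v) for v in values]


# Invented concentrations; a zero measurement maps to 0.0 rather than failing.
print(log_transform([0.0, 0.5, 12.0]))
```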
After performing the PCA, we must examine the Kaiser-Meyer-Olkin (KMO) measure used as an adequacy indicator of the factor solution that specifies the extent to which the chosen set of parameters constitutes a coherent whole. A high KMO indicates the existence of a statistically acceptable factor solution representing relations between the parameters. A KMO value below 0.5 is considered unacceptable, while a value of 0.7 is considered relatively acceptable. KMO reflects the relation of correlations between the parameters to partial correlations. These partial correlations reflect the uniqueness of each parameter's contribution.
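The overall KMO measure can be sketched from a correlation matrix as the sum of squared correlations divided by the sum of squared correlations plus squared partial correlations, taken over all pairs of distinct parameters. The partial correlations are obtained here from the inverse of the correlation matrix. This is a plain-Python illustration with names of our own choosing; a numerical library would normally handle the matrix inversion, and the example matrix is invented.

```python
import math


def invert(matrix):
    """Invert a square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(matrix)
    aug = [
        list(row) + [1.0 if i == j else 0.0 for j in range(n)]
        for i, row in enumerate(matrix)
    ]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [a - f * b for a, b in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def kmo(corr):
    """Overall Kaiser-Meyer-Olkin measure for a correlation matrix."""
    inv = invert(corr)
    n = len(corr)
    sum_r2 = sum_p2 = 0.0
    for i in range(n):
        for j in range(n):
            if i != j:
                # Partial correlation of i and j given all other parameters.
                partial = -inv[i][j] / math.sqrt(inv[i][i] * inv[j][j])
                sum_r2 += corr[i][j] ** 2
                sum_p2 += partial ** 2
    if sum_r2 + sum_p2 == 0.0:
        raise ValueError("KMO is undefined when all correlations are zero")
    return sum_r2 / (sum_r2 + sum_p2)


# Invented example: four parameters that all correlate at 0.8 with each other.
corr = [[1.0 if i == j else 0.8 for j in range(4)] for i in range(4)]
print(round(kmo(corr), 2))  # 0.87
```

When the parameters share a lot of common variance, the partial correlations are small relative to the correlations and the KMO approaches 1.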
To conduct this analysis, we will determine the appropriate number of eigenvalues to be chosen and study the behaviour of each of the parameters on the various factor axes. We will keep all factors with an eigenvalue above 1. To determine the axes, we will use orthogonal rotation, assuming that the factors are independent of each other in order to better illustrate the findings. Moreover, to associate a parameter to a factor, a saturation (correlation between the parameter and the factor it is associated with) of at least 0.40 is considered necessary (Guadagnoli and Velicer, 1988).
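The two decision rules above (eigenvalue above 1, saturation of at least 0.40) can be sketched as follows. The sketch assumes the eigenvalues and the rotated saturations have already been produced by statistical software; it does not perform the decomposition or the rotation. The function names, eigenvalues and saturations are invented for illustration.

```python
def retained_factors(eigenvalues, threshold=1.0):
    """Indices of the factors whose eigenvalue is above the threshold."""
    return [i for i, value in enumerate(eigenvalues) if value > threshold]


def explained_variance_percent(eigenvalues, kept):
    """Percentage of the total variance carried by the kept factors."""
    return 100.0 * sum(eigenvalues[i] for i in kept) / sum(eigenvalues)


def assign_parameters(saturations, minimum=0.40):
    """Map each parameter to the factors on which |saturation| >= minimum."""
    return {
        name: [k for k, value in enumerate(row) if abs(value) >= minimum]
        for name, row in saturations.items()
    }


# Invented eigenvalues for five parameters (they sum to 5, as they would
# for a correlation matrix of five parameters).
eigenvalues = [2.0, 1.3, 0.9, 0.5, 0.3]
kept = retained_factors(eigenvalues)
print(kept, round(explained_variance_percent(eigenvalues, kept), 1))

# Invented saturations on the two kept factors.
saturations = {"pH": [0.55, 0.62], "lead": [0.10, 0.25], "iron": [0.81, 0.05]}
print(assign_parameters(saturations))
```

In this made-up example, pH is associated with two factors, iron with one, and lead with none, which is the kind of pattern the observations in the following subsections describe.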
For the Newfoundland and Labrador dataset, we will keep the first three factors, which have an eigenvalue above 1. These three factors represent 62.29% of the explained data variance with a KMO of 0.68, which means that a relatively acceptable factor solution exists.
Table 23
Representation of parameters along three factorial axes for the Newfoundland and Labrador dataset
Observations: Each parameter is represented by one factor (saturation coefficient higher than 0.40), except for pH, which is represented by two factors. Also, the first factor is represented by the following parameters: chloride, copper, lead, phosphorus and zinc while the parameters aluminum, iron, nickel and pH represent the second factor. Finally, the third factor is represented by the dissolved oxygen and pH parameters. Given that each parameter is represented at least by one factor, we can state that all the parameters have a relatively high importance in the index calculation. This means that there is no smaller group of parameters that could represent the entire dataset.
For the Ontario dataset, we performed a Principal Components Analysis for all parameters available to us, as well as those used by the Ontario water quality expert for the WQI calculation.
For the entire dataset, we use the first four factors, which represent 56.9% of the explained data variance with a KMO of 0.79. For the partial dataset, we keep the first two factors, which account for 53.78% of the data variance with a KMO of 0.73.
Observations: For the complete set of data, it seems that the importance of the lead parameter is relatively low in the index calculation, since it is not associated with any factor, while all of the other parameters are associated with a single factor. For the partial dataset, each parameter is represented by one factor except for phosphorus, which is represented by two factors. In both cases, since each parameter is represented by at least one factor except for the lead parameter in the complete dataset, we can state that each parameter (except for the lead parameter) has a relatively high importance in the index calculation.
For the British Columbia dataset, the first three factors account for 62.76% of the data variance with a KMO of 0.77.
Observation: Each parameter is represented by at least one factor, which means that all of them have a relatively high importance in the index calculation.
For the complete Quebec dataset, the first two factors have an eigenvalue higher than 1, representing 60.82% of the explained data variance with a KMO of 0.74.
Observation: Since each parameter is represented by at least one factor, we can state that each parameter has a relatively high importance in the index calculation.
Discussion: The principal components analysis shows that each of the parameters used by the water quality experts, except for the lead parameter in the Ontario dataset, is important in the index calculation. The correlation analysis, by comparison, allowed us to determine whether there is redundancy among the parameters.