Examples will be presented of the use and usefulness of multielement data, and multivariate statistics, in a variety of exploration situations in Africa and the Americas.
An Untapped Resource
Reliable multielement analyses are available in North America for as little as $6 per sample and are being requested increasingly for routine geochemical samples, but many geologists who use geochemical methods in exploration still tend to eschew any methodology that mentions the word "multivariate". Consequently, multielement data sets tend to be underutilized.
Whether or not they are subjected to sophisticated statistical treatment, multielement analyses constitute a valuable resource of information. The elements offered in common analytical packages include pathfinders for gold mineralization (Sb, As, Au, Mo, Pb, Ag, W); pathfinders for base-metal sulphide mineralization (Ba, Cd, Cu, Pb, Ag, Zn); indicators of felsic rocks (Ba, Be, K); indicators of mafic and ultramafic rocks (Cr, Co, Ni, Ti, V); and indicators of calcareous rocks (Ba, Ca, Mg, Sr). It is important to realize, however (and most reputable labs point this out to their customers), that the digestion methods used routinely in wet-chemical analyses dissolve many of these elements only partially.
The truth of the expression "garbage in, garbage out" is nowhere more vividly demonstrated than in data analysis. A significant proportion of the time expended on any interpretation exercise is (or should be) taken up with checking the integrity of the data set. Typically, a file of analyses may contain missing values recorded as zeros; "undetectable" values expressed as zeros, inequalities or negative numbers; displaced or mixed columns of analyses and coordinates; and quality-assurance data that have not been fully evaluated. Furthermore, for any conclusions derived from them to be valid, many statistical techniques require the assumption of a Normal distribution for all of their input variables. Most geochemical variables do not, in their "raw" state, display such a distribution and it is necessary to apply an appropriate transformation prior to data analysis. Interpretation is also aided greatly if the collectors of the samples observe and record certain key characteristics of the material they are sampling and the site from which it is collected.
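The integrity checks and transformation described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the conventions for missing and below-detection values, the substitution of half the detection limit, the choice of a base-10 logarithm as the transformation, and the function names are all assumptions, not practices laid down in this paper. In particular, a zero is treated here as "undetectable", whereas in a real file it may equally mean "not analysed", and that ambiguity has to be resolved with the laboratory.

```python
import math

def clean_value(raw, detection_limit):
    """Return a positive float, or None when the value is missing.

    Assumed conventions (hypothetical; adjust to the laboratory's own):
      - None or an empty string  -> missing
      - "<5", zero or a negative -> below detection, set to half the limit
      - ">10000"                 -> above the upper limit, set to that limit
    """
    if raw is None:
        return None
    text = str(raw).strip()
    if text == "":
        return None
    if text.startswith("<"):
        return detection_limit / 2.0
    if text.startswith(">"):
        return float(text[1:])
    value = float(text)
    if value <= 0:
        return detection_limit / 2.0
    return value

def log10_transform(values):
    """Log-transform cleaned values; missing values stay as None."""
    return [None if v is None else math.log10(v) for v in values]

# Made-up copper values (ppm) showing the problems listed above.
raw_cu = ["12", "<5", "0", "", "-5", "250"]
cleaned = [clean_value(v, detection_limit=5.0) for v in raw_cu]
logged = log10_transform(cleaned)
```

Whether a log transformation is "appropriate" should be checked for each element, for example with a histogram or probability plot of the transformed values.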
Methods exist for dealing with multielement data that do not involve multivariate statistics sensu stricto. The calculation of multielement indices is an example of how the element associations described above can be applied to optimize the response to certain mineralization types or important lithologies. It may, nevertheless, be felt that certain elements deserve greater weighting in such an index because of their greater importance as pathfinders for the deposit type sought. Gold and arsenic, for example, might be accorded greater weight than copper or lead in an index designed to detect manifestations of lode-gold mineralization. A spreadsheet application enables these calculations to be readily performed on large data sets. To deal with the "apples-and-oranges" problem, the values of each variable can be normalized or converted to percentiles of the overall population, or of subpopulations based on observable criteria such as mapped geology or regolith type.
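A weighted index of the kind described above might be computed as follows. The sketch converts each element to percentile ranks of the whole population and then takes a weighted mean. The weights, the element values and the function names are made up for illustration, and the weighted mean is only one of several reasonable ways to combine the percentiles.

```python
def percentile_ranks(values):
    """Convert values to percentile ranks (0-100); ties share their mean rank."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        # Extend j to the end of a run of tied values.
        while j + 1 < n and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0   # 1-based mean rank of the tied run
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return [100.0 * (r - 0.5) / n for r in ranks]

def weighted_index(data, weights):
    """Weighted mean of percentile ranks across the weighted elements."""
    n = len(next(iter(data.values())))
    pct = {el: percentile_ranks(data[el]) for el in weights}
    total = sum(weights.values())
    return [sum(weights[el] * pct[el][s] for el in weights) / total
            for s in range(n)]

# Made-up values for four samples.
data = {
    "Au": [2, 5, 40, 3],      # ppb
    "As": [10, 12, 300, 8],   # ppm
    "Cu": [30, 80, 45, 20],   # ppm
}
# Hypothetical weights: gold and arsenic count for more than copper.
weights = {"Au": 3.0, "As": 2.0, "Cu": 1.0}
index = weighted_index(data, weights)
```

To rank within subpopulations instead, the same function can be applied separately to the samples of each mapped lithology or regolith type.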
While indices are useful in many situations, they are an example of an alluring method that can lead to meaningless results if attention is not paid to geological realities. Furthermore, while the element associations are well enough documented in particular mineral-deposit types, the associations in commonly-sampled media may be modified by surface processes that are not so well understood. In such situations it may be preferable to derive the indices empirically from the data and interpret their significance a posteriori; this is the primary role of multivariate methods in exploration geochemistry.
Bivariate Statistical Methods (Correlation)
The Pearson correlation coefficient, if correctly applied, is a useful measure of the strength of the linear interdependence between two variables. Like other summary statistics, it is sensitive to departures from a Normal distribution (skewness and outliers in particular), and these must be rectified before any important conclusions are drawn from it, or before it is used as input to other statistical methods such as factor analysis.
The correlations between the variables in a data set are generally summarized in the form of a triangular correlation matrix. The understanding of such a matrix is improved by converting the magnitude of the correlation coefficients to a series of coloured or sized symbols. Surprisingly, most statistical software packages do not provide such an option.
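Where a package offers no such option, a symbol version of the triangular matrix is easy to produce. The sketch below computes Pearson coefficients on values assumed to be already log-transformed and prints the lower triangle as text symbols. The class boundaries (0.3, 0.5, 0.8), the symbols themselves and the data are arbitrary illustrative choices.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def symbol_for(r):
    """Map a coefficient to a 3-character symbol; '-' flags a negative value."""
    strength = abs(r)
    if strength >= 0.8:
        mark = "##"
    elif strength >= 0.5:
        mark = "++"
    elif strength >= 0.3:
        mark = ".."
    else:
        return "   "
    return mark + ("-" if r < 0 else " ")

def symbol_matrix(names, data):
    """Render the lower triangle of the correlation matrix as text lines."""
    lines = []
    for i, name in enumerate(names):
        cells = [symbol_for(pearson(data[name], data[names[j]]))
                 for j in range(i)]
        lines.append(f"{name:>3} " + " ".join(cells))
    return lines

# Made-up log-transformed values for five samples.
data = {
    "Cu": [1.0, 1.5, 2.0, 2.5, 3.0],
    "Zn": [1.1, 1.4, 2.1, 2.4, 3.1],
    "Ba": [2.0, 1.0, 2.5, 1.2, 2.1],
}
names = ["Cu", "Zn", "Ba"]
for line in symbol_matrix(names, data):
    print(line)
```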
Multivariate Statistical Methods
Computer processing is essential in the application of most multivariate statistical methods, as it would be prohibitively difficult and time-consuming to undertake them manually. These techniques can be usefully applied without detailed understanding of the underlying mathematics, although some understanding of how they work (and of the circumstances under which they do not) is mandatory. The key to understanding these apparently arcane methods lies in the graphical demonstration of a 2D (bivariate) situation and its intuitive extension into higher dimensions ("hyperspace").
When deriving any kind of summary statistics, multivariate or otherwise, from a large data set it is important to decide whether the feature sought constitutes a statistical rarity in the sampled population. In a regional drainage survey, for example, the major controls on each sample's composition are likely to consist of gross lithological and surficial or environmental agencies. The predominant element associations (factors), sample associations (clusters) or inter-element and intersample relationships of other kinds, like regression equations, are likely to reflect these controls, and unlikely to reveal much about mineralization, if its presence is manifested in only a few samples. On the other hand, the presence or proximity of mineralization is more likely to exert a discernible influence on the data as a whole, in the follow-up survey of a previously-defined anomaly.
Even when the response to mineralization constitutes a statistical rarity, it is sometimes possible to turn the method around and quantify the component of each element's composition that cannot be explained by the action of these dominant agencies. This residual component is more likely to be related to an unusual situation of which the presence or proximity of mineralization constitutes one example.
Whereas the correlation coefficient is a measure of the strength of the relationship between two variables, regression analysis provides a means of expressing its nature in quantitative terms. In the case of simple linear regression, a set of bivariate data, expressed graphically as an X-Y plot, is fitted with a straight line that may or may not pass through the origin. This line represents the best estimate of the relationship between what is termed the dependent variable (which is normally plotted on the y-axis) and the independent variable (x-axis), though no cause-and-effect relationship need be implied. Polynomial regression involves the fitting of a curve, rather than a straight line, to the scatterplot, while multiple regression involves the admission of more than one independent variable; this is analogous to fitting a surface, rather than a line, to a set of points in three (or more) dimensions.
While the observation that a relationship can be established between two geochemical variables may be of academic interest, the principal advantage of regression analysis, as applied to geochemical exploration, is in the isolation, in each sample, of the residual component of the dependent variable that cannot be predicted from the independent variable(s). A positive residual value indicates that the dependent variable is higher than predicted, while a negative value indicates that it is lower.
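The isolation of residuals can be illustrated with simple linear regression fitted by ordinary least squares. In this sketch the choice of iron as the independent variable and zinc as the dependent variable is hypothetical, as are the values, which are assumed to be already log-transformed.

```python
def fit_line(x, y):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope

def residuals(x, y):
    """Observed minus predicted value of the dependent variable, per sample."""
    intercept, slope = fit_line(x, y)
    return [b - (intercept + slope * a) for a, b in zip(x, y)]

# Made-up log-transformed values for five samples.
log_fe = [0.0, 1.0, 2.0, 3.0, 4.0]   # independent variable (hypothetical)
log_zn = [1.0, 2.0, 3.0, 6.0, 5.0]   # dependent variable (hypothetical)
res = residuals(log_fe, log_zn)

# The sample with the largest positive residual carries more of the
# dependent variable than the independent variable predicts.
most_anomalous = res.index(max(res))
```

Because a least-squares line is itself pulled towards extreme values, a robust fit may be preferable when a few samples are strongly anomalous; that refinement is outside this sketch.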
Most geochemical variables are measured on a continuous ratio scale. However, their ultimate function for the explorationist is as an aid to answering a single question to which there are only two answers (do we follow this up, or not?). When applied to single geochemical variables, this is the function of the much-abused "threshold" value.
When several variables are brought into play simultaneously, the assigning of samples to predefined groups can be achieved more rigorously, and this is the function of discriminant analysis, which is a multivariate method used to treat problems of classification. It is most commonly applied to situations where there are two previously-defined "training sets", which differ in some important, observable characteristic. From the multivariate observations that make up these two training sets, a single, data-specific discriminant function is derived. Solution of the function for the data on a single geochemical sample yields an index known as the discriminant score, which quantifies the affinity of the sample to one or other of the previously-defined training sets.
The method is useful in two-group situations where it is necessary to discriminate and classify "mineralized" and "unmineralized" or "altered" and "unaltered" samples, where other potential inhomogeneities of the sample medium are either insignificant, or have been compensated for prior to application. Modifications have, however, been devised for situations where more than two groups have been identified (for example, when multiple lithologies are present within the unmapped area of a regional stream- or lake-sediment survey).
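For the two-group case, one common formulation is Fisher's linear discriminant, sketched below with NumPy. It is offered as an illustration rather than as the particular method used for the examples in this paper. The sketch assumes equal prior probabilities and a covariance matrix shared by both groups; the training values (two log-transformed elements per sample) and the function names are made up.

```python
import numpy as np

def fisher_discriminant(group_a, group_b):
    """Return (weights, cutoff) of a two-group linear discriminant function.

    weights = inverse(pooled covariance) @ (mean_a - mean_b); the cutoff is
    the function value midway between the two group means.
    """
    a = np.asarray(group_a, dtype=float)
    b = np.asarray(group_b, dtype=float)
    mean_a, mean_b = a.mean(axis=0), b.mean(axis=0)
    # Pooled within-group covariance matrix.
    scatter = (a - mean_a).T @ (a - mean_a) + (b - mean_b).T @ (b - mean_b)
    pooled = scatter / (len(a) + len(b) - 2)
    weights = np.linalg.solve(pooled, mean_a - mean_b)
    cutoff = weights @ (mean_a + mean_b) / 2.0
    return weights, cutoff

def discriminant_score(sample, weights, cutoff):
    """Positive scores lean towards group A, negative towards group B."""
    return float(np.asarray(sample, dtype=float) @ weights - cutoff)

# Made-up training sets: two log-transformed elements per sample.
mineralized = [[2.0, 1.5], [2.4, 1.4], [2.1, 1.9], [2.5, 1.8]]
barren = [[1.0, 1.4], [1.3, 1.2], [0.9, 1.7], [1.2, 1.5]]
weights, cutoff = fisher_discriminant(mineralized, barren)

# Score an "unknown" sample against the two training sets.
score = discriminant_score([2.2, 1.6], weights, cutoff)
```

The magnitude of the score indicates how strongly the sample resembles one training set rather than the other, so scores can be plotted like any other geochemical variable.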
A typical multielement data set may consist of analyses for up to 30 different elements, but it is unlikely that these elements were emplaced in the soils by 30 different element-specific processes. Furthermore, the amount of a particular element in a sample is unlikely to be the result of only one process acting on the sample material. The strength of the correlations between certain elements in most naturally-occurring media bears witness to this.
Factor Analysis is a general term given to a variety of related techniques which seek to identify a limited number of controls on a much greater number of observational variables. These controls are modelled in the form of linear combinations of those variables, termed "factors". In geochemistry, it is reasonable to suppose that such factors will be more closely related to the processes that have acted on the naturally-occurring medium in question than are the individual elements. Unlike the multielement indices described above, the "loadings" on the individual elements are determined from the data, rather than from preconceived notions regarding their associations; however, the significance of a particular factor may be interpreted in the light of the relationship between its heavily loaded elements, and the natural processes under which they are known to be mobile. A useful by-product of Factor Analysis is that it often provides a means of concisely describing and summarizing the behaviour of a large number of elements in a geochemical data set.
For each factor, a "factor score", quantifying the influence of the factor in each sample, can be calculated. Factor scores can be plotted and contoured like any geochemical variable and it is often from their areal distribution that the most useful information can be gleaned. Once again, the factors themselves are unlikely to model the mineralization process, or the dispersion of its products, unless the data are from a detailed follow-up survey. The manifestation of mineralization may, however, be detectable in a few samples as the residual components of each geochemical value that cannot be explained by the factor model.
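Loadings and factor scores can be illustrated with principal components of the correlation matrix, which is used here as a simple stand-in for the family of factor-analysis techniques: no rotation is applied and no specific variances are estimated. The data are synthetic, generated from two hidden "processes" purely for demonstration, and the element names attached to the columns are hypothetical.

```python
import numpy as np

def principal_factors(log_data, n_factors):
    """Return (loadings, scores) from the correlation matrix of the data.

    loadings: variables x factors, eigenvectors scaled by sqrt(eigenvalue)
    scores:   samples x factors, scaled to unit variance
    """
    x = np.asarray(log_data, dtype=float)
    z = (x - x.mean(axis=0)) / x.std(axis=0, ddof=1)   # standardize columns
    corr = np.corrcoef(z, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(corr)            # ascending eigenvalues
    order = np.argsort(eigvals)[::-1][:n_factors]      # keep the largest
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    loadings = eigvecs * np.sqrt(eigvals)
    scores = z @ eigvecs / np.sqrt(eigvals)
    return loadings, scores

# Synthetic data: two hidden processes control five "elements".
rng = np.random.default_rng(1)
process_1 = rng.normal(size=100)
process_2 = rng.normal(size=100)
noise = lambda: 0.2 * rng.normal(size=100)
log_data = np.column_stack([
    process_1 + noise(),   # "Ni" (hypothetical label)
    process_1 + noise(),   # "Cr"
    process_1 + noise(),   # "Co"
    process_2 + noise(),   # "K"
    process_2 + noise(),   # "Ba"
])
loadings, scores = principal_factors(log_data, n_factors=2)

# Residual of each standardized value not explained by the two factors.
z = (log_data - log_data.mean(axis=0)) / log_data.std(axis=0, ddof=1)
residual = z - scores @ loadings.T
```

The `scores` columns are what would be plotted and contoured, while `residual` corresponds to the unexplained component mentioned above.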
Discriminant analysis was described above as a suitable method for the classification of "unknown" geochemical samples, based on the multivariate characteristics of samples of known affiliation. The role of cluster analysis is to seek and identify such groupings within a multivariate data set, without a priori information.
Clustering methods can begin with the assumption that every case in the data set represents a single cluster in multidimensional space; these are then agglomerated based on their mutual separation in that space. Alternatively, the initial assumption can be made that the data form a single cluster, which is then split up into smaller groups, to one of which each sample is assigned based on the distance between its plotted point and the centroid of the cluster. The clusters which are extracted can be interpreted in terms of the elements that are elevated and depleted in them, the areal distribution of the samples assigned to them, and any geological, geomorphological or environmental observations that characterize them.
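Assignment of samples to the nearest centroid can be illustrated with k-means, a widely used partitioning method that is adopted here only as a convenient example; it is not necessarily the algorithm behind the examples in this paper. The number of clusters, the naive choice of starting centroids and the data are all assumptions of the sketch.

```python
import numpy as np

def kmeans(points, k, n_iter=50):
    """Plain k-means; returns (labels, centroids)."""
    x = np.asarray(points, dtype=float)
    centroids = x[:k].copy()              # naive start: the first k samples
    labels = np.zeros(len(x), dtype=int)
    for _ in range(n_iter):
        # Distance from every sample to every centroid.
        dist = np.linalg.norm(x[:, None, :] - centroids[None, :, :], axis=2)
        labels = dist.argmin(axis=1)
        # Recompute each centroid; keep the old one if a cluster is empty.
        new_centroids = np.array([
            x[labels == c].mean(axis=0) if np.any(labels == c) else centroids[c]
            for c in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Made-up standardized values of two elements for six samples.
points = [[0.1, 0.2], [3.0, 3.1], [0.0, 0.1],
          [0.2, 0.0], [3.2, 2.9], [2.9, 3.0]]
labels, centroids = kmeans(points, k=2)
```

The centroid of each cluster shows which elements are elevated or depleted in it, and the labels can be plotted on a map to examine the areal distribution of each group. Variables should be transformed and standardized first, otherwise the element with the largest numerical range dominates the distances.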
A related method, variously termed multidimensional scaling, planing or nonlinear mapping, seeks to create a 2D projection of point data plotted in hyperspace in such a way as to minimize the discrepancy between interpoint distances in the 2D projection and the same interpoint distances in hyperspace. Though computer-intensive, this is surprisingly easy to achieve. The principal advantage of such a method, which can be applied either to samples, as with cluster analysis, or to variables, as with factor analysis, is that coassociations between the projected points can be identified with the human eye, which is much more efficient in this regard than any mathematical algorithm.
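The idea can be illustrated with classical (metric) multidimensional scaling, which obtains the projection in closed form from an eigendecomposition. This is a simpler relative of the iterative nonlinear mapping described above and is substituted here for brevity. The random data and the root-mean-square measure of discrepancy are assumptions of the sketch.

```python
import numpy as np

def pairwise_distances(points):
    """Euclidean distance between every pair of rows."""
    a = np.asarray(points, dtype=float)
    return np.linalg.norm(a[:, None, :] - a[None, :, :], axis=2)

def classical_mds(points, n_dims=2):
    """Project points into n_dims dimensions by classical scaling."""
    sq = pairwise_distances(points) ** 2
    n = sq.shape[0]
    centering = np.eye(n) - np.ones((n, n)) / n
    gram = -0.5 * centering @ sq @ centering     # double-centred matrix
    eigvals, eigvecs = np.linalg.eigh(gram)      # ascending eigenvalues
    order = np.argsort(eigvals)[::-1][:n_dims]   # keep the largest
    # Clip tiny negative eigenvalues caused by rounding before the root.
    return eigvecs[:, order] * np.sqrt(np.maximum(eigvals[order], 0.0))

def discrepancy(points, projection):
    """Root-mean-square difference between original and projected distances."""
    diff = pairwise_distances(points) - pairwise_distances(projection)
    return float(np.sqrt(np.mean(diff ** 2)))

# Synthetic example: 20 samples described by 4 standardized variables.
rng = np.random.default_rng(2)
samples = rng.normal(size=(20, 4))
projection = classical_mds(samples, n_dims=2)
misfit = discrepancy(samples, projection)
```

To map variables rather than samples, the same functions can be applied to the transposed data matrix, or to distances derived from the correlation matrix.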
A Final Note of Caution
The examples presented in this paper will demonstrate that when intelligently combined with clear geological reasoning, multielement analyses can more than pay for themselves in the assistance they provide to the mineral explorationist. However, if the conclusions arising from a multielement interpretation fly in the face of observable geological realities (not geological inferences, however dearly cherished), it is inappropriate to attach too much significance to them.