 1.1 Introduction
 1.2 Principal Component Analysis (PCA)
 1.2.1 Critical Assumptions Underlying PCA
 1.2.2 Number and Significance of Explanatory Variables Loading on a PC
 1.2.3 Number of Extractable PCs and Their Characteristics
 1.2.4 Total Variance of the Dataset
 1.2.5 What is an Adequate Sample Size for PCA and Further Forms of MV Analysis?
 1.2.6 Interpretability Criteria of PCs
 1.2.7 Varimax Rotation
 1.2.8 Example Case Study
 1.2.9 Examination of a Wider Range of Components
1.2.10 Consideration of Type I (False-Positive) Errors
 1.2.11 Determinations of the Suitability of MV Datasets for Analysis with PCA and FA
1.3 Partial Least Squares-Discriminatory Analysis (PLS-DA)
1.3.1 Case Study Describing an Example of PLS-DA ‘Overfitting’
 1.3.2 Permutation Testing
1.3.3 Procedures for the Validation and Cross-Validation of PLS-DA Models
 1.3.4 Attainment of the Final Calibration Model
 1.3.5 Quality Evaluation Processes
1.3.6 Cost-Benefit Analysis (CBA)
CHAPTER 1: Introduction to the Applications of Chemometric Techniques in ‘Omics’ Research: Common Pitfalls, Misconceptions and ‘Rights and Wrongs’

Published: 06 Nov 2014

Series: Issues in Toxicology
M. Grootveld, in Metabolic Profiling: Disease and Xenobiotics, ed. M. Grootveld, The Royal Society of Chemistry, 2014, pp. 1–34.
In this introductory chapter, the two most commonly employed techniques available for the multivariate (MV) analysis of multi-analyte, high-dimensional metabolomics or genomic datasets, specifically principal component analysis (PCA) and partial least squares-discriminatory analysis (PLS-DA), are reviewed in detail (the former represents an unsupervised exploratory data analysis technique, whilst the latter is a supervised pattern recognition one). In particular, the pivotal requirements of each of these methods for the satisfaction of critical assumptions (normality and homoscedasticity of potential MV predictor X variables, the linearity of relationships between each of these, etc.), the interpretability of results derived therefrom, and the assurance of adequate sample sizes for their performance, for example, are reviewed. For the PLS-DA approach, the critical requirement for a minimum sample size is discussed in much detail, together with considerations of the danger of ‘overfitting’, with a typical example. The importance of validation, cross-validation and random permutation testing (and also the range of methods available for the satisfactory performance of these processes) is also critically delineated, in addition to means available for model quality evaluation.
1.1 Introduction
In this first chapter, I shall focus mainly on the two most widely employed multivariate (MV) assessment systems available in practice, specifically Principal Component Analysis (PCA) and Partial Least Squares methods, particularly Partial Least Squares-Discriminatory Analysis (PLS-DA); the first of these is an unsupervised exploratory data analysis (EDA) method, the second a supervised pattern recognition technique (PRT). I have chosen to concentrate on these particular MV analysis methods here since there are numerous documented examples of their applications in the scientific, biomedical and/or clinical research areas in which they have sometimes been employed inappropriately, to say the least! Further details regarding the principles and modular applications of these two MV analysis approaches are provided in Appendices I and II.
1.2 Principal Component Analysis (PCA)
The applications of Principal Component Analysis (PCA)^{1,2} to the interpretation of MV metabolomic or chemometric datasets are manifold, and it is, perhaps, one of the most extensively applied techniques, examples of which are provided in refs 3–7; it is sometimes employed in the first instance, if only for the detection and removal of statistical ‘outlier’ samples. The principles of this method involve the reduction of a large MV dataset (such as that arising from the ‘bucketed’ ^{1}H NMR analysis of, say, a collection of biofluid samples, tissue biopsies or their extracts, or otherwise) to a much smaller number of ‘artificial’ variables known as Principal Components (PCs), which represent linear combinations of the primary (raw) dataset ‘predictor’ variables and, hopefully, will account for at least some, if not most, of their variance. These PCs can then, at least in principle, be employed as ‘predictor’ or criterion (X′) variables in subsequent forms of analyses. It is clearly a valuable technique to apply when at least some level of ‘redundancy’ is suspected in the dataset, i.e. when some of the X variables are correlated or highly correlated (either positively or negatively) with one another. In metabolomics experiments, it is often the case that one or more (perhaps many) biofluid metabolite concentrations (or proportionately related parameters such as a resonance, signal or peak intensity) will be significantly correlated with one (or more) others, either positively or negatively. Obviously, in such situations, many of the predictor (X) variables can be rendered redundant, and this forms the basis of the PCA technique in terms of its dimensionality reduction strategy.
PCA is a procedure that converts a very large number of ‘independent’ variables (more realistically described as ‘interdependent’ variables in view of their multicorrelational status), e.g. 0.02–0.06 ppm ^{1}H NMR spectral ‘buckets’ (which have variable frequency ranges if ‘intelligently selected’, and constant, uniform ones if not, the latter often being of a preselected size of 0.04 or 0.05 ppm), many of which are correlated, into a smaller number of uncorrelated PCs. Hence, a major objective of this form of multivariate analysis is to reduce the dimensionality (i.e. the number of independent, possible ‘predictor’ variables) of the dataset whilst retaining as much of the original variance as possible. Thus, the first (primary) principal component is that which explains as much of the total variance as possible, the second as much of the remaining variance as possible, and so on with each succeeding PC until one with little or no contribution to variance is encountered; all components are, of course, orthogonal to (i.e. uncorrelated with) each other.
PCA can effectively delineate differing classifications within MV metabolomics datasets, and this is conducted according to the following procedure:
The data matrix is reduced to the much smaller number of PCs describing maximum variance within the dataset through decomposition of the X predictor variable matrix (containing the integral NMR buckets) into a T scores matrix (containing class information projections of sample data onto each principal component through displacement from the origin) and a P loadings matrix (describing the variables that influence the scores), such that X = t_{1}·p_{1}^{T} + … + t_{A}·p_{A}^{T} + E, where the subscripted A value represents the total number of PCs and the residual information is included in a residual matrix E. The first PC should contain the maximum level of variance in the X matrix; the resulting deflated X matrix is then employed to seek a second component, orthogonal to the first, with the second highest variance contribution, and so on. PCA loadings with large values correspond to variables that have particularly high variance contributions towards them, and therefore impart more to the total variance of the model system investigated.
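For concreteness, the decomposition of X into scores, loadings and a residual matrix E can be sketched in a few lines of Python. This is a minimal illustration on simulated data (the dimensions and variable names are hypothetical), and the loadings are obtained here via an eigendecomposition of the covariance matrix rather than by the sequential deflation described above:

```python
import numpy as np

# Simulated 'bucket' table: 50 samples x 6 predictor (X) variables
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 6))
X = X - X.mean(axis=0)                 # mean-centre each column

# Loadings P are the eigenvectors of the covariance matrix of X
cov = np.cov(X, rowvar=False)
eigvals, P = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]      # order PCs by decreasing variance
eigvals, P = eigvals[order], P[:, order]

A = 2                                  # number of retained PCs
T = X @ P[:, :A]                       # scores matrix (t_1 ... t_A)
E = X - T @ P[:, :A].T                 # residual matrix

# Fraction of the total variance explained by each PC
explained = eigvals / eigvals.sum()
```

By construction the scores on different PCs are uncorrelated, and the first component carries the largest share of the total variance.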
However, there still remains much confusion regarding differences between the PCA and exploratory Factor Analysis (FA) techniques. Although similar in many respects (many of the stages followed are virtually identical), one of the most important conceptual differences between the two methods lies with the assumption of an underlying causal structure with FA (but not with PCA). Indeed, the FA technique relies on the assumption that covariation in the observed X variables is ascribable to the presence of one or several latent variables (or factors) that can (or do) exert a causal influence on the X variable dataset.^{8,9} In practice, researchers often use FA when they are aware of a possible causal influence of latent factors on the dataset (for example, the clear influence of thyroid disease status on blood plasma thyroxine levels, or a type 1/type 2 diabetes disease classification on blood plasma glucose and, where appropriate, ketone body concentrations), and this technique has been much more extensively employed in, for example, the social and environmental science areas rather than in metabolomics research; hence, an exploratory FA permits researchers to identify the nature, total number and relative influence of these latent factors.^{10} Similarly, for sufficiently large MV datasets, the multiple FA (MFA) method serves to determine underlying relationships or ‘signatures’ between a series of causal latent variables and the MV dataset attained. In FA or MFA, we may also add the ‘diagnostic’ or other variables as supplementary ones rather than as latent causal factors.
For PCA, however, no prior assumptions regarding potential underlying causal latent variables are made; indeed, it is simply a dimensional alleviation technique that gives rise to a (relatively) much smaller number of (uncorrelated) PCs which account for as much of the MV dataset as possible (although the influence of or differences between such latent or explanatory variables are, of course, frequently investigated in a metabolomics sense).
Since PCs are defined as linear combinations of optimally weighted predictor (X) variables, it is possible to determine the ‘scores’ vector of each sample on each PC that is considered significant (significance commonly being determined via a Scree plot^{11}). For example, the first PC may be primarily ascribable to selected metabolic differences between two (or more) disease classification groups, whereas the second may arise from a second series of perhaps unknown, unrelated metabolic perturbations, or alternatively a further influential (perhaps latent) variable such as dietary habit or history, or further differences between sample donors, for example those regarding gender, age, family, ethnicity status, etc. Figure 1.1 shows typical Scree plots arising from the metabolomic PCA of intelligently bucketed datasets arising from the ^{1}H NMR analysis of (a) human salivary supernatants (with 209 predictor variables, 480 samples and 2 oral health disease classifications) and (b) human urine (with only 22 predictor variables, 60 samples and again 2 disease classifications). For this latter example, we selected the most important bucket predictor variables via the prior performance of (1) model-directed repetitions (>60 times) of the logistic model of correlated component regression (CCR, as outlined in Chapter 3) with corresponding validation, cross-validation (CV) and permutation testing, and (2) selected computational intelligence techniques, again with accompanying validation, cross-validation and permutation testing. For the Scree plots displayed in Figures 1.1(a) and (b), Tables 1.1(a) and (b), respectively, list the number of PCs with eigenvalues >1 and their corresponding eigenvalues (i.e. the mean number of predictor X variables per PC), together with the percentages of the total variance accounted for by these PCs (both individually and cumulatively).
From Figure 1.1(a) and Table 1.1(a), it can be observed that 14 PCs had eigenvalues >1, the first (PC1) with an eigenvalue of 121.66 (i.e. a mean of 121.66 positively and/or negatively correlated predictor variables contribute towards it), the second ca. 27, the third 11 and the fourth 10, etc.; these first four PCs account for 58.2%, 12.85%, 5.3% and 4.9% of the total variance, respectively (81.2% in total). In Figure 1.1(b), however, only 8 PCs had eigenvalues >1, the first five accounting for only ca. 60% of the total variance. It should also be noted from Figure 1.1(b) that the Scree plot appears to have more than one simple breakpoint, the first after PC6, the second after PC12 (although PCs 9–12 are considered irrelevant since their eigenvalues are all <1). Therefore, for this latter example, it would appear that only PCs 1–6 should be considered as providing valuable MV information.
Table 1.1
(a)

PC      Eigenvalue    % Variance explained    % Cumulative variability
PC1     121.66        58.21                   58.21
PC2     26.86         12.85                   71.06
PC3     11.02         5.27                    76.34
PC4     10.19         4.87                    81.21
PC5     7.59          3.63                    84.84
PC6     6.59          3.15                    87.99
PC7     3.89          1.86                    89.85
PC8     3.05          1.46                    91.31
PC9     2.25          1.08                    92.39
PC10    1.86          0.89                    93.28
PC11    1.35          0.65                    93.93
PC12    1.22          0.58                    94.51
PC13    1.11          0.53                    95.04
PC14    1.04          0.50                    95.54
(b)

PC      Eigenvalue    % Variance explained    % Cumulative variability
PC1     3.60          16.36                   16.36
PC2     3.12          14.17                   30.53
PC3     2.46          11.18                   41.71
PC4     2.06          9.35                    51.06
PC5     1.94          8.82                    59.88
PC6     1.61          7.32                    67.20
PC7     1.18          5.35                    72.55
PC8     1.04          4.75                    77.30
1.2.1 Critical Assumptions Underlying PCA
Now here's the difficult part! Indeed, this is where many PCA applications to the analysis of metabolomics/chemometric datasets fall down, and hence fail, partially or completely, to provide satisfactory models for the diagnosis of human diseases, determinations of their severities, or responses to treatment, etc.
As with many alternative MV analysis techniques, the satisfactory application of PCA to the recognition of patterns or ‘signatures’ of metabolic biomarkers in metabolomics datasets (^{1}H NMR-derived or otherwise) is critically dependent on the satisfaction of a series of assumptions. Unfortunately, such assumptions are rarely checked, evaluated or monitored prior to the performance of PCA, and hence results acquired can hardly be considered as having a sound basis. However, as noted below, some of these assumptions are of much more importance than others, and the technique is relatively robust to violations of selected criteria.
These assumptions are:
Primarily, since PCA is conducted on a matrix of Pearson correlation coefficients, datasets acquired should satisfy all the relevant assumptions required for this statistic.
A random sampling design should be employed, and hence each biofluid, tissue or alternative sample should contribute one, and only one, value (specifically, metabolite concentration or related measure, normalised and/or standardised) towards each observed ‘predictor’ (X) variable; these values should ideally represent those from a random sample drawn from the population(s) investigated.
All biomolecule predictor (X) variables should be evaluated on suitable concentration (or directly proportional spectroscopic or chromatographic intensity measures), concentration interval or concentration ratio measurement levels.
Each predictor variable measurement (for example, concentration or signal intensity) should be distributed normally, and those that deviate from this (i.e. those that demonstrate a significant degree of kurtosis or skewness) can, at least in principle, be appropriately transformed in order to satisfy this assumption.
Each pair of predictor (X) variables in the plethora of those available in an MV dataset should conform to a bivariate normal distribution; specifically, plots derived therefrom should form an elliptical scattergram. Notwithstanding, Pearson correlation coefficients are remarkably robust against deviations from this assumption when the sample size is large (although this is often not the case in metabolomics experiments!). However, selected MV analysis techniques such as independent component analysis (ICA), which is covered in Chapter 3, also allow for quadratic or higher order polynomial relationships between the explanatory variables (although selected transformations of the dataset acquired may serve to convert such non-linear relationships to linear or approximately linear ones). An example which describes the application of a series of four such tests of normality for a large number of predictor X variables within a ^{1}H NMR multivariate ‘intelligently bucketed’ urinary dataset is provided in Chapter 2.
Appropriate transformations for the conversion of such non-normally distributed X variable datasets include: the logarithmic (log_{10} or log_{e}) transformation for variables in which the standard deviation is proportional to the mean value (in this case, the distribution is positively skewed); the square root transformation for variables in which the estimated variance (s^{2}) is proportional to the mean (which frequently occurs in cases where the variables represent counts, such as the number of abnormal cells within a microscopic field, etc.); the reciprocal transformation for variables with standard deviations proportional to the square of the mean (this is usually applied to highly variable predictors such as blood serum creatinine concentrations); and the arcsine (%)^{1/2} transformation for variables expressed as percentages, which tend to be binomially distributed (this transformation is likely to have some application to MV metabolomic datasets which have been normalised to a constant sum (say 100%), both with and without their subjection to the subsequent standardisation preprocessing step, details of which are provided in Chapter 2). Of course, the standardisation process (involving mean-centring and unit-variance scaling) will provide variables with mean values of zero and standard deviations and variance values of unity, and hence the performance of such transformations may be considered inappropriate. However, this standardisation process will certainly not achieve the conversion of a significantly skewed distribution into a non-skewed, symmetrical and perfectly normally distributed one!
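The four transformations listed above can be sketched as follows; the example values are entirely contrived for illustration (none are drawn from the datasets discussed in this chapter):

```python
import numpy as np

# Hypothetical, contrived example values for each variable type
x_skewed = np.array([1.2, 1.5, 2.0, 3.1, 4.8, 9.7, 22.4])  # positively skewed
x_counts = np.array([0, 1, 3, 4, 7, 12], dtype=float)       # cell counts
x_pct = np.array([2.0, 15.0, 40.0, 75.0, 98.0])             # percentages

log_t = np.log10(x_skewed)               # SD proportional to the mean
sqrt_t = np.sqrt(x_counts)               # variance proportional to the mean
recip_t = 1.0 / x_skewed                 # SD proportional to the mean squared
arcsin_t = np.arcsin(np.sqrt(x_pct / 100.0))   # binomial percentage data
```

Note that the reciprocal transformation requires strictly non-zero values, and the arcsine transformation expects proportions on the 0–1 scale (hence the division by 100).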
Watch out for outliers! The presence of even just one outlying data point can sometimes give rise to a strong (but overall false!) apparent correlation between, say, two metabolite levels, even if the complete dataset has been subjected to normalisation (row operation) and standardisation (column operation) procedures. Figure 1.2 shows an example of how this might arise. In addition to checking for outlying biofluid or tissue samples, which can easily be achieved by examinations of two- or three-dimensional PCA scores plots (such samples may arise from their collection from study participants taking or receiving project or clinical trial-unauthorised medication, or further programme-prohibited agents such as alcoholic beverages, for example), researchers should also endeavour to check all the predictor variables individually for such outlying data points, and perhaps remove them if proven necessary. In this manner, we can at least be confident that each predictor variable (column) dataset is outlier-free and will not be violating the ‘no-outlier’ assumption.
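The distorting effect of a single outlier on a Pearson correlation coefficient is easily demonstrated numerically; the figures below are deliberately contrived (two short 'metabolite level' vectors chosen so that their correlation is exactly zero before the outlier is appended):

```python
import numpy as np

# Two hypothetical 'metabolite level' variables constructed so that
# their Pearson correlation coefficient is exactly zero
a = np.array([1.0, 2.0, 3.0, 4.0])
b = np.array([2.0, 4.0, 1.0, 3.0])
r_clean = np.corrcoef(a, b)[0, 1]      # r = 0

# A single extreme sample appended to both variables
a_out = np.append(a, 20.0)
b_out = np.append(b, 20.0)
r_outlier = np.corrcoef(a_out, b_out)[0, 1]   # r = 0.98
```

One aberrant sample thus converts two completely uncorrelated variables into an apparently near-perfect correlation, which is exactly the hazard illustrated in Figure 1.2.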
1.2.2 Number and Significance of Explanatory Variables Loading on a PC
When one or more explanatory variables, biomolecular or otherwise, load on a principal component, it is highly desirable for researchers to have an absolute minimum of three or so of these X variables per component; indeed, it is generally considered good practice to retain five or more of these variables per component, since some of these may be subsequently removed from the diagnostic criteria developed. However, in metabolomics datasets consisting of perhaps 200 or more of such variables (such as those generated from the high-resolution ^{1}H NMR or LC-MS analysis of selected biofluids), it is not uncommon to encounter PCs that contain as many as 100–1000 or more of these X variables, which are all correlated (positively and/or negatively), and hence have autonomy and perhaps independence regarding their contributions to successive PCs, i.e. those which account for less and less of the total variance encountered in the dataset.
A further important consideration is whether or not a particular potential (biomolecular) explanatory X variable significantly loads on a specified component: this is generally considered to be the case if its PC loading value is >0.40. It is the author's view that these loadings should be checked and monitored more closely during the MV analysis of large or very large metabolomics datasets, since this does not seem to occur very often in the wide range of publications surveyed by the author! If a variable's loading does indeed exceed this value, then that predictor (X) variable can be considered as one which significantly contributes to the particular PC concerned.
1.2.3 Number of Extractable PCs and Their Characteristics
A PC is defined as a linear combination of ‘optimally weighted’ predictor (X) variables, and here ‘optimally weighted’ indicates that these variables are weighted in such a manner that the PCs arising therefrom account for the maximal proportion of variance in the complete dataset; the ‘linear combination’ descriptor refers to the fact that the scores on a particular component are generated by a simple weighted summation of those on each X variable.
In many PCAs performed on metabolomics datasets (^{1}H NMR-derived or otherwise), it is usually only the first few extracted components (say, up to 6, although this value can often be as many as 10 to 20) which qualify for retention, further interpretation and employment in any further forms of analyses (MV or alternative methods). The remaining PCs (which can, in principle, represent a very large number from typical metabolomics datasets containing 100–200 or more X variables) are likely to account for only trivial levels of the complete X variable dataset variance, and hence may be removed from the analysis. Of particular importance is the deletion of those PCs which have eigenvalues <1, i.e. those with an average of <1 predictor variable per component.
The first PC derived from the PCA of a metabolomics dataset will, of course, account for a maximal quantity of the total variance in the observed predictor (X) values, and hence it will also be significantly correlated with at least some (perhaps as many as 100 or so) of them. However, the second PC, which will account for the second largest percentage of such variance (and which was not accounted for by the first PC), is correlated with a smaller number of X variables that did not exhibit strong correlations with PC1. One major property of PC2 (and also subsequent PCs, i.e. PC3, PC4, PC5, etc.) is that it will be completely uncorrelated with PC1, i.e. the two PCs are orthogonal. Of course, the remaining PCs account for lower and lower percentages of the total X dataset variance and, again, they are all uncorrelated with each other, together with the first two (primary) PCs.
1.2.4 Total Variance of the Dataset
Since each of the observed variables is, in general, standardised during the course of PCA (although, for particular reasons, not always!), and each X variable therefore has a mean value of zero and unit variance (and hence unit standard deviation), the total variance of the dataset is the sum of the observed variables’ variances, and hence is equivalent to the number of X variables subjected to analysis in this manner. As an example, if 180 X variables are being considered and analysed, the total variance will be 180, and the extracted PCs effectively partition this variance, with PC1 perhaps accounting for 23 total variance units, the second PC (PC2) perhaps for 13 units, and so on; the PCA proceeds until all the dataset variance has been accounted for (although realistically it should be terminated when an eigenvalue falls below 1).
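This equivalence between total variance and the number of standardised X variables is easily verified numerically (a minimal sketch on simulated data; the 40 × 180 dimensions are hypothetical):

```python
import numpy as np

# Hypothetical dataset: 40 samples x 180 predictor (X) variables
rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=3.0, size=(40, 180))

# Standardisation: mean-centre and scale each column to unit variance
Z = (X - X.mean(axis=0)) / X.std(axis=0)

# The total variance of the standardised dataset equals the number of
# X variables (here, 180), which the extracted PCs then partition
total_variance = Z.var(axis=0).sum()
```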
1.2.5 What is an Adequate Sample Size for PCA and Further Forms of MV Analysis?
Basic PCA theory suggests that, since the method is designed as a large or very large sample process, the minimum number of samples subjected to analysis (by ^{1}H NMR, FTIR or LC-MS techniques, for example) should be the larger of 100 or 5 times the number of ‘predictor’ X variables. Therefore, if we have a ^{1}H NMR dataset with 200 or so resonance intensity buckets (‘intelligently selected’ or otherwise), then we should, at least in principle, have a sample size of 1000 or more! This clearly has implications for many such metabolomic investigations – indeed, the author has often seen many examples in the scientific or clinical research areas where the disease or response status of a series of biofluid samples has been ‘correctly’ classified from datasets containing only 20–30 or so samples (rows), sometimes as few as 10–12, whilst the number of intensity buckets (columns) approaches or exceeds 200! This sample size problem represents a major assumptive criterion in this research area, and many researchers clearly fail to allow for it, a factor which can regularly give rise to the ‘overfitting’ of experimental datasets to selected models in many further forms of MV statistical analysis (particularly the supervised PLS-DA technique, which has a reputation for being ‘over-eager to satisfy’!).^{12,13} However, PCA is somewhat less susceptible to this problem since it is an unsupervised EDA technique.
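The rule of thumb just described can be written as a one-line helper (the function name is the author's own invention for illustration):

```python
def minimum_sample_size(n_predictors, multiplier=5, floor=100):
    """Rule-of-thumb minimum sample size for PCA: the larger of `floor`
    samples or `multiplier` samples per predictor (X) variable."""
    return max(floor, multiplier * n_predictors)
```

For the 200-bucket ^{1}H NMR example above this gives 1000 samples, whilst for a dataset with only 10 predictor variables the floor of 100 samples still applies.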
As expected, if, in an experimental design, we select, say, 300 participants to serve as donors for a particular biofluid sample (with adequate control for the potential interference of xenobiotic agents), it is highly, if not extremely, likely that one or several of these may not be able to provide samples (or may only provide insufficient volumes of them), and hence they will not enter into the final analysis; a finite number of participants can always be expected to fail to provide specimens under the required pre-specified conditions of the experiment, and/or at the correct time-points, if appropriate (as specified in a Participant Information Sheet approved by the particular Research Ethics Committee involved). Therefore, it is always sensible to recruit a larger number of participants to the study (via its experimental design), say 350 in this case, in order to allow for this.
It should also be noted that these sample size criteria only represent minimum (lower level) requirements, and some researchers have made strong arguments that they should only be applicable if, firstly, many X (‘predictor’) variables are expected to load on each contributory PC, and, secondly, if the variable communalities are high, specifically if a particular X variable loads substantially on at least one of the retained PCs.
1.2.6 Interpretability Criteria of PCs
For each retained PC, it is of much importance to confirm that our interpretation of it makes ‘metabolic sense’ regarding the nature of the explanatory X variables employed, i.e. those which are found to load on each component. Basic selection criteria for this include requirements for (1) an absolute minimum of three variables, each with significant loadings on a particular retained PC; (2) the variables significantly loading on a selected PC sharing the same conceptual (metabolic) interpretation, e.g. perhaps these loadings arise from or relate to a disturbance (possibly only partial) in a particular metabolic pathway; and (3) the differing X variables loading on differing PCs reflecting differing constructs (e.g. if five metabolites load significantly on PC1, and four further ones load significantly on PC2, do the first five PC1-loading variables appear to reflect a construct that is, in principle, different from that reflected by those loading on PC2?).
Further considerations are that we should employ the minimum eigenvalue of ≥1.0 for each PC (especially since if a PC has an eigenvalue of <1, then an average of <1 X variable contributes towards it, and hence it is of no significance or consequence!), and also we should realistically determine the ‘break’ in the curve from the Scree plot acquired (often, these are unclear!). Since there can frequently be more than one such break in a Scree plot, the consideration of more than one possible solution may be required. It is also accepted that the combined retained PCs should account for a minimum of 70% of the cumulative variance; indeed, if <70% is covered, then it is recommended that alternative models with a larger number of PCs should be considered, perhaps those also including quadratic and/or multinomial representations of one or more of the potentially very many X variables.
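The two numerical criteria above (minimum eigenvalue ≥1, and ≥70% cumulative variance) can be combined into a small helper function; this is a sketch only (the function name is hypothetical), applied here to the eigenvalues listed in Table 1.1(b):

```python
import numpy as np

def retained_components(eigenvalues, n_variables, min_cum_variance=0.70):
    """Apply the Kaiser criterion (eigenvalue >= 1) and report whether the
    retained PCs jointly explain the requested share of the total variance
    (which, for standardised data, equals the number of X variables)."""
    ev = np.sort(np.asarray(eigenvalues, dtype=float))[::-1]
    n_keep = int((ev >= 1.0).sum())
    cum_share = ev[:n_keep].sum() / n_variables
    return n_keep, cum_share, cum_share >= min_cum_variance

# Eigenvalues >1 from Table 1.1(b): the 22-variable urinary dataset
ev_b = [3.60, 3.12, 2.46, 2.06, 1.94, 1.61, 1.18, 1.04]
n_keep, cum_share, adequate = retained_components(ev_b, n_variables=22)
```

For this dataset, the 8 PCs with eigenvalues >1 jointly explain ca. 77% of the total variance, so the 70% criterion is satisfied; had it not been, models with a larger number of PCs would merit consideration.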
1.2.7 Varimax Rotation
The rotated factor (PC) pattern should demonstrate a relatively ‘simple’ structure, i.e. (1) a range of the X variables should exhibit high loadings on only one retained PC, and nearzero ones on further PCs, and (2) most retained PCs or factors should demonstrate relatively high PC loadings for some X variables, and hopefully nearzero ones for the remainder.
Both PCA and FA primarily extract a series of components (otherwise known as factors) from a dataset, and these factors are predominantly orthogonal, and their relative importance is ordered according to the percentage of the total variance of the original dataset that these components account for.
However, generally only a (small) subset of these components is retained for further consideration, the remaining ones being considered as either noncontributory or nonexistent (for example, in ^{1}H NMRlinked metabolomics analysis, they may arise from measurement error or ‘noise’).
So that we can interpret the PCs/factors that are considered relevant, it is important that the preliminary selection step is succeeded by a ‘rotation’ of the PCs that were primarily isolated and retained. There are two major classes of rotation employed, specifically orthogonal (in which the newly constructed axes are orthogonal to each other) and oblique (in which there is no requirement for the new axes to be orthogonal to each other). Since the rotations are conducted in a subspace (known as the component or factor space), these new axes always explain a lower level of variance than the original components/factors (which are, of course, optimally computed), but the portion of variance explained by the total subspace following rotation remains the same as it was prior to rotation (i.e. only the variance partition has been modified). Since the rotated axes are not defined according to a pre-specified statistical inference, their major focus and advantage is to assist the interpretation of the results acquired.
Since these rotations take place in a subspace (specifically the retained component/factor space), this subspace must be optimally chosen, as the selection made strongly influences the results arising from the rotation. Therefore, a range of sizes for the retained factor subspace should be explored in order to evaluate the robustness of the rotation's final interpretation.
In general, the initial matrix is not interpreted, and the PCs/factors are rotated to generate a more parsimonious solution, in which each variable has a new combination of high and low loadings across the factors involved. The interpretation of this form of PCA or FA involves an identification of what is common amongst the variables which load highly on a particular component/factor (perhaps a chemopathological disturbance in a selected metabolic pathway), and what distinguishes them from those having low loadings on that particular one.
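An orthogonal varimax rotation of a retained loadings matrix can be sketched using the standard SVD-based iterative algorithm; this is an illustrative implementation on a hypothetical 6-variable, 2-component loadings matrix (not a loadings matrix from any dataset in this chapter):

```python
import numpy as np

def varimax(loadings, gamma=1.0, max_iter=100, tol=1e-8):
    """Orthogonal varimax rotation of a p x k loadings matrix
    (standard SVD-based iterative algorithm)."""
    L = np.asarray(loadings, dtype=float)
    p, k = L.shape
    R = np.eye(k)
    d_old = 0.0
    for _ in range(max_iter):
        LR = L @ R
        # Gradient of the varimax criterion
        grad = LR ** 3 - (gamma / p) * LR @ np.diag((LR ** 2).sum(axis=0))
        U, s, Vt = np.linalg.svd(L.T @ grad)
        R = U @ Vt                     # nearest orthogonal rotation
        d_new = s.sum()
        if d_new - d_old < tol:        # criterion has stopped improving
            break
        d_old = d_new
    return L @ R, R

# Hypothetical 6-variable, 2-component loadings matrix
L0 = np.array([[0.7, 0.3], [0.8, 0.2], [0.6, 0.4],
               [0.2, 0.8], [0.3, 0.7], [0.1, 0.9]])
rotated, R = varimax(L0)
```

Because the rotation matrix R is orthogonal, the variable communalities (row sums of squared loadings) and the total variance explained by the retained subspace are preserved exactly; only the partition of that variance between the rotated factors changes, as described above.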
1.2.8 Example Case Study
In this experimental PCA case study example, I attempt to relate a salivary ^{1}H NMR metabolomics dataset to a single classification model, the classification being the presence or absence of a particular oral health condition (i.e. healthy controls versus active disease qualitative classifications). The original dataset consisted of 209 ‘intelligently selected’ ^{1}H NMR bucket variables, and from Figure 1.3(a) it can be clearly observed that there are no visually apparent classification distinctions in three-dimensional (3D) interactive scores plots of PC3 vs. PC2 vs. PC1. However, three further, highly correlated false ‘dummy’ latent variables (with scores ranging from 0 to a maximum value of 10) were then introduced into the experimental design model (correlational details are provided in Table 1.2), and appended to the original dataset in a stepwise fashion, so that there were 210, 211 and finally 212 explanatory (X) variables in the ‘revised’ dataset. For these added variables, it was ensured that each one was strongly (Pearson) correlated with an assigned binary ‘disease classification’ score of 0 for no disease activity (i.e. the healthy control group) and 1 for the oral disease classification group. There was only a relatively small number of significant Pearson correlation coefficients between these dummy variables and the original, unsupplemented ^{1}H NMR bucket variables.
Table 1.2 Correlation matrix (Pearson) for the three false ‘dummy’ variables (X1–X3) and the binary disease classification score.

Variables        X1       X2       X3       Disease score
X1               1        0.9723   0.9238   0.9780
X2               0.9723   1        0.9412   0.9448
X3               0.9238   0.9412   1        0.8908
Disease score    0.9780   0.9448   0.8908   1
Figures 1.3(b)–(d) exhibit interactive 3D scores plots of the models in which 1, 2 and 3, respectively, of these false ‘dummy’ variables were added sequentially, the classification status being either ‘healthy control’ or ‘oral disease-positive’ patients. Clearly, the introduction of these three new variables gives rise to major differences in the levels of discrimination between the two disease status classifications. Indeed, the level of ‘Between-Disease Classifications’ distinction across these four datasets (0–3 added variables) clearly increases with the number of false ‘dummy’ variables included, although the inclusion of only one of them already gives rise to a satisfactory level of discrimination.
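This ‘false dummy’ effect is readily reproduced on wholly synthetic data. The sketch below is an illustrative Python simulation: the dimensions are chosen for speed rather than matching the chapter's 209-variable saliva dataset, and the three appended dummy variables are mutually correlated and correlated with the class label at r ≈ 0.95, in the spirit of Table 1.2.

```python
import math
import numpy as np

rng = np.random.default_rng(1)
n, p = 200, 30                      # assumed dimensions for this sketch
y = np.repeat([0, 1], n // 2)       # binary 'disease' classification
X = rng.standard_normal((n, p))     # pure-noise 'metabolite' variables

def pc1_separation(M, y):
    """Standardised distance between the two class centroids along PC1."""
    Z = (M - M.mean(axis=0)) / M.std(axis=0)
    _, _, vt = np.linalg.svd(Z, full_matrices=False)
    t1 = Z @ vt[0]                  # PC1 scores
    a, b = t1[y == 0], t1[y == 1]
    return abs(a.mean() - b.mean()) / math.sqrt((a.var() + b.var()) / 2)

before = pc1_separation(X, y)

# Append three mutually correlated 'dummy' variables, each strongly
# correlated with the class label (r ~ 0.95).
dummies = 2.0 * y[:, None] + 0.3 * rng.standard_normal((n, 3))
after = pc1_separation(np.hstack([X, dummies]), y)

print(before, after)   # the PC1 separation jumps once the dummies are included
```

The correlated dummy block forms a high-eigenvalue direction that PC1 then captures, so an apparently striking ‘between-classifications’ separation emerges from what is otherwise pure noise.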
1.2.9 Examination of a Wider Range of Components
A further important point for consideration is that, more often than not, one or more of the PCs or factors accounting for only a relatively small percentage of the overall dataset variance can be responsible for, and hence reveal, major distinctions between the subsequently specified supplementary PCA classifier variables, and may also offer much valuable information regarding specific biomarkers available in the dataset. Indeed, many researchers in the metabolomics area simply investigate and plot the first few (strongest) PCs against one another in an attempt to detect any significant discriminatory potential amongst the classification groups, and, in view of this, are sometimes disappointed! The author therefore recommends that investigators first perform dataset-constrained univariate significance testing procedures, i.e. t-tests and ANOVA, the latter containing and considering as many latent sources of variation as possible, together with those ascribable to their possible first- or second-order interactions; this constraint can be implemented via the attainment of a Bonferroni-corrected p value for testing the significance of the source of variation of major interest, for instance that ‘Between-Disease Classifications’.
In this manner, researchers may select putative metabolic biomarkers which exhibit the most highly significant differences ‘Between-Classifications’ or otherwise, and then search for these and their loadings on (contributions towards) up to the first 10, 15 or even 20 PCs (linear combinations), provided that they all have eigenvalues ≥1 and that they all contribute significantly, albeit in a relatively small manner, towards the total dataset variance. This approach, which involves a relatively unique combination of univariate and MV analytical approaches, serves to inform us about small numbers of metabolic biomarkers which do not contribute majorly or substantially to the first few PCs (PC1, PC2, PC3, etc.), but whose loaded biomolecular signals may nevertheless serve as major discriminatory indices between two or more disease classification groups.
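This univariate-then-MV screening strategy might be sketched as follows. This is an illustrative Python example only: the planted ‘biomarker’ variable, the normal approximation used for the t-test p values, and all dimensions are assumptions of the sketch, not values from the chapter.

```python
import math
import numpy as np

rng = np.random.default_rng(2)
n, p = 150, 50
y = np.repeat([0, 1], n // 2)
X = rng.standard_normal((n, p))
X[y == 1, 7] += 1.0               # plant one genuine 'biomarker' (variable 7)

# Step 1: univariate (Welch t-test) screening, Bonferroni-corrected threshold.
a, b = X[y == 0], X[y == 1]
t = (b.mean(0) - a.mean(0)) / np.sqrt(b.var(0, ddof=1) / len(b)
                                      + a.var(0, ddof=1) / len(a))
# two-sided p values via the large-sample normal approximation
p_vals = np.array([math.erfc(abs(ti) / math.sqrt(2)) for ti in t])
selected = np.where(p_vals < 0.05 / p)[0]

# Step 2: locate which of the first 10 PCs each selected variable loads on most.
Z = (X - X.mean(0)) / X.std(0)
_, _, vt = np.linalg.svd(Z, full_matrices=False)
for j in selected:
    pc = int(np.argmax(np.abs(vt[:10, j])))
    print(f"variable {j}: strongest loading on PC{pc + 1}")
```

The point of the second step is exactly that made above: the screened biomarker need not load heavily on PC1–PC3, so its discriminatory contribution can be missed if only the first few PCs are plotted.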
Figure 1.4 shows an example of this: a plot of PC9 versus PC8 from an experiment in which three predictor ^{1}H NMR ‘intelligently selected’ bucket intensities loaded substantially on PC8 and PC9 (each with loading values of ca. 0.40 on PC8, and percentage contributions of 12.1, 12.1 and 11.8% towards it, i.e. these three variables alone accounted for >35% of the total variance of this component). This experiment involved an exploration of the metabolic classification of human saliva specimens into two classification groups (here, healthy control participants versus those with a further known oral health condition); the eigenvalues of PC8 and PC9 were 3.20 and 2.91, respectively, i.e. approximately three explanatory X predictor variables loaded on each one. There were 204 potential explanatory X variables and a total of 428 salivary supernatant samples involved in this model system. Allowing for the presence of a number of ‘outlier’ samples (as noted above, outlier detection serves as an efficient means of ‘policing’ clinical trials, for example the detection of samples containing exogenous agents such as drugs, oral healthcare product agents or further ‘foreign’ exogenous agents in participants who are not rigorously adhering to clinical trial protocols), it is clear that there is a major distinction between the two disease classification groups: the disease group has a centroid with positive scores for both PC8 and PC9, whilst the healthy control group's centroid has a negative score for both these PCs (95% confidence ellipses for these two classification groups are also exhibited).
1.2.10 Consideration of Type I (False-Positive) Errors
If, as in the PCA, PLSDA and Partial Least Squares-Regression (PLSR) techniques, or a range of further MV analytical methods based on the preliminary computation of a Pearson correlation (and covariance) matrix, we primarily generate such a matrix of, for example, 200×199/2 = 19,900 distinct pairwise Pearson correlation coefficients for an experimental design incorporating 200 predictor (X) metabolic variables generated via ^{1}H NMR or LC-MS analysis, and if we specify a significance level of p=0.01, then we will achieve an average of ca. 199 apparently highly significant correlations purely by chance alone! Furthermore, if our p value was more liberally set to 0.05, this probability-mediated number of significant correlations would escalate to nearly 1,000! These considerations are outlined in more detail in Chapter 2.
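The expected number of chance correlations is easy to verify by simulation. In the illustrative Python sketch below (the dimensions are assumed, and the critical |r| is approximated via the Fisher z-transform), a pure-noise dataset of 50 variables yields roughly the α-fraction of its 1,225 distinct pairwise coefficients as ‘significant’.

```python
import math
import numpy as np

rng = np.random.default_rng(3)
n, p, alpha = 100, 50, 0.01        # assumed dimensions for the simulation
X = rng.standard_normal((n, p))    # pure noise: no genuine correlations exist

R = np.corrcoef(X, rowvar=False)
iu = np.triu_indices(p, k=1)       # each variable pair counted once
m = len(R[iu])                     # 50 x 49 / 2 = 1225 distinct coefficients

# Approximate two-sided critical |r| at alpha via the Fisher z-transform.
z_crit = 2.5758                    # standard normal quantile for alpha = 0.01
r_crit = math.tanh(z_crit / math.sqrt(n - 3))

count = int((np.abs(R[iu]) > r_crit).sum())
print(f"{count} 'significant' correlations observed; {alpha * m:.1f} expected by chance")
```

Scaling this to 200 variables (19,900 pairs) gives the ca. 199 chance ‘discoveries’ at p=0.01 noted above.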
1.2.11 Determinations of the Suitability of MV Datasets for Analysis with PCA and FA
How exactly do we determine whether or not PCA or FA is appropriate for application to our MV metabolomics, proteomics or genomics datasets? Well, firstly we may employ the Kaiser–Meyer–Olkin (KMO) measure of sampling adequacy, and this method serves to provide essential information regarding whether or not the magnitudes of the partial correlations measured amongst variables are sufficiently low. If two variables share a common PC or factor with a series of further variables, their partial correlation coefficient (ŕ_{ij}) will be low, and this criterion will serve to inform us of the ‘unique’ variance shared between them [readers should note that such partial correlations, and their further application to the analysis of MV datasets, for example in Gaussian Graphical Models (GGMs), are outlined in more detail in Chapter 3].
Critical considerations include whether or not the relationships existing between the predictor (X) variables are strong enough for us to be confident in proceeding with the application of a PCA or FA model to the dataset. Indeed, the KMO test represents an index for comparison of the magnitudes of the observed (Pearson) correlation coefficients with those of the partial ones [eqn (1), in which r and ŕ depict the Pearson and partial correlation coefficients respectively, the latter equivalent to r_{ij•1,2,3,…k}]:

KMO = Σ_{i≠j} r^{2}_{ij} / [Σ_{i≠j} r^{2}_{ij} + Σ_{i≠j} ŕ^{2}_{ij}]   (1)

Hence, if ŕ^{2}_{ij}≈0, then the KMO statistic ≈1 and we may conclude that the predictor variables explored serve as representative measures of the same PC or factor, whereas if ŕ^{2}_{ij}≈1, then the variables involved are not considered to be measuring the same PC or factor. High values of the KMO statistic therefore indicate that the application of PCA or FA models to the datasets acquired is an acceptable approach for their analysis (an absolute minimum value of 0.50 is required). Generally, such models are considered exceptional if the statistic's value is >0.90, very good if it lies between 0.80 and 0.90, good for values between 0.70 and 0.80, mediocre for values within the 0.50–0.70 range, and unacceptable if <0.50.
However, a further method of determining the strength of the relationships amongst the predictor variables is Bartlett's Sphericity Test, which simply evaluates the null hypothesis that the variables present within the whole population's correlation matrix are uncorrelated, i.e. that the intercorrelation matrix is derived from a population in which the X variables are non-collinear (specifically, an identity matrix). This test computes the determinant of the matrix of the sums of products and cross-products which generates the intercorrelation matrix; this matrix determinant is then tested for its statistical significance via a Chi-squared statistic.
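Bartlett's statistic is equally compact. The sketch below (illustrative Python; the test data are invented) uses the usual form −[n − 1 − (2p + 5)/6]·ln|R|; the resulting statistic should be referred to a Chi-squared table with p(p−1)/2 degrees of freedom.

```python
import math
import numpy as np

def bartlett_sphericity(X):
    """Bartlett's test statistic for H0: the population correlation matrix is an identity."""
    n, p = X.shape
    R = np.corrcoef(X, rowvar=False)
    chi2 = -(n - 1 - (2 * p + 5) / 6.0) * math.log(np.linalg.det(R))
    df = p * (p - 1) // 2
    return chi2, df

rng = np.random.default_rng(5)
n, p = 300, 6
factor = rng.standard_normal(n)
X = factor[:, None] + 0.3 * rng.standard_normal((n, p))   # strongly intercorrelated

chi2, df = bartlett_sphericity(X)
print(chi2, df)   # a very large statistic on 15 df: H0 is emphatically rejected
```

For strongly intercorrelated data the determinant of R is close to zero, so the statistic is enormous and sphericity is rejected, i.e. the dataset is suitable for PCA/FA.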
1.3 Partial Least Squares-Discriminatory Analysis (PLSDA)
Typical metabolomics profiling investigations involve two or more classifications of participants (human, animal, plant, cell or otherwise), and when there are only two of these, they can be divided into disease case versus healthy control or perhaps treatment versus untreated control groups. These investigations can be performed in either an exploratory or a predictive manner: the former is focused on whether the dataset acquired contains a level of information which is sufficient for us to discriminate between the two classifications, whilst the latter's objective is to determine whether or not we can predict whether an unknown sample can be successfully classified into one of these two (or more) groups, and, if so, to what level of confidence, exactly?
Partial Least Squares Discriminatory Analysis (PLSDA) is based on the Partial Least Squares (PLS) model (Appendix I) in which the dependent (Y) variable represents membership of a particular classification (e.g. diseased versus healthy control, etc.), and since common metabolomics experiments contain a very large number of resonances, signals or peaks representing a multitude of biomolecules (at least some of which may serve to be valuable biomarkers of diseases and perhaps also their activities), these considerations can sometimes present many perplexing choices for mathematical modelling, validation and CV options. Indeed, as noted above for PCA, the minimum sample size required for a satisfactory model increases substantially with the number of variables monitored, and since the number of samples provided for analysis and/or sample donors is frequently somewhat or even much lower than the number of predictor (X) variables incorporated into the model, this leads to many validation challenges. These problems arise in view of the increasing likelihood of models with (apparently) effective group classifications which are generated purely by chance alone (via the now increasingly recognised ‘overfitting’ problem)!
Hence, a recommended means for the MV analysis of any metabolomics dataset is to employ a (relatively large) series of randomly classified datasets in order to establish the reliability and precision of the model's predictive capacity; Westerhuis et al. (2008)^{12 } have provided some valuable and convincing examples of the importance of classifying metabolomic datasets according to a series of selective rules which involve the prior random permutation of disease and/or treatment classifications for PLSDA models and, correspondingly, of related binary score values for PLS-regression (PLSR) ones.
Since far too many metabolomics investigations seem to involve far too many predictor (X) variables, and perhaps far too few analysed samples, these arguments are very true and valid; indeed, many of the studies reported in the relevant scientific literature can involve as many as 200–1000 or more X values, and perhaps as few as 20–40 samples for such MV analysis! Under such circumstances, employment of the above PLS models will nearly always give rise to a perfect clustering/separation of the two (or more) classifications investigated. A reputable and perhaps famous quote by Snedecor in 1956^{14 } is that there will be a perfect predictive fit between a single dependent variable and six ‘predictor’ variables in an experimental model which also contains only six samples, with measurements provided for the dependent variable and also the six independent ones. This is probably the best statement regarding the overfitting ‘curse of dimensionality’ that I am aware of, and, unlike many metabolomics investigations, this example only involves the fitting of six predictor variables to an equivalent number of cases. Therefore, metabolomics experimenters should carefully consider this fact when attempting to ‘fit’ as many as hundreds or even thousands of X variables to the bioanalytical profiles of as few as 20 or so biofluid or tissue biopsy samples collected!
The validation and CV of PLSDA and PLS-regression (PLSR) models is indeed an area of serious concern, and a number of pertinent reviews have revealed that acceptable methods for these processes are either lacking or not even attempted,^{15–18 } and have also outlined the most important problems associated with them. Indeed, as delineated above, one of the most important of these considerations is a very limited sample size; in view of the high economic cost of acquiring multicomponent bioanalytical profiles on biofluids and/or tissue sample biopsies (including the collection of the sample itself), this is very often the case!
In order to effectively evaluate the results acquired, CV processes can be performed; however, Anderssen et al. (2006)^{19 } have noted that very often the methods selected for these are either erroneous or, for that matter, not performed in correct or acceptable manners. Indices frequently employed to quantify the effectiveness of classification selection criteria include (1) simply the number of misclassifications; (2) the ubiquitous (and sometimes mysterious) Q^{2} value, which indicates the variation of predicted values and hence the quality of prediction (its maximum value is 1, and it can even be negative for poorly predictive models; values of 0.50 and 0.90 are considered good and excellent, respectively); and (3) a wide range of criteria determined from sums and/or ratios of the true and false positives and/or negatives of what is classically known as a confusion matrix. Furthermore, it is also common for researchers in this area to provide the Area Under the Receiver Operating Characteristic Curve (AUROC). If this value is close to 1.0, then the classification criteria are viewed as ‘good’ or even ‘excellent’, whereas if it is close to 0.50, then the classification function employed is considered to be of very little or zero use. However, it is important to note that there remains a major problem with all of these model efficacy evaluation measures: the value corresponding to a high level of classification efficacy is unknown, and p values for the statistical significance of the discriminatory effects observed are rarely provided (in any case, such a value is critically dependent on the number of samples placed in both the ‘training’ and ‘test’ sets).
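These quality indices are straightforward to compute. The following sketch (illustrative Python with invented toy data) implements Q², a rank-based AUROC, and a simple misclassification count.

```python
import numpy as np

def q2(y_true, y_pred):
    """Q2 = 1 - PRESS/TSS; it can be negative when predictions are worse than the mean."""
    press = ((y_true - y_pred) ** 2).sum()
    tss = ((y_true - y_true.mean()) ** 2).sum()
    return 1.0 - press / tss

def auroc(y_true, scores):
    """AUROC via the rank (Mann-Whitney) formulation, with ties counted as 0.5."""
    pos, neg = scores[y_true == 1], scores[y_true == 0]
    greater = (pos[:, None] > neg[None, :]).sum()
    equal = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * equal) / (len(pos) * len(neg))

y = np.array([0, 1, 0, 1])
pred = np.array([0.1, 0.9, 0.2, 0.8])
print(q2(y, pred))                       # 0.9
print(auroc(y, pred))                    # 1.0 -- every case ranks above every control
print((np.round(pred) != y).sum())       # 0 misclassifications at a 0.5 threshold
```

As the chapter stresses, none of these indices carries a built-in significance level; the permutation approaches discussed below supply the missing null distribution.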
Most of us are already aware that models constructed from routine or even especially selected CV techniques can contain differing numbers of PLS components and, for that matter, differing ‘significant’ predictor variables with different loading coefficients for each subset of these models, and with the exception of a number of recent developments in this area (particularly those involving random permutation testing and determinations of the statistical significance of such evaluations), for example Westerhuis et al. (2008),^{12 } there are currently no or very limited acceptable criteria for this. This is, of course, of much significance regarding the transference of such information from the subset of models to the full dataset, and, more importantly, for its future application to the diagnosis, and perhaps severity determination and monitoring of the chemopathologies of disease processes and their treatment regimens. Moreover, what is the clinical significance of these models?
1.3.1 Case Study Describing an Example of PLSDA ‘Overfitting’
Here, a typical example of the ‘overfitting’ of datasets by PLSDA to a model employing 222 predictor X variables (intelligently selected chemical shift bucket intensities of the ^{1}H NMR profiles of human urine, normalised and standardised a priori) and only 20 biofluid samples is described. This may appear to be statistically unacceptable to many readers (and of course, it is!), but this form of experimental design and MV analysis is not that uncommon in the scientific/biomedical literature!
For this experiment, the classification groups of the healthy control and disease classifications (10 in each group) were randomly permuted 30 times, and then PLSDA was performed on each of these permuted classification status sets.
Figure 1.5 shows PLSDA scores (t_{2} versus t_{1}) plots for six of the PLSDA sample classification permutations tested in this manner. Clearly, there are very high levels of sample classification clusterings and hence discrimination notable for each of these examples, and these results provide ample evidence for the overfitting of a very large number of predictor (X) variables (222) to a statistically small sample size (n=20) using this technique; the acquisition of falsepositive results in this manner is not that unusual in the metabolomics/scientific literature. However, out of the complete set of 30 random permutations of the sample classification status, models with only a single PLSDA component were constructed in 13 cases (Q^{2}=−0.070±0.272, mean±SD), and those with two components were built in 14 cases (Q^{2}=−0.075±0.308 and 0.081±0.328 for the first and second components, respectively). Moreover, a further one of the sample classifications had a total of five components (with Q_{1}^{2} and Q_{2}^{2}=0.273 and 0.403 for the first and second components, respectively), and another had as many as seven (with Q_{1}^{2} and Q_{2}^{2} values of 0.005 and 0.061, respectively)!
1.3.2 Permutation Testing
Similarly, for the above case study, a series of permutation tests was employed in order to explore relationships between the full set of 222 ^{1}H NMR bucket ‘predictor’ variables and the hypothetical disease classification status. This rigorous testing system serves to determine whether or not the classification of the study participants by disease status is significantly better than that arising from any random classification of these groups: the class labels of the healthy control and disease classifications are permuted, and then randomly assigned to different patients. With these ‘incorrect’ disease class labels, a classification model was again computed. Hence, the rationale is that for such ‘incorrect’ class labels, the computed classification model should be ineffective at class prediction (since the groups are generated randomly, the null hypothesis is that there are no differences between them). By repeating this permutation test many times (2000 times individually for each of the initially randomly assigned class labels, i.e. an overall two-phase randomisation process), a null distribution of classifications expected to be insignificant was formed; if the computed pseudo-F statistic for the ‘real’, genuine classification labels lies outside at least the 95% or 99% confidence bounds of this distribution, then it can be concluded that there is a significant (linear) relationship between the X predictor variables and classification status.
For 52 of the 56 randomly-permuted class labels, the pseudo-F statistic was not significant (i.e. p>0.050; Figure 1.6); the four that were significant had p values of 0.0495, 0.0375, 0.024 and 0.011. Therefore, with a significance level of 0.05, we would expect, on average, approximately 2.8 (i.e. 56×0.05) of these statistic values to be significant by chance alone, and the four significant values obtained here are not that far off this expected figure!
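A minimal permutation test of this kind might be sketched as follows. This is an illustrative Python example: the pseudo-F form used (ratio of between-class to within-class sums of squares), the planted class difference and the use of 199 permutations are all assumptions of the sketch, not the chapter's actual protocol.

```python
import numpy as np

rng = np.random.default_rng(6)
n, p = 60, 20
y = np.repeat([0, 1], n // 2)
X = rng.standard_normal((n, p))
X[y == 1] += 1.0                 # a genuine multivariate class difference

def pseudo_f(X, y):
    """Pseudo-F: between-class to within-class sum-of-squares ratio."""
    grand = X.mean(axis=0)
    ssb = sum((y == c).sum() * ((X[y == c].mean(axis=0) - grand) ** 2).sum()
              for c in (0, 1))
    ssw = sum(((X[y == c] - X[y == c].mean(axis=0)) ** 2).sum() for c in (0, 1))
    return (ssb / 1.0) / (ssw / (len(y) - 2))

observed = pseudo_f(X, y)
null = np.array([pseudo_f(X, rng.permutation(y)) for _ in range(199)])
p_value = (1 + (null >= observed).sum()) / (1 + len(null))
print(observed, p_value)   # observed statistic lies far outside the permutation null
```

The p value is simply the proportion of permuted statistics at least as extreme as the observed one (with the customary +1 correction), which is exactly the logic underlying Figure 1.6.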
In a further PLSDA experiment, the above random permutations were also performed in order to test the ‘overfitting’ of the model to a total of 10 sample donors (patients) included in the study, again with 222 explanatory X variables and only n=20 samples collected therefrom. A typical result arising from this further investigation is shown in Figure 1.7; the t_{2} versus t_{1} scores plot obtained reveals that quite a high level of distinction is achievable between each of the 10 participants involved by PLSDA overfitting in this experimental design which contains many more X variables than samples available (>10fold in this case)!
1.3.3 Procedures for the Validation and Cross-Validation of PLSDA Models
As indicated above, although used very infrequently, a critically important aspect of CV processes is permutation testing, which usually involves the analysis of a very large number (say 500–500,000, or even more) of versions of the dataset with randomly-assigned classification labels. In this manner, a random distribution for the null (H_{0}) hypothesis that no differences exist between the two (or more) classifications is attained, and hence we are able to test the significance of any key differences observable, MV or otherwise.
The major advantage offered by such permutation testing is that, via the analysis of a very large number of versions of the complete dataset (say, up to 10,000 or so) with randomly assigned classification labels, a reference (null or H_{0}) distribution is acquired. If our computed statistic (e.g. a pseudo-F ratio statistic value, as employed in redundancy or partial redundancy analysis) lies within this distribution without a significant p value (<0.05, <0.01 or otherwise, a parameter pre-selected by the researcher), then we can conclude that there is no evidence for a significant departure from the null hypothesis, and hence no significant influence of the disease classification and/or an administered therapeutic regimen or toxic agent on the metabolic profile of the biofluid or tissue biopsy sample evaluated; the latter applies to animal model experiments performed to investigate the toxicological insults and effects of selected agents (on target organs such as the liver or kidney, for example).
The performance of such permutation testing has revealed that the erroneous application of CV methods can often give rise to overly (and sometimes far too) optimistic classification outputs. A range of previous publications focused on this area have indicated this danger and sometimes confirmed it via detailed assessments, e.g. Westerhuis et al. (2008).^{12 } Indeed, when CV is employed incorrectly, a result consisting of only a small number of misclassifications is readily obtainable. Moreover, the application of such permutation testing to the now highly utilised PLSDA model quality parameters such as Q^{2} and AUROC values, together with the misclassification rate(s), provides null hypothesis (H_{0}) distributions of the values obtainable when there are no or near-zero differences between two (or more) classification criteria, and in this manner we have the ability to determine which particular CV index lies significantly outside this permuted null distribution. We may thus propose the application of a range of model systems differing only slightly from one another, rather than just a single one, and as such derive a series of estimates of classification memberships. Indeed, Westerhuis et al. (2008)^{12 } have argued strongly that such an extensive series of model systems should be employed as a powerful confidence measure and reassurance index for such classification membership assignment tasks.
There is a major requirement for the employment of CV models in view of the frequently (and increasingly!) small numbers of biofluid or tissue biopsy samples available for such metabolomics investigations, especially since their prior segregation into ‘training’, ‘validation’ and ‘test’ sets is, for a large number of studies, just not possible. Hence, selected CV techniques serve to provide a more realistic use of datasets tested in this manner (although we should, of course, note that it is required to expose the complete modelling process to CV strategies in order to yield a reliable error rate estimation). A further important consideration is that the classification index or indices predicted should not, under any circumstances, be employed for model development.^{14,15,18 } Although this particular stringent requirement has been noted in a relatively large number of publications, unfortunately it remains a very uncommon practice!
So that we may be confident with the nature of and results acquired from the CV method performed, the dataset should be divided into training, optimisation (validation) and test sets; a model is developed from the training and optimisation datasets, and the test set is then employed solely for determining the model's performance.
Repetition of such a process in a manner involving the inclusion of each sample in the test set only once allows a realistic estimate of prediction error which is representative of future samples entering the model test system. In order to ensure the complete independence of the test set, samples therein should remain exclusive to all operations involved in the model's development, including prior dataset pretreatment systems employed for the ‘training’ set, for example transformations, normalisation, scaling and standardisation, etc.
In the single cross-validation (1CV) method, which is employed for an extensive range of systems and applications, a number of samples (or sample donors) are removed from the complete dataset and utilised as a validation set. The remaining samples, which form a training set, are then employed to generate a whole series of classification models with the number of PLS components ranging from 1 to perhaps 10 or 20 (although the latter, higher component range can sometimes be a little unlikely or unrealistic!). Predictions are then made for all validation set members, and the predictive errors of all these developmental models are stored for future use. A new validation subset is then selected, and this process is repeated until every sample has been placed in the validation set once and only once; in this manner, the total predictive error for each model across all validation samples is obtained, and the model with the lowest predictive error then serves as the optimal one for further development and, hopefully, application to real test samples! The predictive errors acquired via the employment of this technique are then utilised to compute the Q^{2} value and the misclassification rate. For this particular CV model, it should be noted that the samples placed in the validation set are also utilised to determine the most effective model parameters; they therefore do not remain completely independent, which represents an important requirement for an acceptable CV model.
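The 1CV procedure might be sketched as follows. This is an illustrative Python example: the single-response NIPALS-style PLS routine and the simulated data are generic textbook constructions, not the chapter's own code.

```python
import numpy as np

def pls1_fit(X, y, n_comp):
    """Fit a single-response PLS model with n_comp components (NIPALS deflation)."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    W, P, Q = [], [], []
    for _ in range(n_comp):
        w = Xc.T @ yc
        w /= np.linalg.norm(w)
        t = Xc @ w
        tt = t @ t
        p_load, q = Xc.T @ t / tt, (yc @ t) / tt
        Xc = Xc - np.outer(t, p_load)      # deflate X and y
        yc = yc - q * t
        W.append(w); P.append(p_load); Q.append(q)
    W, P, Q = np.array(W).T, np.array(P).T, np.array(Q)
    coef = W @ np.linalg.solve(P.T @ W, Q)  # overall regression coefficients
    return coef, x_mean, y_mean

def cv_press(X, y, n_comp, folds=5):
    """Single CV (1CV): total squared prediction error over held-out folds."""
    idx, press = np.arange(len(y)), 0.0
    for f in range(folds):
        test = idx[f::folds]
        train = np.setdiff1d(idx, test)
        coef, xm, ym = pls1_fit(X[train], y[train], n_comp)
        press += ((y[test] - (ym + (X[test] - xm) @ coef)) ** 2).sum()
    return press

rng = np.random.default_rng(7)
X = rng.standard_normal((80, 20))
y = X[:, 0] + X[:, 1] + 0.1 * rng.standard_normal(80)   # two informative variables

tss = ((y - y.mean()) ** 2).sum()
q2_cv = {k: 1 - cv_press(X, y, k) / tss for k in range(1, 6)}
best = max(q2_cv, key=q2_cv.get)
print(best, q2_cv[best])     # the component count with the lowest predictive error
```

Note that, exactly as stated above, the validation folds here are used to choose the number of components, so the resulting Q² is not a fully independent estimate; that is the gap the 2CV scheme below closes.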
Cross-model validation (i.e. double cross-validation, abbreviated 2CV) has been put forward as a system suitable for dealing with problems arising from the dependency existing between the prediction error for new samples and the model optimisation parameters. In this system, one series of samples is completely isolated as the ‘test’ set, and the remaining ones then undergo a single CV testing process, in which they are further subdivided into training and validation sets (the single CV regimen again giving rise to an optimum number of PLS components). The optimised model derived therefrom is then employed to predict the classification status (disease class or otherwise) of those samples (biofluid, tissue biopsy, etc.) placed into the test set. Subsequently, the whole process is repeated until all the samples have been placed in the test set once, and only once; it should also be noted that it is of much importance to select the validation samples in a random manner, in order to further optimise the inclusion of differing combinations of validation and training sets for each newly selected test set. In this manner, the final model will have been constructed in the complete absence of the test set, and hence its predictive capacity remains independent of the model optimisation regimen utilised.^{20 }
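The 2CV nesting can be sketched generically. In the illustration below (Python; ridge regression stands in for PLS purely to keep the sketch short, since the nesting rather than the inner model is the point), the inner CV chooses the metaparameter and the outer, fully held-out folds estimate predictive quality.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Centred ridge regression (a stand-in for the inner PLS model)."""
    xm, ym = X.mean(axis=0), y.mean()
    Xc = X - xm
    coef = np.linalg.solve(Xc.T @ Xc + lam * np.eye(X.shape[1]), Xc.T @ (y - ym))
    return coef, xm, ym

def inner_cv_choose(X, y, grid, folds=4):
    """Inner CV: pick the metaparameter with the lowest validation error."""
    idx, errs = np.arange(len(y)), []
    for lam in grid:
        e = 0.0
        for f in range(folds):
            val = idx[f::folds]
            tr = np.setdiff1d(idx, val)
            coef, xm, ym = ridge_fit(X[tr], y[tr], lam)
            e += ((y[val] - (ym + (X[val] - xm) @ coef)) ** 2).sum()
        errs.append(e)
    return grid[int(np.argmin(errs))]

rng = np.random.default_rng(8)
X = rng.standard_normal((90, 15))
y = X[:, 0] - X[:, 1] + 0.2 * rng.standard_normal(90)

# Outer loop: each sample enters the test set exactly once; the test fold
# plays no part in fitting or in metaparameter selection.
idx, press = np.arange(len(y)), 0.0
for f in range(5):
    test = idx[f::5]
    rest = np.setdiff1d(idx, test)
    lam = inner_cv_choose(X[rest], y[rest], [0.1, 1.0, 10.0, 100.0])
    coef, xm, ym = ridge_fit(X[rest], y[rest], lam)
    press += ((y[test] - (ym + (X[test] - xm) @ coef)) ** 2).sum()

q2_outer = 1 - press / ((y - y.mean()) ** 2).sum()
print(q2_outer)    # an honest, fully independent estimate of predictive quality
```

Because the outer test folds never touch either model fitting or metaparameter selection, the resulting Q² is the independent estimate that the 2CV scheme is designed to deliver.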
In the quality-of-fit (FIT) model, the single CV basis of the above 1CV technique seeks the optimum number of PLS components, and a PLSDA model is then constructed from all samples available with this optimal number. Subsequently, the classification groups of all of these samples are determined (or estimated) with this particular model; this, however, represents a resubstitution rather than an acceptable prediction process, and in this manner Q^{2}_{FIT}, the number of misclassifications, and AUROC values may be obtained.
Therefore, overall the variability of the estimated parameters, and their influence on the model's predictive capacity, are evaluated via the 1CV method. However, the 2CV technique offers advantages since it also provides an assessment of the variability of the metaparameters and their overall contribution towards the predictions obtained; the classifications predicted for the samples analysed are only completely independent of the remaining dataset when this particular technique is utilised.
Figure 1.8 shows a PLSDA analysis of a very large thyroid disease dataset comprising a series of explanatory variables, including the blood serum concentrations of thyroxine (T4) and thyroidstimulating hormone (TSH). Results obtained revealed very clear distinctions between the three classes of disease [healthy controls (euthyroid), hypothyroid and hyperthyroid patients]; the validation process involved the prior removal of approximately onethird of the samples, and the PLSDA model was then built on the remaining twothirds of them. Validation of the model in this manner gave rise to an excellent agreement between the predicted sample identities and their known ones (mean classification rates of 100% for euthyroid and hyperthyroid patients, and 98.9% for the hypothyroid group).
Also shown are results derived from a Partial Least Squares-Regression (PLSR) model in which the hypothyroid, euthyroid and hyperthyroid disease classifications were assigned arbitrary scores of −1, 0 and +1, respectively. Again, the model evaluated demonstrated an excellent predictive capacity.
1.3.4 Attainment of the Final Calibration Model
Selection of the predictive capacities of each of the separate testing systems generates a range of somewhat differing models, with differing numbers of 'biomarker' variables and perhaps also components; these arise from the random allocation of specimens to the 'training', 'validation' and, where appropriate, 'test' sets, each of which contributes differently towards each parameter evaluated. This approach therefore serves to complicate the optimisation of a final 'diagnostic' model system.^{14 } Indeed, the precision of such a final predictive model system should always be greater than those of the models generated during the CV regimen (i.e. those developed on sample subsets), and hence at this stage the full applicability of the final calibration system is not required.
Notwithstanding, as an alternative to such a single final model, a whole series of these (one per test set) can be made available in order to metabolomically classify future test samples, so that a group of possible classifications is available for each one, rather than a single value [such computations are likely to involve the consideration of further lateral 'predictor' variables such as gender, age, BMI and, where appropriate, length of treatment (if any), etc.]. Hence, this group can serve to provide 'mean level' predictions based on the individual predictive models developed; appropriate confidence intervals (CIs) for the overall predictive capacities can then be derived, and their stabilities monitored throughout the research work performed. Indeed, the 'bagging' procedure of Breiman^{21,22 } is of much relevance here.
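The pooling of a series of sub-models into a 'mean level' prediction with a CI can be sketched in the spirit of Breiman's bagging. All data below are simulated and illustrative, and a least-squares regression onto the class codes again stands in for each PLS-DA sub-model; only the resample-fit-pool structure is the point.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical calibration set: 25 samples per class, 4 'biomarker' X
# variables, class codes -1/+1, plus one future 'test' specimen.
X = np.vstack([rng.normal(0.0, 1.0, (25, 4)),
               rng.normal(1.5, 1.0, (25, 4))])
y = np.array([-1.0] * 25 + [+1.0] * 25)
x_new = np.append(rng.normal(1.5, 1.0, 4), 1.0)   # with intercept term

# Breiman-style bagging: fit one simple model per bootstrap resample
# (least-squares onto the class codes as a stand-in for each PLS-DA
# sub-model), then pool the individual predictions.
A = np.hstack([X, np.ones((len(y), 1))])          # add intercept column
preds = []
for _ in range(200):
    b = rng.integers(0, len(y), len(y))           # bootstrap indices
    coef, *_ = np.linalg.lstsq(A[b], y[b], rcond=None)
    preds.append(x_new @ coef)
preds = np.asarray(preds)

mean_pred = preds.mean()
lo, hi = np.percentile(preds, [2.5, 97.5])        # approximate 95% CI
print(f"'mean level' prediction: {mean_pred:.2f}, 95% CI: [{lo:.2f}, {hi:.2f}]")
```

The spread of the bootstrap predictions directly quantifies the model-to-model variability that a single final calibration model would conceal.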
1.3.5 Quality Evaluation Processes
Since there is currently a range of criteria employed to aid determinations of a preselected classification of 'unknown' samples, including percentage classification successes based on the confusion matrix (consisting of the numbers of false-positives and -negatives, together with true positives and negatives^{14 }), it is necessary for us to be clear about the particular measures adopted for this purpose, and also their possible influence (facilitatory or adverse, for that matter) on the classification of samples collected from patients participating in future clinical or metabolomics investigations. For a particular class of disease-positive participants, the proportion/percentage of true positives is known as the sensitivity, whilst that of false-positives is referred to as the (1 − specificity) parameter, and a combination of these two criteria gives rise to the so-called Receiver Operator Characteristic (ROC) curve. Indeed, the ROC curve comprises a plot of sensitivity versus (1 − specificity), and this relationship is often employed to determine the successful (or unsuccessful) performance of a clinically-relevant MV (e.g. a biomolecular concentration index) dataset or alternative measurement system. Of course, sensitivity is defined as the number of correct (true) positives found, expressed as a percentage of all the available positives (i.e. those with a particular disease classification or, alternatively, response to a particular treatment, etc.). Sensitivity values lie between 0 and 1, with 0 representing no success whatsoever, and 1 a 100% classification rate. The (1 − specificity) index, however, represents the number of false-positives expressed as a percentage of all such negative (disease-free) values (i.e. those for a 'control', healthy participant dataset).
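The two definitions above reduce to simple ratios over the confusion matrix. The counts used below are purely illustrative, not taken from any dataset in this chapter.

```python
# Sensitivity and (1 - specificity) from a confusion matrix, per the
# definitions given above (all counts below are illustrative only).
tp, fn = 45, 5        # disease-positive group: true positives, false negatives
tn, fp = 52, 8        # disease-free group: true negatives, false positives

sensitivity = tp / (tp + fn)              # true-positive rate
one_minus_specificity = fp / (fp + tn)    # false-positive rate

print(f"sensitivity = {sensitivity:.3f}")              # -> 0.900
print(f"1 - specificity = {one_minus_specificity:.3f}")
```

Each point of a ROC curve is exactly such a (1 − specificity, sensitivity) pair, recomputed as the classification boundary is swept across the range of model scores.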
For an effective and reliable model system, sensitivity values should be close to 1.0, although the specificity should also be close to this particular extreme value, so that (1 − specificity) remains close to 0 [the classification boundary preset by the investigator(s) determines the overall specificities and sensitivities of the model system tested]. A modification of these selection parameters may give rise to an elevation in the number of true positives, although the number of false-positives will also be enhanced, and vice-versa. Hence, the classification boundary of the model system tested determines the effectiveness of a ROC curve, which delineates both the specificities and sensitivities of models with perhaps differing classification barriers or thresholds.
Indeed, we may select values other than 0 as the classification boundary cut-off value, and the choice of a slightly lower value may increase the sensitivity of the +1 (disease-positive) group, although this is inevitably coupled to an elevation of the (1 − specificity) parameter. Therefore, the overall classification quality measurement utilised is the area under the ROC curve (AUROC) value, which is 1.0 for an ultimately perfect class distinction, and 0.50 if there is absolutely no separation detectable or present.
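The AUROC can be computed without tracing the curve at all, via its probabilistic interpretation: it equals the probability that a randomly chosen disease-positive sample receives a higher model score than a randomly chosen disease-free one. The scores below are illustrative values only.

```python
import numpy as np

# AUROC from model scores: the fraction of positive/negative pairs in
# which the positive sample scores higher (ties count one half).
neg = np.array([-0.8, -0.6, -0.3, 0.1, -0.9])   # disease-free samples
pos = np.array([0.7, 0.4, -0.2, 0.9, 0.6])      # disease-positive samples

pairs = [(p > n) + 0.5 * (p == n) for p in pos for n in neg]
auroc = np.mean(pairs)
print(f"AUROC = {auroc:.2f}")                    # -> 0.96
```

Here only one of the twenty-five pairs is misordered (the positive scored −0.2 versus the negative at +0.1), so the AUROC of 0.96 sits close to, but short of, the perfect value of 1.0.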
Q^{2} values, however, represent predictive capacity default parameters, which are commonly employed in PLS-DA investigations, and are targeted at determining the efficacies of classification label predictions from newly derived datasets. Q^{2} is defined in eqn (2), in which SS represents the mean-corrected sum-of-squares of the Y classification codes for each PLS component obtained, and PRESS the sum of squared differences between the observed and predicted Y values for all biofluid or tissue biopsy specimens incorporated into the test system (further details are provided in Appendix I). As might be expected, the optimal value of Q^{2} (1.0) is extremely difficult to attain in practice, in view of the considerations that (1) it requires the classification prediction of all such samples to be exactly equivalent to their class labels, and (2) there is always inherent variation (perhaps 'Between-Participants', 'Between-Samples-within-Participants' and/or further latent or 'hidden' variables) nested within the same classification criterion. Therefore, the Q^{2} value derivable depends not only on the 'Between-Classifications' variability, but naturally also on the 'Between-Samples-within-Classifications' one; this renders it somewhat difficult to achieve a Q^{2} value which is representative of a high classification prediction capacity, and hence it is highly recommended to employ a series of permutation tests in order to evaluate the distributional status of such model-dependent Q^{2} values in the complete absence of any influential or constraining effects exerted by two (or more) classification criteria, which may (or may not) exert significant effects on this random permutation distribution.
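The recommended permutation strategy can be sketched as follows: the observed Q^{2} is compared against the distribution of Q^{2} values obtained after randomly shuffling the class labels, which destroys any real X–Y relationship. The data are simulated, and a leave-one-out least-squares regression stands in for the cross-validated PLS-DA model; only the permutation logic is the point.

```python
import numpy as np

rng = np.random.default_rng(2)

def q2_loo(X, y):
    """Leave-one-out Q^2 = 1 - PRESS/SS for a least-squares stand-in model."""
    A = np.hstack([X, np.ones((len(y), 1))])      # intercept column
    press = 0.0
    for i in range(len(y)):
        keep = np.arange(len(y)) != i
        coef, *_ = np.linalg.lstsq(A[keep], y[keep], rcond=None)
        press += (y[i] - A[i] @ coef) ** 2
    ss = np.sum((y - y.mean()) ** 2)
    return 1.0 - press / ss

# Hypothetical two-class data containing a genuine class difference.
X = np.vstack([rng.normal(0.0, 1.0, (20, 3)),
               rng.normal(1.5, 1.0, (20, 3))])
y = np.array([-1.0] * 20 + [+1.0] * 20)

q2_obs = q2_loo(X, y)

# Null distribution: Q^2 recomputed after randomly permuting the class
# labels, i.e. under the hypothesis of no 'Between-Classifications' effect.
q2_null = [q2_loo(X, rng.permutation(y)) for _ in range(50)]
p = np.mean([q >= q2_obs for q in q2_null])       # empirical p-value
print(f"observed Q2 = {q2_obs:.2f}, permutation p-value = {p:.2f}")
```

A model worth reporting should have an observed Q^{2} sitting well above the permuted (null) distribution; in practice many more than 50 permutations would be run.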
However, since the AUROC value, and also the number of misclassifications found, reflect simple 'extent of classification' error measurements, and only serve to inform us of the numbers correctly and incorrectly classified, they are clearly of less value than a permuted distribution of values which arises from the null hypothesis of no effects exerted by the 'Between-Classification' factor or factors. Indeed, Q^{2} is a prediction error measure that, perhaps fortunately, is able to distinguish between degrees of misclassification; for example, a class prediction value of +0.90 is penalised more than one of +0.60 for a correct class label of 0, i.e. some estimated classification status values are more equal than others! Notwithstanding, the AUROC and number of misclassification measures noted above view such prediction errors as exactly the same – i.e. in these cases, all incorrectly classified errors are equal!
In view of the large number of variables available to classify the disease (or alternative) status of biofluid or tissue biopsy specimens, the MV metabolomics data analysis arising therefrom remains a highly complex process. Indeed, there remains a very wide range of modelling solutions available to effectively 'solve' these problems, and hence 'overfitting' is a very common problem in the scientific literature, i.e. the model employed appears to classify 'training' datasets very efficiently, but its application to samples collected in future, corresponding investigations has a very poor or perhaps virtually zero classification ability (please note the examples given above in Sections 1.3.1 and 1.3.2)! Clearly, such studies offer opportunistic, highly presumptive and largely hypothesis-driven arguments, which eventually fail to deliver the high level of merit proclaimed from the original MV modelling experiments performed. Indeed, as a highly typical example, the PLS-DA scores plots, which are documented in a very significant proportion of disease status classifications in metabolomics-based publications, may represent highly exaggerated or over-optimistic visions of such classification differences (however, they may reveal some level of significant 'within-classification' differences, which perhaps were unknown to the investigators prior to performing the analysis). Indeed, results arising from putative classification studies of this nature should employ predictions rather than fitted values (e.g. PLS-DA scores) as a foundation, and the failure to perform one or more of the validation, CV and corresponding permutation monitoring of the dataset acquired will not provide researchers with a high level of confidence regarding the results acquired!
1.3.6 Cost-Benefit Analysis (CBA)
Briefly, this procedure can be performed in order to select the optimal number of 'biomarker' variables, and also to determine the diagnostic benefit of adding further ones (although the cost of, for example, employing 30 rather than 5 biomarker variables could represent a 6-fold increase, the diagnostic benefit derived may be limited, with improvements of perhaps only a few per cent in terms of those correctly classified). Indeed, successful models may be formed on only the top 5 or so ranked explanatory (X) metabolic predictor (biomarker) variables, or even fewer.
In this work the author utilised XLSTAT2013, MetaboAnalyst 2.0, MetATT and ACD Spectrus Processor 2013 software.
Appendix I
Partial Least Squares-Discriminatory Analysis (PLS-DA)
Partial least squares-discriminatory analysis (PLS-DA) represents a regression-extended form of PCA, which involves the derivation of latent variables (analogous to principal components) which maximise the covariation between the monitored dataset(s) (i.e. conventional or 'intelligently selected' ^{1}H NMR spectral bucket areas) and the response variable(s) against which it/they is/are regressed. PLS-DA represents a special form of PLS data modelling which, in the case of a significant discriminant function, has the ability to distinguish between known or established classifications of samples in a calibration set, and is focused on seeking a range of discriminatory variables and directions in a greater than bivariate (i.e. multivariate) space. This procedure involves the computation of an indicator matrix of potential classification (predictor X) variables for each classification group incorporated in the calibration dataset [for a two-classification system, each group may be assigned a value of 0 or 1 (or −1 and +1) according to which particular class a study participant who provides a 'diagnostic' biofluid or tissue biopsy sample belongs].
Like PCA, Partial Least Squares (PLS) performs a dimensionality reduction of the X matrix, but also relates the X variances to that of Y contained in a Y response matrix. The matrices are simultaneously decomposed, exchanging respective scores information so that the technique maximises their covariance. Components that successively maximise any remaining covariance are then generated, the optimal number defining the model dimensionality. A PLS-DA analysis involves a Y matrix containing class information (and hence it is a supervised technique), the biomolecule concentrations or proportional NMR, LC-MS or GC-MS intensity measurements (X matrix) being related to nominal categorical codes (Y column dummy matrix) by an equivalent correlation matrix B [eqn (1)]; the analysis can therefore maximise the correlation (or covariance) between X and Y. The X and Y matrices are converted to eqns (2) and (3), in which T and P represent the scores and loadings matrices for X, respectively, U the corresponding Y scores matrix, Q^{T} the Y weighting matrix, and E and F the residual matrices which accommodate information not related to X/Y correlations. The X weights w (describing the variation in X correlated to the Y class information, i.e. through their covariance, as well as information on the variation in X not related to Y) are also employed for calculating T [eqn (4)]. The W^{*} matrix is transformed from the original W matrix so that it is PLS component-independent, since the X scores T are linear combinations of the X variables, and when multiplied by P they will essentially return the original variables (with small E values). Equations (2) and (4) can then be combined to yield eqn (5) [i.e. a modified form of eqn (1) which allows for residuals], in order to set up the regression model according to eqn (6).
A range of output parameters can be generated from PLS analytical software packages, including goodness-of-fit parameters such as the fraction of the mean-corrected sum-of-squares (SS) of the Y codes explained for each generated PLS component, i.e. R^{2} [eqn (7)], where RSS represents the fitted residual sum of squares, i.e. the sum of the squared differences between the observed and fitted y values [eqn (8)].
The presence of many, potentially highly correlated, X predictor variables indicates the possibility of data overfitting, and hence there is a requirement to test the model's predictability for each PLS component. Model validation through deduction of the number of significant PLS components can be performed via a 'leave-one-out' CV method, in which the data for one sample are removed from the model, and the predicted classification groups or analogous Y value codes are then compared with those of the removed sample, the process being repeated until all samples have been left out once. The predictive residual sum of squares (PRESS) is the sum of the squared differences between the observed and predicted y values for the CV process [eqn (9)], and the fraction of the total variation in the Y codes that can be predicted by each PLS component is defined by Q^{2} [the 'cross-validated R^{2} value', eqn (10)].
The number of components that yields a minimum computed PRESS value (within a limit of 5% between each subsequent component) is noted, and this number can then be preset in the developing model program for further computations.
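This stopping rule can be expressed as a short helper: keep adding PLS components only while each additional one reduces PRESS by more than 5%. The function and the PRESS values below are illustrative, not part of any particular software package.

```python
# Select the model dimensionality from per-component PRESS values: stop
# once the next component fails to reduce PRESS by more than 5%.
def n_components(press, tol=0.05):
    n = 1
    for k in range(1, len(press)):
        if press[k] < (1.0 - tol) * press[k - 1]:
            n = k + 1          # component k+1 still improves PRESS enough
        else:
            break
    return n

press = [4.2, 2.1, 1.4, 1.35, 1.33]    # hypothetical PRESS per component
print(n_components(press))              # -> 3 (the 4th component gains < 5%)
```

Here the fourth component only lowers PRESS from 1.40 to 1.35 (about 3.6%), so the model dimensionality would be fixed at three components.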
Appendix II
Brief Summary of Further Forms of Discriminatory Analysis (DA) Available
There is a wide variety of approaches available for discriminant analysis, and many of them are still not well established amongst the metabonomics research community. These are of two forms: one-class and two-class classifiers. One-class classifiers allow us to build models of varying complexity around each class separately (for example, between two or more disease classification groups to be examined) from, for example, their ^{1}H NMR spectral profiles, so that researchers can predict whether a patient has a disease, and/or belongs to a disease sub-group, to a given specified level of confidence. They also permit us to determine how well modelled a particular class is. Using Receiver Operator Characteristic (ROC) curves, prediction thresholds can be computed in order to determine optimum conditions for the minimisation of false-negatives or false-positives. Two-class classifiers, however, attempt to form a 'hard boundary' between two (or more) classes (samples close to the boundary are somewhat ambiguous and therefore difficult to classify), and for each sample a model stability can be determined. In metabolomics investigations, a variety of statistical methods for validation can be utilised in order to ensure that the models are sound; employable methods include Linear Discriminant Analysis, Quadratic Discriminant Analysis, Partial Least Squares-Discriminant Analysis, Learning Vector Quantisation and Support Vector Machines (SVMs), in one- or two-class formats (where appropriate).