Chapter 1: Univariate and Multivariate Statistical Approaches to the Analysis and Interpretation of NMR-based Metabolomics Datasets of Increasing Complexity
Published: 08 Dec 2020
Special Collection: 2020 ebook collection
B. Percival, M. Gibson, J. Leenders, P. B. Wilson, and M. Grootveld, in Computational Techniques for Analytical Chemistry and Bioanalysis, ed. P. B. Wilson and M. Grootveld, The Royal Society of Chemistry, 2020, ch. 1, pp. 1-40.
Historically developed combinations of advanced statistical analysis and analytical/bioanalytical chemistry have been vital to the interpretation and understanding of the significance of results acquired in research (both natural sciences and clinical) and industry, with applications in numerous fields, including the biomedical sciences, healthcare and the environmental sciences. Herein, multicomponent nuclear magnetic resonance (NMR) analysis is used as a model to delineate how advanced statistical tools, both univariate and multivariate, can be implemented to effectively perform complex spectral dataset analyses in metabolomic applications, and to provide valuable, validated conclusions therein. Computational techniques are now embedded into spectral interpretation from an analytical chemist's perspective. However, there are challenges to applying such advanced statistical probes, which will be explored throughout this chapter.
1.1 Introduction
Although some statistical approaches were developed much earlier, such as the pioneering Bayesian statistics conducted in the 18th century,1 the interdisciplinary relationship between science and statistics took far longer to become fully established. The strong affinity between statistics and science that exists at present dates back to the late 19th and early 20th centuries. Works by Karl Pearson and Francis Galton explored regression towards the mean, principal component analysis (PCA), and Chi-squared contingency table testing and correlation.2 Later, Spearman developed his theory of factor analysis, together with Spearman's rank correlation coefficient, and applied it to research in the social sciences.3 William Gosset was responsible for the discovery of the t-test, which is embedded in most statistical testing applied in scientific fields to date, and which unfortunately remains the most widely abused and misapplied univariate statistical test.4 Ronald Fisher tied the aforementioned ideas together, observing that the Gaussian distribution underlies both the chi-squared and t-distributions, and formulated the F-distribution test. Fisher later developed analysis-of-variance (ANOVA), and defined p-values for determining the level of statistical significance.5 Fisher furthered these works by applying his knowledge to genetics, in particular the observation of alleles, specifically the frequency and estimation of genetic linkages by maximum likelihood methods within a population.6 Basic statistical hypotheses, such as H0 and H1, were then established,7 and these remain fundamental to all experimental designs.
These statistical tools are now applied in just about every field possible; indeed, in science every research area has an element of statistical interpretation, from genomics in disease diagnostics, forensic science, hereditary studies and the microbiome, to the discovery of biomarkers using biological immunoassays and ‘state-of-the-art’ bioanalytical techniques. In this chapter, modern advanced statistical methodologies will be explored through a major, now commonly employed multicomponent analytical/bioanalytical chemistry technique, namely nuclear magnetic resonance (NMR) spectroscopy. Statistical approaches and challenges associated with the collection of large or very large NMR datasets run in parallel with those of other multicomponent analytical techniques, such as liquid chromatography–mass spectrometry (LC–MS), and hence the NMR examples provided serve as appropriate test models. Despite some problems occasionally being experienced with low sensitivity (especially at biofluid concentrations <10 μmol L−1) and untargeted analyses, in which xenobiotic resonances may overlap with those of endogenous metabolites, NMR provides many laboratory advantages over LC–MS in view of its high reproducibility, non-destructive methodology with minimal sample preparation, and the simultaneous detection and determination of 100 or more metabolites at operating frequencies greater than or equal to 600 MHz.8 NMR is a suitable analytical technique across many different fields and sample types, including food chemistry, geological studies, drug discovery, forensics and an increasingly expanding number of metabolomics areas, for example lipidomics and fluxomics. Both solid and liquid samples can be analysed, which is obviously advantageous. Moreover, major recent advances in computational capability have enhanced the applications of metabolomics-linked statistical approaches, and current software modules and their applications will also be discussed here.
Indeed, NMR as a technique advanced much later than these statistical methods; the combination of statistical tools with acquired NMR data therefore occurred subsequently. Nuclear spin was first described by Pauli in 1926,9 and these ideas were further developed by Rabi in 1939, who performed the first radiofrequency resonance experiments of this kind using a beam of lithium chloride (Figure 1.1).10 Overhauser in 1953 observed dynamic nuclear magnetisation.11 Redfield then explored the theory of NMR relaxation processes,12 and NMR was developed from these principles by Bloch and Purcell from 1945–1970, the two sharing the 1952 Nobel Prize in Physics.13,14 Continuous wave (CW) methods were used to observe nuclear spins, experiments which use a permanent magnet/electromagnet and a radiofrequency oscillator to produce two fields, B0 and B1 respectively. To achieve resonance, CW methodologies varied either the B1 field or the B0 field. In essence, the magnetic field is continuously varied and the peak signal (resonance) is recorded on an oscilloscope or an x–y recorder. However, these methodologies have substantially advanced, and at present radiofrequency (B1) pulse sequences are applied to nuclei held within a static magnetic field, B0.
A range of NMR facilities are currently available, with the highest operating frequency reaching 1200 MHz, which requires a Gauss safety line and regular cryogen fills, in addition to more accessible permanent-magnet, non-cryogen-requiring benchtop instruments operating at frequencies of up to 80 MHz. Progress in low-field NMR spectroscopy and its analytical applications, albeit very uncommon in metabolomic applications, has recently been reviewed.15 A plethora of spectrometers exist between these two extreme frequencies, and these all have the capacity and capability to acquire a wide range of molecular analyte data, which can subsequently be employed for statistical evaluations and comparisons. Biological fluids have been examined using both low-16 and high-frequency17 NMR technologies for the monitoring of a range of endogenous metabolites and xenobiotics, although the use of high-frequency spectrometers is often the preferred approach because of the much-enhanced level of resolution and deconvolution of NMR signals. An example of the differences in resolution observed between low- (60 MHz) and high-field (400 MHz) NMR analysis is shown in Figure 1.2.
The majority of biological fluids, such as urine and saliva, are predominantly composed of water, and therefore appropriate pulse sequences have been developed to suppress the water resonance in order to focus on those of interest. Pulse sequences such as the water excitation technique (WET), nuclear Overhauser effect spectroscopy (NOESY),18 PRESAT and WATERGATE19 are highly suitable for the analysis of spectra containing the broad signal arising from the 1H nuclei of H2O. The water signal can be irradiated at its characteristic frequency, rendering it unable to resonate, and this strategy serves to reveal metabolites at lower concentrations that are located at similar frequencies (chemical shift values, δ).
Furthermore, all biological fluids such as blood serum or plasma contain low-molecular-mass metabolites in addition to large macromolecules, usually proteins and lipoproteins. The metabolite signals are superimposed on the broad signals of the macromolecules, leading to signal loss and broadening. In this specific case, applying a Carr–Purcell–Meiboom–Gill (CPMG) pulse sequence makes it possible to overcome this problem by exploiting the differences in the relaxation properties of metabolites and macromolecules. This sequence removes the fast-relaxing signals arising from large macromolecules, such as proteins, from the spectrum by applying a spin-echo pulse train.20
Analysis of biofluids has been used for both pharmacokinetic and biomarker detection in many studies, and requires multidimensional data analysis using highly sophisticated, but nevertheless comprehensive, statistical techniques.21
1.2 Brief Introduction to Metabolomics
Although the metabolome was first explored centuries ago through organoleptic analysis, such as the smelling of urine as a means of diagnosing diseases, applications of bioanalytical techniques to analyse and determine molecules in urine were first performed by Pauling et al. (1971),22 and the human metabolome was described by Kell and Oliver (2016).23 Pioneering work by Nicholson's and Sadler's groups first detected drug metabolites using 1H NMR analysis in the early 1980s, observing acetaminophen and its corresponding metabolites.24 In addition, these groups were the first to monitor metabolic conditions in urine samples.24,25 Over the last 20 years, the metabolome itself, and the means for the application of multianalyte bioanalytical techniques to it, have been further defined by others such as Nicholson and Lindon.26 Many developments have been explored in the literature, with a progressive history of the techniques employed for metabolomics studies, along with their establishment as key investigatory tools, as documented by Oliver and Kell.23 The study of the metabolome allows for the monitoring of metabolic processes within a system by identifying and determining low-molecular-mass metabolites within biofluids, tissues or cells. Indeed, the metabolome can be affected by many biological processes. These can arise either through external stimuli, such as an intervention (a medication, diet or exercise regimen, for example), or alternatively through internal stimuli. An internal stimulus can be introduced via the modification of gene expression using techniques such as cell transfection, both in vivo and in vitro. Moreover, metabolomics techniques assess these changes by providing a ‘snapshot’ of the status of biological processes occurring at a specific point in time. This responsive information can provide a high level of detail regarding cellular metabolic processes, can facilitate phenotypic evaluations, and hence can yield an overall ‘picture’ or ‘fingerprint’ of the chemopathological status of a disease. Even more valuably, metabolomics is able to probe changing disease status, for example the effects of a drug treatment, the removal of a tumour, the regression of an inflammatory condition, and so forth. Hence, these strategies may be successfully employed to monitor the severity and progression of a disease; information which further increases our understanding of the aetiology, together with the manifestation and progression, of particular conditions.27
Two approaches can be undertaken in metabolomics; they are chosen primarily by the objectives of the study and the hypotheses formulated:28
A targeted approach focused on the quantitative analysis of a limited number or class of metabolites that are linked by one or more specific biochemical pathways. This makes it possible to compare variations in metabolites in a precise and specific way.
The objective of an untargeted approach is to detect “all” metabolites present in biological samples. It involves simultaneous analysis of as many metabolites as possible.
Targeted analysis is usually performed by mass spectrometry (MS), or more specifically by MS as a detection system for liquid- or gas-chromatographically-separated metabolites; its sensitivity renders it an excellent method for the quantification and identification of a set of specific metabolites. By not requiring prior knowledge of the compounds present in a sample, and by focusing on the global metabolic profile, NMR is ideally suited to an untargeted approach, although its straightforward quantification capacity also allows it to be used in a targeted manner for selected metabolites.
NMR is able to detect and quantify metabolites in a high-throughput, simultaneous, non-destructive manner and requires minimal sample preparation. LC–MS strategies can target specific metabolites more sensitively than NMR, whilst lacking the specificity of the latter for absolute metabolite determinations.29 Therefore, a combination of both LC–MS and NMR methodologies is important for a global understanding of the metabolic effects involved in disease processes, and can be essential for thorough metabolite determinations; a description of the full strengths and weaknesses of these techniques can be found in Nicholson and Lindon.26
A number of other analytical techniques can be applied, such as Fourier transform infrared spectroscopy (FTIR), ultraviolet-visible spectroscopy (UV–Vis) and derivatives of NMR and LC–MS.30 However, their applications remain limited and specific, each one having its own advantages and limitations.
When handling colossal and complex datasets from untargeted metabolomics approaches, which include a plethora of metabolites at a range of concentrations, it can be difficult to interpret and understand the significance of the datasets acquired. Multivariate (MV) statistics have been integrated into multicomponent NMR analysis in view of the number of datapoints produced from the output when analysing complex mixtures such as biofluids. MV statistics aid the processing of large datasets into visual formats, which are more readily comprehensible to the analyst via the recognition of patterns or signatures in multianalyte datasets. The combination of a scientific technique with MV and univariate statistical analysis strategies in this manner is termed ‘chemometrics’.
1.3 Experimental Design for NMR Investigations
Any metabolomic approach is usually initiated by one or more biological/clinical questions to which the clinician or the biologist wishes to respond. Whether monitoring the effects of treatment or improving diagnostic tools, the experimental design must be carefully considered by assessing all sources of variation in order to reduce bias and avoid the introduction of irrelevant variability. Samples should be collected with appropriate research ethics approval and the informed consent of all participants (both healthy and diseased, where relevant), and in metabolomics studies it is important to maintain a consistent cohort. Physiological factors, such as age, fasting or non-fasting protocols, gender, diet, physical activity, medical conditions, family medical history, genotype, and so forth, should all be taken into account prior to sample collection.8 Moreover, analytical factors should also be considered; samples should be collected in a uniform manner, for example using the same collection tubes, whilst remaining mindful that some collection vessels may contain agents giving rise to NMR resonances which interfere with the spectra acquired. This is common with anticoagulants used when collecting blood samples; ethylenediaminetetraacetic acid (EDTA), for example, will provide not only the signals of the EDTA chelator itself, but also those of its slowly-exchanging Ca2+ and Mg2+ complexes.31 Likewise, the citrate anticoagulant also gives rise to intense 1H NMR signals. Lithium heparin tubes are often recommended for plasma collection in order to avoid such interfering signals. Contamination is possible in the early stages of sample collection, and hence sterility must be ensured; this also needs to be considered during sample transportation.
Sample stability is also an important experimental factor. Some samples are unstable when exposed to ambient temperature, which can occur if samples are on the autosampler. This can cause degradation, changes in the concentration or a complete loss of metabolites. Common pitfalls of biofluid storage include microbiological contamination of samples at ambient or even lower temperatures. Biological fluids should be stored at low temperatures, typically −80 °C prior to analysis.32 Sodium azide can be added to samples to ensure that microbes do not infiltrate, grow and interfere with metabolite levels in the sample whilst samples are maintained at ambient temperature on an NMR sample belt.33 Furthermore, freeze–thaw cycles should be minimised; indeed, it has been shown that no more than three freeze–thaw cycles are suitable for plasma sample analysis.32
A suitable internal or external standard will provide a reference signal outside the area of analyte interest without interfering with the sample. For example, an internal standard, predominantly sodium 3-(trimethylsilyl)[2,2,3,3-d4]propionate (TSP), can be added to samples of saliva and urine. However, TSP can bind to proteins in plasma, serum, and synovial and cerebrospinal fluid samples,34 and therefore is not added in these cases. A more suitable internal standard, such as 4,4-dimethyl-4-silapentane-1-ammonium trifluoroacetate, which has been shown to have limited interactions with proteins and has also been proposed as a standard that does not interact with cationic peptides,35 can be added instead, or alternatively an external standard contained within a capillary can be used. Other useful internal standards which have been used for C2HCl3-based biofluid and tissue biopsy lipid extracts include 1,3,5-trichlorobenzene and tetramethylsilane, although the latter is not recommended as it readily evaporates. Electronic Reference To access In vivo Concentrations (ERETIC) can also be used as an electronic standard for biofluid or tissue extract NMR samples.34
Buffering of the sample is also important in NMR as changes in the pH can modify the chemical shift values.34 Indeed, biofluid pH values vary significantly between sample classes in vivo. Biofluids are typically buffered to pH values of 7.0 or 7.4.34 For example, in blood plasma the 1H NMR profiles of histidine, tyrosine and phenylalanine are all affected by pH, and NMR-invisibility of both tyrosine and phenylalanine is possible at neutral pH in view of their binding to albumin.36
Appropriate extraction techniques can be performed for solid samples such as leaves, seeds, biological tissues, cells, foods, drugs and so forth. However, it is critically important that no solids are retained in liquid sample extracts as this will interfere with the homogeneity of the magnetic field. Freeze-drying and/or drying with liquid nitrogen, followed by vortexing and centrifuging, are often necessary to ensure there is no retention of any solid sample debris.34
Acquisition parameters should, of course, be maintained as uniform throughout the acquisition process; full recommendations for such parameters are available in Emwas et al.33 The temperature in the NMR room and, more importantly, within the NMR probe should be consistent. Pulse parameters, from the number of scans to the acquisition time and number of acquisition points, should be kept constant. The NMR instrument should be shimmed, tuned and matched appropriately.33 Occasionally, backward linear prediction (BLIP) may be appropriate in order to remove artefacts in the spectra, and forward linear prediction (FLIP) when the acquisition time is too short and gives rise to a truncated free induction decay (FID); both processes result in improved resolution. In metabolomics studies, these need to be applied consistently throughout the post-processing stage in order to ensure that signals are not present as part of a “ringing pattern” or noise.
The experimental design ensures that the results acquired are indeed statistically significant and do not arise from errors in the early sampling stages. Guidelines for urinary profiling have been established in the literature.33 However, there is no harmonisation throughout the field in view of the ranges of NMR field strengths and experimental parameters employed, such as differing pulse sequences.
1.4 Preprocessing of Metabolomics Data
Before performing statistical analyses on spectral data, it is important to apply several pre-treatment steps that will ensure the quality of the raw data and limit possible biases. Indeed, in view of possible imperfections in the acquisition (noise), in the signal processing, and in the intrinsic nature of the biological samples (such as dilution effects between samples), it is very often necessary to apply some correctional measures to the spectra acquired.37 Different methods of treatment can be used, and each has its advantages and disadvantages. The choice of a method depends on the biological issue to be addressed, on the properties of the samples analysed, and on the methods of data analysis selected. Most of these post-processing steps are applicable not only to NMR datasets, but also to LC–MS ones.
(A) Raw NMR spectral data acquisition
One of the crucial steps is to ensure appropriate dilution of the metabolites so that the internal standard, if one is used, can be referenced. Furthermore, internal standards which are much lower in concentration than the monitored metabolites may indeed give rise to inaccurate results. Water suppression may also dampen the intensities of signals present in close proximity to the water signal; for example, it has been shown that the α- and β-glucose proton resonances can both be significantly, or even substantially, affected by these suppression techniques if the power is not adjusted accordingly when using certain pulse sequences, for example NOESY PRESAT.16,33,38 Further important quality-control assessments may include the recognition of drug signals, their corresponding metabolites, alcohol, and so forth, which are commonly found in biofluid matrices. These resonances, if not properly identified, may interfere and lead to false levels of statistical significance, misassignment of biomolecule signals, and drug-induced modifications to the biofluid profiles explored. Therefore, positive signal identification is important in ensuring valid statistical significance, and will be discussed in greater depth below.
(B) Phase and baseline corrections
Phase correction is crucial in order to ensure that signals are uniform, and no negative signals or baseline roll are present. These can cause elevated or decreased bucket intensity regions which could inflate the degree of statistical significance. Baseline correction can also ensure accurate signal integration.33
(C) Alignment
A regular problem encountered during data processing is a signal shift between the NMR profiles of different samples. Several parameters can influence these peak shifts: instrumental factors, pH modifications, temperature variations, differing saline concentrations or variable concentrations of specific ions. This problem is frequently encountered in urinary samples, the pH values of which are particularly variable, and which are subject to important variations in dilution.39 Several algorithms are available to realign peaks, and each method has its own advantages and drawbacks. By shifting, stretching or compressing the spectra along their horizontal (chemical shift) axis, these methods maximise the correlation between them (Figure 1.3).
(D) Bucketing
Classically-acquired NMR spectra correspond to a set of several thousand points. NMR spectral data acquired on biological tissue extracts or biofluids contain information corresponding to about 50 to 100 metabolites. This gap between the number of variables available and the number of useful variables must be reduced before statistical processing. This stage of segmentation, called bucketing (or binning), must firstly reduce the dimensionality of the dataset in order to extract N variables from each acquired metabolic profile (spectrum). This approach also diminishes the problem of spectral misalignment.
The most common segmentation technique comprises division of the spectrum into N windows of the same width, otherwise known as bins or buckets, usually of a size between 0.01 and 0.05 ppm. The total area within each bucket is measured instead of individual intensities, leading to a smaller number of variables. However, because of the lack of flexibility in the segmentation, areas from the same resonance or peak could be split into two or more bins, dividing the chemical information between several bins, which could influence the data analysis subsequently conducted. To address this problem, segmentation with variable intervals was developed. This technique, called intelligent bucketing, attempts to split the spectrum so that each bucket contains only one signal, peak or pattern (Figure 1.4). Of note, this method is highly sensitive to pH variations, and therefore the spectral realignment needs to be optimal before it is applied.8
Subsequently, bucketing is performed, which involves integration of the signal to create an NMR data matrix. It is important that an alignment approach is applied beforehand in order to ensure that bucketed resonances do not split across two integration areas rather than falling within one.
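To make the procedure concrete, the following minimal Python sketch (illustrative only; the array names `ppm` and `spectra` are hypothetical, denoting a chemical shift axis and a samples × points intensity matrix) integrates each spectrum into equal-width buckets:

```python
import numpy as np

def equal_width_buckets(ppm, spectra, width=0.04):
    """Integrate each spectrum into fixed-width buckets (bins).

    ppm     : 1D array of chemical shift values (ppm).
    spectra : 2D array, one spectrum per row, aligned with ppm.
    Returns the bucket centres and a samples x buckets data matrix.
    """
    edges = np.arange(ppm.min(), ppm.max() + width, width)
    n_buckets = len(edges) - 1
    # Assign every spectral point to a bucket index (clipped for edge cases).
    idx = np.clip(np.digitize(ppm, edges) - 1, 0, n_buckets - 1)
    buckets = np.zeros((spectra.shape[0], n_buckets))
    for b in range(n_buckets):
        in_bucket = idx == b
        if in_bucket.any():
            # The summed intensity replaces the individual point intensities.
            buckets[:, b] = spectra[:, in_bucket].sum(axis=1)
    centres = edges[:-1] + width / 2
    return centres, buckets
```

An intelligent-bucketing variant would instead place the bucket edges at signal-free minima of the mean spectrum, so that no resonance is split between buckets.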
(E) Normalisation
Normalisation is then performed in order to maximise the information to be extracted while minimising the noise and variability arising from any sample dilutions involved. It is applied to each spectrum in the dataset, and attempts to render the samples comparable with each other, including across repeated runs. Furthermore, it allows minimisation of possible biases introduced by the experimenter when collecting, handling and preparing samples.40 In an optimal situation, a metabolite constitutively expressed in biofluids or tissues could serve as an internal standard. One of the few metabolites used in metabolomics analysis for this purpose is creatinine, and creatinine normalisation is widely applied to urine samples. However, it remains controversial, with more and more studies linking creatinine variations to age, weight, exercise or gender.41,42 Moreover, creatinine normalisation should not be applied to solutions containing more than 2.5% (v/v) 2H2O, as deuterium has been shown to exchange with the 1H nuclei of the –CH2 function of this biomolecule, a process which gives rise to time-dependent decreases in its signal intensity.43 To overcome this lack of reliable internal standards, several varieties of standardisation methods have been developed.
Normalisation can either be expressed as a percentage across the entire spectrum or, alternatively, signal intensities can be expressed relative to that of an internal standard. Resonances of little or no metabolomics or diagnostic/prognostic importance, including those of xenobiotics, urea and water in biofluid samples, for example, may also be required to be removed prior to this process.
Quantile normalisation ensures the same distribution across all spectral bins by organising them in ascending order and calculating the mean at each rank position. If spectra share the same distribution, all the quantiles will be identical; for example, the mean of the highest-concentration metabolites will be reflected with this normalisation method.44 However, the highest-concentration metabolite may vary significantly across different samples, and therefore this mean value may not apply across all samples. Following this normalisation method, each feature will consist of the same set of values; however, within features, the distribution will differ.44 Similarly, cubic splines aim to provide the same distribution of metabolite features; however, non-linear relationships are assumed between the baseline and the individual spectra. In the cubic spline method, the geometric mean of all spectral features is calculated, and a cubic spline is then fitted between the baseline and the spectral intensities several times in order to achieve normalisation. Variance stabilisation normalisation (VSN), by contrast, operates differently to the above-described methods, and seeks to maintain a constant variance for each predictor variable within the dataset.
Other methods of normalisation, including probabilistic quotient normalisation, histogram matching, contrast normalisation and cyclic locally-weighted regression, can also be considered for use with metabolomics datasets, but are beyond the scope of this work.
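As an illustration of two of the simpler options described above, the sketch below (assuming `X` is a samples × buckets NumPy matrix; a hypothetical name) implements constant-sum and quantile normalisation:

```python
import numpy as np

def constant_sum_normalise(X, total=100.0):
    """Express each bucket as a percentage of its spectrum's total area."""
    return total * X / X.sum(axis=1, keepdims=True)

def quantile_normalise(X):
    """Force every spectrum to share the same intensity distribution.

    Buckets are ranked within each spectrum, and each rank is replaced
    by the mean of the intensities holding that rank across all spectra.
    """
    ranks = X.argsort(axis=1).argsort(axis=1)       # rank of each bucket
    reference = np.sort(X, axis=1).mean(axis=0)     # mean value at each rank
    return reference[ranks]
```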
(F) Scaling or bucket normalisation
Scaling can then be performed in order to standardise each bucket region. Indeed, the buckets associated with the most concentrated metabolites have a greater variance than others. Consequently, some buckets may carry greater weight than others in variance-based multivariate data analyses. To avoid this bias, it is essential to rescale the weight of each variable. This can be performed using autoscaling, by subtracting the mean from each observation and dividing by the standard deviation, or the widely preferred Pareto scaling, which also subtracts the mean, but then divides by the square root of the standard deviation. Hence, Pareto-scaled variables do not have unit variance; instead, each scaled variable retains a variance equal to the standard deviation of the original one, which attenuates, without eliminating, the dominance of high-variance buckets. For example, urinary metabolites present at low concentrations, such as formate, will produce lower signal intensities than, for instance, creatinine, and scaling these accordingly ensures near-normality for each variable and column (metabolite) variance homogeneity, despite their original concentrations.
Scaling methodologies have been reviewed by Gromski et al.,45 who suggest that VAST (variable stability) scaling, an extension of autoscaling, represents the best methodology for NMR data. Small predictor variable metabolite variations are accounted for using this method, as the autoscaled data are further weighted by each variable's mean-to-standard-deviation ratio (the inverse of its coefficient of variation).45 Other scaling methodologies include range- and level-scaling, which are not explored herein.
Transformation of the data is also useful to ensure a bell-shaped data distribution, that is, to reduce distributional skewness. Indeed, logarithmic or cube root transformations are often recommended for metabolomics datasets.
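These scaling and transformation steps are straightforward to implement; a minimal sketch operating column-wise on a samples × buckets matrix `X` (hypothetical name) is given below:

```python
import numpy as np

def autoscale(X):
    """Mean-centre each bucket and divide by its standard deviation."""
    return (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

def pareto_scale(X):
    """Mean-centre each bucket and divide by the square root of its SD."""
    return (X - X.mean(axis=0)) / np.sqrt(X.std(axis=0, ddof=1))

def vast_scale(X):
    """VAST scaling: autoscaled data weighted by the inverse coefficient
    of variation (mean/SD), down-weighting highly variable buckets."""
    return autoscale(X) * (X.mean(axis=0) / X.std(axis=0, ddof=1))

def log_transform(X, offset=1.0):
    """Logarithmic transformation to reduce distributional skewness;
    a small offset guards against log(0) for empty buckets."""
    return np.log(X + offset)
```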
Some authors recommend spectral or chromatographic smoothing to ensure noise reduction; however, small signals clearly need to be retained as far as possible by this process.46 Overall, the quality of the pre-processing applied to spectral data prior to statistical analysis determines the quality and accuracy of the results.
1.4.1 Spectral Prediction in Positive Signal Identification
Positive signal identification can be performed without statistical approaches, and there is a plethora of metabolite databases and identification platforms, such as the Human Metabolome Database (HMDB),47 MetaboLights,48 the Biological Magnetic Resonance Data Bank (BMRB),49 the Spectral Database for Organic Compounds (SDBS), the Madison-Qingdao Metabolomics Consortium Database (MMCD),50 the Birmingham Metabolite Library (BML-NMR),51 NMRshiftDB52 and the Metabolomics Workbench,53 which can markedly facilitate signal identification. In NMR it is important to account for the multiplicity, integral, J couplings and chemical shift values prior to the assignment of signals. However, statistical approaches have also been used in conjunction with bioanalytical techniques in order to identify signals which are correlated to each other, a process also facilitating assignments. These methodologies have been demonstrated both in model sample systems containing just a single molecule and in complex mixtures such as biofluid samples, and provide a pseudo-two-dimensional NMR spectrum. Figure 1.5 shows confirmation of the identity of n-butyric acid using the statistical total correlation spectroscopy (STOCSY) strategy as applied to faecal water, demonstrating the ability of this technique to tackle such complex mixtures.54 Figure 1.6 shows the elucidation of a mixture of sucrose and glucose signals utilising the STOCSY approach.
STOCSY has also been used in conjunction with the statistical recoupling of variables, giving rise to an R-STOCSY approach that shows correlations between distant clusters.56 Further methodologies for applying statistics to NMR spectra, predominantly from Nicholson's group, include statistical heterospectroscopy (SHY), which accounts for the covariance between signals and can be used to observe correlations between two applied analytical techniques, such as MS and NMR, or to extend STOCSY to MS data; this has been demonstrated by Crockford et al.57 and Nicholson et al.58 However, sufficient computing power is required to perform these techniques.57 Diffusion-ordered spectroscopy (DOSY) and STOCSY have also been combined to yield an S-DOSY technique which can be employed for complex mixture analysis in order to facilitate assignments, the deconvolution of overlapping metabolite signals, and simple comparisons of the diffusional variances in signals.55
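Computationally, one-dimensional STOCSY reduces to correlating the intensities of a chosen ‘driver’ variable with every other spectral variable across all samples. A minimal sketch (with hypothetical array names `X` for the bucketed samples × variables matrix and `ppm` for the corresponding chemical shift vector) is shown below:

```python
import numpy as np

def stocsy_1d(X, ppm, driver_ppm):
    """One-dimensional STOCSY trace from a samples x variables matrix.

    Peaks arising from the same molecule as the driver resonance
    correlate strongly (r close to 1) across samples.
    """
    d = np.abs(ppm - driver_ppm).argmin()          # nearest spectral variable
    Xc = X - X.mean(axis=0)                        # mean-centre each column
    cov = Xc.T @ Xc[:, d] / (X.shape[0] - 1)       # covariance with driver
    r = cov / (X.std(axis=0, ddof=1) * X[:, d].std(ddof=1))
    return r, cov   # conventionally plotted as cov coloured by r**2
```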
Other useful non-statistical techniques include 2D NMR experiments such as 1H–1H COSY and 1H–13C HSQC, which help with the assignment of NMR signals without involving such statistical complexity. Metabolite prediction has also been trialled by observation of the chemical shift and the concentration of the biofluid itself, comparing relationships between these two elements in order to provide a chemical shift and concentration dataset matrix. From this, a prediction model was constructed, including an algorithm model and salient navigator signals, such as those of creatine, creatinine and citrate, to aid prediction capability.59 The idea of chemical shift prediction remains in the early stages of development, and requires uniform sample preparation and operating frequencies in order to achieve successful assignments, as has been demonstrated for proteins60 and multicomponent biofluids.59
1.4.2 Statistical Methodologies Applied to NMR Spectra
At present there are numerous computational packages that can support the statistical analysis of NMR datasets. These include XLSTAT 2019, an add-on to Excel; MetaboAnalyst 4.0,61 a user-friendly online interface built on R scripts; SIMCA, an all-in-one software package for multivariate analysis; MATLAB; MVAPACK; and Python and R, script-based programming languages with suitable packages. The majority of statistical methodologies can be applied with any of the aforementioned software tools, which are predominantly available free of charge for researcher use.
Statistical analysis can be univariate or multivariate, both of which offer advantages and disadvantages, most of which are covered by Saccenti et al.62 Univariate analysis is simple to implement; however, it does not consider inter-relationships between metabolite concentrations. Metabolites can be independent or dependent, but are interlinked via pathways and may be correlated with other metabolites in the system. Notwithstanding this, statistical power is also limited by the observation of only one metabolite at a time. Multivariate analysis can be problematic in view of the high dimensionality of the data, which can cause the masking of metabolites, and noise or unimportant variables may appear significant when this is indeed not the case. Usually, a combination of univariate and multivariate statistics applied in such cases addresses these issues. Most of these tests need to take into account certain assumptions, which can be found in any statistical textbook, for example that the data have been suitably preprocessed, normalised and scaled to unit or near-unit variance as discussed above.
Each statistical technique herein will be described, and a case study showing statistical applications in 1H NMR spectral analysis will be considered. A range of applications will be explored to show the diversity of fields in metabolomics, although the predominant theme will be biofluids and liquid biopsies. A summary table showing the advantages and disadvantages of each technique in metabolomics applications is provided at the end of this chapter. Often, a combination of techniques, for example 1D and 2D NMR spectra, LC–MS and so forth, will be used in order to classify and provide statistical significance to the results acquired, which is required for validation. This chapter covers the most frequently applied statistical methods employed in metabolomics research investigations at present.
1.5 Univariate Data Analysis
Univariate data analysis is crucial in any metabolomics data analysis strategy. A variable may be insignificant in a multivariate model, but significant in a univariate context. This is because multivariate models can often miss/mask significant variables as all metabolites (and metabolite relationships) are simultaneously examined. Hence, it is important that univariate data analysis is integrated into metabolomics experimental designs. This is particularly salient for validation purposes for specific potential biomarkers.
Student's t-tests can be used in order to discover statistical significance in univariate datasets consisting of two sample comparisons, or more if suitable corrections are applied to control the false discovery rate. There are several variations of this test which rely on similar concepts, including the unequal-variance t-test derivative, and the unrecommended non-parametric Mann–Whitney U test. Typically, these tests can be paired or unpaired, the choice being made according to whether the samples compared are dependent or independent respectively. An unpaired t-test will evaluate the statistical significance of any differences between the mean values of two independent groups. Degrees of freedom are considered in order to establish statistical significance. As with all other parametric tests for evaluating differences between mean values, critical assumptions of normality and intra-sample variance homogeneity apply, together with additivity in the case of randomised-blocks ANOVA performed without replications (in which predictor variable interactions cannot be considered).
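In their standard forms, these test statistics are given by eqn (1.1)–(1.4) for the one-sample, paired, pooled two-sample and unequal-variance (Welch) cases respectively:

$$t = \frac{\bar{x} - \mu_0}{s/\sqrt{n}} \quad (1.1) \qquad\qquad t = \frac{\bar{x}_d}{s_d/\sqrt{n}} \quad (1.2)$$

$$t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{s^2\left(\dfrac{1}{n_1}+\dfrac{1}{n_2}\right)}}, \qquad s^2 = \frac{(n_1-1)s_1^2 + (n_2-1)s_2^2}{n_1+n_2-2} \quad (1.3)$$

$$t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\dfrac{s_1^2}{n_1}+\dfrac{s_2^2}{n_2}}} \quad (1.4)$$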
Here, x̄ represents the sample mean, µ0 the mean value under the null hypothesis, s the standard deviation, n the sample size, x̄d and sd respectively the mean and standard deviation of the within-pair differences in eqn (1.2), s1² and s2² the group variances (the numerical subscript indicating the group number), n1 and n2 the corresponding group sample sizes, and s² the pooled sample variance.
In eqn (1.1) and (1.2) the degrees of freedom are (n−1); in eqn (1.2), n represents the number of sample pairs. In eqn (1.3) the degrees of freedom are (n1+n2−2), whilst the Welch–Satterthwaite equation is required for the calculation of the degrees of freedom for eqn (1.4).
Percival et al. applied a paired Student's t-test in a metabolomics investigation monitoring methanol and other metabolites in saliva using 1H NMR analysis.38 Two samples were taken from smoking participants prior and subsequent to the smoking of a single cigarette; thus, a paired test was appropriate. The paired Student's t-test showed highly significant differences for molecules such as methanol and propane-1,2-diol, which were significantly elevated post-smoking, with significance levels of p < 10−6 and p = 2.0 × 10−4 respectively.
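Such a paired comparison can be reproduced in a few lines with SciPy; the sketch below uses synthetic, illustrative pre-/post-smoking intensities (not the published values):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Synthetic salivary methanol intensities (arbitrary units) for
# 10 donors sampled before and after smoking a single cigarette.
pre = rng.normal(loc=1.0, scale=0.15, size=10)
post = pre + rng.normal(loc=0.4, scale=0.10, size=10)  # post-smoking elevation

t_stat, p_value = stats.ttest_rel(post, pre)  # paired Student's t-test
print(f"paired t = {t_stat:.2f}, p = {p_value:.2e}")
```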
The Mann–Whitney U test counts, for every possible pair of observations taken one from each sample group, the number of times the observation from the first group exceeds that from the second; this count is computed for both sample groups. The U statistic thus calculated is, after scaling by the number of pairwise comparisons, equivalent to the area under the receiver operating characteristic (ROC) curve, which will be described in more detail later.
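This equivalence between U and the area under the ROC curve is easily verified numerically; a sketch with synthetic data (assuming recent SciPy and scikit-learn versions, in which `mannwhitneyu` reports the U statistic of its first argument):

```python
import numpy as np
from scipy.stats import mannwhitneyu
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
group_a = rng.normal(1.2, 0.3, size=40)   # e.g. metabolite levels, class 1
group_b = rng.normal(1.0, 0.3, size=50)   # class 0

u = mannwhitneyu(group_a, group_b).statistic
auc = roc_auc_score(np.r_[np.ones(40), np.zeros(50)],
                    np.r_[group_a, group_b])

# U divided by the number of pairwise comparisons equals the AUROC.
assert np.isclose(u / (40 * 50), auc)
```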
Fold-change analysis can also be performed to assess the degree of change in variable levels, and can be used to describe an increase of “X-fold” per sample classification; it is simply the ratio of two mean values.
1.5.1 ANOVA, ANCOVA and Their Relationships to ASCA
Analysis-of-variance has been successfully applied in metabolomics investigations, such as the detection and determination of methanol in smokers’ salivary profiles using 1H NMR analysis;38 one typical experimental design is shown in eqn (1.6). This analysis of covariance (ANCOVA) model included the between-sampling-time-points (Ti), smoking/non-smoking group (Sj), between-participants (P(j)k) and between-gender (Gl) sources of variation. The mean value µ and the unexplained error eijkl are also incorporated into this mathematical model, in addition to the first-order interaction effect between the smoking/non-smoking groups and sampling time-points, that is TSij. Participants were ‘nested’ within treatment groups.
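In full, the model is:

$$Y_{ijkl} = \mu + T_i + S_j + P_{(j)k} + G_l + TS_{ij} + e_{ijkl} \quad (1.6)$$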
This ANCOVA test complemented the results acquired in the aforementioned paired Student's t-test, but is particularly advantageous as the ANCOVA model factored in all possible sources of variation, including interaction effects and unexplained errors. However, ANOVA or ANCOVA models can be applied in different manners which are also applicable to metabolomics applications. ANOVA-simultaneous component analysis (ASCA), for example, allows the comparison of data which have been acquired from the same human participants at increasing time-points, or when considering alternative second variables. It can handle two experimental factors, but also observe the factors separately, along with their magnitudes, by operating with a combination of ANOVA factors and PCA, the latter of which is described below. It can also isolate contributions from statistical interaction effects, just as univariate ANOVA and ANCOVA models can.
Factorial ANOVA can handle more than one factor simultaneously at multiple levels, and repeated-measures ANOVA can be applied in longitudinal studies. ANOVA techniques are generally more applicable in targeted metabolomics using MS; however, there are a few examples in NMR-based metabolomics applications, as discussed below.
ANCOVA can account for qualitative and quantitative variables.
Ruiz-Rodado et al. successfully applied ANCOVA and ASCA to 1H NMR metabolomics datasets from mice with Niemann–Pick disease, type C1.63 The ANCOVA model accounted for three factors and six sources of primary variation, as shown in eqn (1.7), to provide the univariate predictor variable Yijk. The between-disease-classifications (Di), between-genders (Gj) and experimental-sampling-time-point (Tk; for example, for 3, 6, 9 and 11 week-old mice) sources of variation were incorporated into the experimental ANCOVA design. Interactions between each pair of variables were also considered, namely DGij, DTik and GTjk; interaction effects are computed to assess the dependence of the effects or significance of one variable at different levels or classifications of another. µ represents the mean value in the population in the absence of all sources of variation, and eijk is the residual (unexplained) error contribution.
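Explicitly:

$$Y_{ijk} = \mu + D_i + G_j + T_k + DG_{ij} + DT_{ik} + GT_{jk} + e_{ijk} \quad (1.7)$$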
Once key features had been identified using multivariate analysis, ANCOVA was applied in a univariate context in order to reveal information regarding significant metabolites that were time-dependent, such as 3-hydroxyphenylacetate, or gender-dependent, such as tyrosine. Moreover, this tool was able to show significant metabolites for combinations of variables; for example, the “time-point × disease” interaction effect revealed inosine as one of the significant biomarkers, whilst the “gender × disease” interaction effect showed a combined lysine/ornithine resonance as one of the significant distinguishing spectral features. Thus, this technique can be used successfully in metabolomics across numerous markers, and provides distinct p values for each metabolite investigated. False discovery rates and power calculations can be applied, and will be discussed below.
An alternative to ASCA is multilevel simultaneous component analysis (MSCA), which can also allow for paired datasets; it divides the data into two parts, for example age and sex, and then monitors the variance associated within and between each variable.64 ASCA supersedes MSCA, of which it is simply a multivariate extension with the added benefit of ANOVA, and this explains why MSCA is less commonly used in metabolomics studies, other multilevel techniques being more frequently applied.
1.6 Multivariate Analysis
It is, of course, essential to incorporate multivariate data analysis into a metabolomics investigation. Univariate data analysis can deem some metabolites insignificant when this is not the case in a multivariate context, generally because their effects only correlate, perhaps strongly, with one or more of a pattern of other metabolite variables. Moreover, the insignificance of a variable in univariate analysis could also be explicable by high levels of biological and/or measurement variation. Multivariate analysis may overcome this problem by further explaining classifications attributable to biological/measurement variations, and is able to combine variables together as components according to their correlations and inter-relationships.
1.6.1 Principal Component Analysis
The most common unsupervised multivariate method is PCA, which is particularly useful for data mining and outlier detection. It summarises the variance contained in the dataset in a small number of principal components (PCs, latent variables). The principle consists of applying a rotation in the space of the N-dimensional variables, so that the new axis system, composed of the principal components, maximises the dispersion of the samples. The first principal component (PC1) represents the direction of the space containing the largest variance expressed in the analysed data. The second principal component (PC2) represents the second direction of greatest variance, in the subspace orthogonal to PC1, and so on. This procedure continues until the entire variance is explained, and thus allows the essential information contained within the dataset to be synthesised by a limited number of PCs (usually ≪ N). Each PC corresponds to a linear combination of the N original metabolite variables, whose weights (loadings) represent their contributions to these components. The representation of these components makes it possible to visualise the associated metabolic signatures. One of the interests of the method is to identify, without a priori considerations, possible groupings of individuals and/or variables. However, it is possible that the primary sources of variance within a cohort of samples are not related to the effect studied, since this unsupervised analysis attempts a sample/participant classification without any prior consideration of their class memberships. Supervised analysis methods, however, may identify variations in metabolites which are or may be correlated with the parameters of interest of the study. Both approaches also allow the detection of samples with atypical behaviour (“outliers”) when compared relative to the remainder of the population. Figure 1.7 shows 95% confidence intervals for a typical PCA scores plot. These may be established using a multivariate generalisation of Student's t-test, known as Hotelling's T2 test.30 T2 determines how far away an observation lies from the centre of the PC scores space. In Figure 1.7, the points highlighted with blue arrows are outliers.
These outliers could arise for a variety of reasons: xenobiotics and/or unusual or unexpected metabolites may be detected in the urine or, alternatively, a sample could display unexpected intensity alterations in a particular profile region. The PCA plot will indicate not only which samples are outliers, but also which principal component (PC) they score strongly on and, via the corresponding loadings plot, which bucket regions are strongly loaded on that component. This information aids the identification of explanations for these outlying samples/participant donors.
Figure 1.7 shows a typical PCA scores plot obtained from feline urine samples, with two outliers also identified. A further example is shown in Figure 1.8 (provided by Kwon et al.65), who assessed green coffee bean metabolites; here, a sample was removed in view of poor spectral shimming, with the unshimmed spectrum shown in the inset image.
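A minimal sketch of this workflow using scikit-learn is shown below (assuming `X` is the pre-processed samples × buckets matrix; the critical value used here is one common form of the Hotelling's T² limit):

```python
import numpy as np
from scipy import stats
from sklearn.decomposition import PCA

def pca_with_hotelling(X, n_components=2, alpha=0.05):
    """PCA scores with a Hotelling's T-squared outlier threshold.

    Samples whose T2 exceeds the F-distribution-based critical value
    fall outside the (1 - alpha) confidence ellipse of the scores plot.
    """
    n = X.shape[0]
    pca = PCA(n_components=n_components)
    scores = pca.fit_transform(X)
    # T2 for each sample from its scores and the component variances.
    t2 = (scores**2 / pca.explained_variance_).sum(axis=1)
    a = n_components
    critical = a * (n - 1) / (n - a) * stats.f.ppf(1 - alpha, a, n - a)
    return scores, t2, t2 > critical   # scores, T2 values, outlier flags
```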
An alternative to PCA analysis is simultaneous component analysis (SCA) which takes into account different sources of variation by separating datasets into sub-matrices;64 however, PCA is more commonly employed in this field. An extension of PCA, namely group-wise PCA (GPCA) has recently been created in order to distinguish between overlapping groups of variables, and may begin to be more commonly used in metabolomics investigations in the near future.66
1.6.2 Partial Least Squares Discriminatory Analysis
Partial least squares discriminatory analysis (PLS-DA) is a supervised MV analysis technique which is able to distinguish between disease or alternative classifications, and focuses on ‘between-class’ maximisation. This method aims to predict a response variable Y (qualitative) from an explanatory data matrix X. The PLS components are constructed to capture the maximum variance of the data (X) whilst being maximally correlated with Y. In this case, Y is a discrete variable that takes a value depending only on the categorical class associated with the sample. PLS analysis allows the identification of the explanatory variables that are most important in the prediction of the variable Y, and thus makes it possible to highlight the most discriminant variables between the groups, and whether metabolites are upregulated or downregulated, by creating latent structures and variable importance in projection (VIP) plots. As in PCA, components can be plotted in order to observe clusterings, with each component being orthogonal to the others, component 1 again containing the highest sample variance, component 2 the next highest, and so forth. PLS-DA has many variants, including orthogonal PLS-DA (O-PLS-DA), multilevel PLS-DA (M-PLS-DA), powered PLS-DA (PPLS-DA) and N-way PLS-DA. Each has its own advantages and disadvantages for metabolomics use. For example, O-PLS-DA can only handle two groups for comparative evaluations; in this case, the orthogonal signal correction filter applied enables the separation of class-predictive variation from variation uncorrelated (orthogonal) to class membership.
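A minimal two-class PLS-DA sketch is given below, implemented here (one possible route among several) as a PLS regression on a dummy-coded class vector with a standard VIP calculation; multi-class problems would one-hot encode Y instead:

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression

def pls_da(X, y, n_components=2):
    """PLS-DA via PLS regression on a binary 0/1 class vector y.

    Returns the fitted model, the latent-variable scores, and VIP
    values ranking each bucket's contribution to the separation
    (VIP > 1 is a common importance threshold).
    """
    pls = PLSRegression(n_components=n_components)
    pls.fit(X, y.astype(float))
    t, w, q = pls.x_scores_, pls.x_weights_, pls.y_loadings_
    # Y-variance explained by each latent variable.
    ssy = np.diag(t.T @ t) * q.ravel() ** 2
    p = X.shape[1]
    w_norm = w / np.linalg.norm(w, axis=0)
    vip = np.sqrt(p * (w_norm ** 2 @ ssy) / ssy.sum())
    return pls, t, vip
```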
1.6.2.1 Validation Methods
Supervised analysis, unlike PCA, can lead to biased models and overestimation of their predictive capabilities. Indeed, the large amount of data generates a space with a large number of dimensions in which it is almost always possible to find a direction of separation between the samples. Therefore, it is essential to ensure the quality of the models established using validation methods such as permutation testing, cross-validation and ROC analysis.
Cross-validation: cross-validation is the most common validation method used in metabolomics. It is based on two parameters to evaluate a model's performance: R² and Q²Y. R²X and R²Y represent the proportions of the variance of the X and Y matrices explained by the model, and the cumulative Q²Y represents the predictive quality of the model; it can be interpreted as an estimate of R² for new data. The closer these values are to 1, the more the model is considered predictive and the separation significant.67
Permutation test: the objective of this test is to confirm that the initial model is superior to models obtained by permuting the class labels and randomly reassigning them to different individuals. The initial model is statistically compared with all the other randomly-assigned models and, on this basis, a p-value is calculated. If the p-value is lower than 0.05, this indicates that the initial model performs better than 95% of the randomly-assigned models.68
Cross-validation and permutation tests are complementary, and both must be performed in order to validate a model. Indeed, cross-validation makes it possible to evaluate the capacity of the model to correctly predict in which class a new sample will be, while the test of permutation validates the model used.68
ROC analysis: the area under the receiver operating characteristic curve (AUROC) value can be used to monitor the sensitivity and specificity of single metabolites, and the performance of the test system as a whole. AUROC values range between 0 and 1.0: a value of 1.0 represents a perfect distinction between classes, values greater than 0.5 are considered discriminatory, and a value of 0.50 demonstrates that the model is as likely to classify a sample correctly as the tossing of a coin.69 PLS-DA is then validated using permutation testing, which is able to define the p value for the PLS-DA discriminatory ability. Further validation can be performed using leave-one-out cross-validation (LOOCV) or 10-fold cross-validation in order to obtain the Q2 and R2 values; Q2 values greater than 0.5 are considered satisfactorily discriminatory. Advantageously, PLS-DA provides the VIPs, which are able to distinguish which metabolites are responsible for the distinctions observed, and also whether these metabolites are up- or downregulated.
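The validation loop itself is simple to sketch: the cross-validated performance of the observed model is compared against a null distribution generated by refitting after shuffling the class labels (a hypothetical helper, assuming scikit-learn is available; AUROC is used here as the performance metric):

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_predict

def permutation_p_value(X, y, n_perm=1000, n_components=2, cv=10, seed=0):
    """Permutation test for a cross-validated PLS-DA model.

    The p-value is the fraction of label-permuted models whose
    cross-validated AUROC matches or exceeds the observed one.
    """
    rng = np.random.default_rng(seed)

    def cv_auroc(labels):
        pred = cross_val_predict(PLSRegression(n_components=n_components),
                                 X, labels.astype(float), cv=cv).ravel()
        return roc_auc_score(labels, pred)

    observed = cv_auroc(y)
    null = np.array([cv_auroc(rng.permutation(y)) for _ in range(n_perm)])
    p = (1 + (null >= observed).sum()) / (n_perm + 1)
    return observed, p
```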
PLS-DA was effectively applied to a 1H NMR investigation of brain extracts obtained post-mortem from patients with Huntington's disease and control patients (Figure 1.9). Permutation testing was applied in order to validate the study using 2000 permutations for the frontal lobe and striatum region analyses, yielding p values of 0.003 and <0.001 respectively.70 In addition, when permutation testing was performed again using only 1000 permutations, the corresponding values for the frontal lobe and striatum were 0.108 and 0.015 respectively, which indicated that the frontal lobe was less affected by the pathological implications of Huntington's disease than the striatum.70 VIPs were also useful for the identification of the metabolites causing these significant differences, and their up- or downregulation status (Figure 1.10). AUROC, sensitivity and specificity values were 0.942, 0.869 and 0.865 respectively using the training/discovery set, and 0.838, 0.818 and 0.857 respectively using 10-fold cross-validation, results which demonstrate the success of the model.
O-PLS-DA has been used to discriminate between two groups using orthogonal latent structures. O-PLS-DA analyses yield a loadings diagram with an S-shaped (sigmoidal) curve, for which validation and permutation tests can also be performed. This S-plot provides a visualisation of the O-PLS-DA loadings (Figure 1.10), which is useful for the identification of significant metabolites.
The successful use of O-PLS-DA has been demonstrated by Quansah et al., who observed the effects of an anti-ADHD psychostimulant, methylphenidate (MPH), on brain metabolite levels.71 Using this methodology, the researchers were able to establish significant and non-significant groupings. A significant difference was observed between the acute high-dose (5.0 mg kg−1) MPH-treated and age-matched saline-treated control groups, with an O-PLS-DA model showing R2X=0.60, R2Y=0.54 and Q2=0.44, with a permutation test p value of 0.0005. A lower acute dosage of 2.0 mg kg−1 MPH provided insignificant results when compared with the saline-treated control groups, showing R2X=0.45, R2Y=0.05 and Q2<0.1, with a permutation p value of 0.93. Similarly, the lower acute dose of 2.0 mg kg−1 given twice daily was not significantly different from the control group.
The significant results pertaining to the higher dosage were further analysed, and an S-plot (Figure 1.10) was obtained; the results acquired were complementary to those obtained with ANOVA, and revealed significant metabolites. The most discriminatory metabolites observed in the O-PLS-DA analysis appear at each terminal of the S-plot, and are highlighted as glucose, N-acetyl-aspartate (NAA), inosine, gamma-aminobutyric acid (GABA), glutamine (Gln), hypoxanthine, acetate, aspartate and glycine (Figure 1.10).
1.6.2.2 Canonical Correlation Analysis
Canonical correlation analysis (CCorA) is a valuable technique for revealing correlations between two sets of variables, usually predictor and response ones. This approach first forms independent PCs for each of the two datasets, and can then be used to explore the significance of the inter-relationships between these. This has been demonstrated by Probert et al.,31 in which scores vector datasets were derived from separate 1H NMR and traditional clinical chemistry determination datasets respectively. For this study, observations of the loading vectors showed that the total lipoprotein triacylglycerol-CH3 function-normalised 1H NMR triacylglycerol resonances loaded strongly on PC1–PC4 from the 1H NMR-based dataset (shown in red in Figure 1.11), and the total triacylglycerol concentration-normalised clinical chemistry laboratory-determined total, low-density-lipoprotein (LDL)- and high-density-lipoprotein (HDL)-associated cholesterol levels loaded on PC1* and PC2* (shown in green in Figure 1.11). This CCorA analysis demonstrated firstly that the PC2* scores vectors positively correlated with those of PC2, consistent with their common HDL sources. Secondly, PC1* was negatively correlated with PC4; that is, a linear combination of plasma triacylglycerol-normalised total cholesterol and LDL-cholesterol concentrations was anti-correlated with the 1H NMR PC arising from the LDL-triacylglycerols.
1.6.2.3 Extended Canonical Variate Analysis
Extended canonical variate analysis (ECVA) uses a more complex supervised algorithm than PLS-DA, and seeks to maximise the ratio of between-class to within-class variation.72 ECVA can examine individual metabolite (spectral) regions, in addition to the dataset as a whole. A further benefit of ECVA is that it can discriminate between more than two groups without overfitting.
Figure 1.12 shows the number of misclassifications for each spectral interval in the averaged NMR spectra of 26 wine samples. The region with the fewest misclassifications, and therefore the most discriminatory, is highlighted as the 100th interval, with only two such misclassifications. Figure 1.13 shows a scores plot of ECV3 versus ECV1, revealing clear distinctions between wineries based on the selected 100th interval (Table 1.1).
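The interval-screening step can be approximated as follows. Published ECVA implementations are MATLAB-based, so this hedged sketch substitutes a simple cross-validated linear discriminant analysis for each spectral interval and records the misclassification count per interval; the interval with the fewest errors plays the role of the discriminatory 100th interval above. All data are simulated.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import LeaveOneOut, cross_val_predict

def interval_errors(X, y, n_intervals=100):
    """Leave-one-out misclassification count for each spectral interval."""
    errors = []
    for block in np.array_split(np.arange(X.shape[1]), n_intervals):
        pred = cross_val_predict(LinearDiscriminantAnalysis(),
                                 X[:, block], y, cv=LeaveOneOut())
        errors.append(int((pred != y).sum()))
    return errors

rng = np.random.default_rng(0)
X = rng.normal(size=(26, 2000))        # 26 wine spectra x 2000 data points
y = rng.integers(0, 3, size=26)        # three wineries
errs = interval_errors(X, y)
print("best interval:", int(np.argmin(errs)), "errors:", min(errs))
```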
| Metabolomic Methodology | Classification or Statistical | Univariate or Multivariate | Supervision | Advantages (+) and Disadvantages (−) |
| --- | --- | --- | --- | --- |
| ANOVA | Statistical | Univariate | Unsupervised | + Hypothesis testing, with the ability to evaluate the statistical significance of a wide range of contributory variables, and their interactions, simultaneously. Partitions the total experimental variance into differential ‘predictor’ components, which may be fixed or random effects. Satisfaction of essential assumptions can be achieved with suitable transformations, for example logarithmic or square root ones. |
| Student's t-test | Statistical | Univariate | Unsupervised | + Hypothesis testing. − Lacks corrections for the false discovery rate, and is only appropriate for comparisons of the means of two sample groups. |
| Mann–Whitney U-test | Statistical | Univariate | Unsupervised | + Hypothesis testing; the non-parametric equivalent of the two-sample t-test. + Data do not require normalisation prior to use. |
| Fold-change analysis | Statistical | Univariate | Unsupervised | + Hypothesis testing; represents the ratio of two sample group mean values, and the significance of these indices may be tested. |
| ASCA | Statistical | Multivariate | Unsupervised | + Can consider paired samples, for example from the same person at different time-points, or two or more possible predictor variables simultaneously. |
| PCA | Statistical | Multivariate | Unsupervised | + Outlier detection. + Unsupervised multivariate technique for dimensionality reduction and the preliminary exploration of 2D or 3D sample or participant clusterings. |
| PLS-DA | Statistical | Multivariate | Supervised | + VIPs for significant metabolites. − As it is subject to overfitting, permutation and validation testing are essential. |
| O-PLS-DA | Statistical | Multivariate | Supervised | + S-plot for significant metabolites. − Can only consider two groups; validation and permutation testing required. |
1.7 Metabolic Pathway Analysis
Both univariate and multivariate statistical approaches enable users to explore which metabolites are up- or downregulated. However, the biological meaning of such changes is not revealed unless pathway analysis is performed. Metabolite set enrichment analysis (MSEA) and metabolomics pathway analysis (MetPA) are able to determine whether metabolite concentration changes relate to metabolic pathways, perturbations of which may be involved in the disease process explored.74,75 These features are integrated into MetaboAnalyst 4.0. Disturbed metabolic pathways involving metabolites identified and quantified by NMR analysis are identified through the exploitation of databases such as KEGG (Kyoto Encyclopedia of Genes and Genomes) or Reactome, and then reconstructed and visualised using a software tool such as Cytoscape (www.cytoscape.org), MetaboAnalyst (www.metaboanalyst.ca) or MetExplore (metexplore.toulouse.inra.fr/index.html/).
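Under the hood, MSEA-style over-representation analysis reduces to a hypergeometric tail probability; the sketch below shows this calculation with SciPy, using invented metabolome, pathway and hit counts purely for illustration.

```python
# Over-representation analysis of the kind underlying MSEA. With N annotated
# metabolites, of which K belong to a given pathway, and n significantly
# changed metabolites containing k pathway members, the enrichment p value is
# the hypergeometric tail probability P(X >= k). All counts are illustrative.
from scipy.stats import hypergeom

N, K = 200, 15     # metabolome size, pathway size (e.g. one KEGG pathway)
n, k = 20, 6       # significant metabolites, pathway hits among them

p_value = hypergeom.sf(k - 1, N, K, n)   # survival function gives P(X >= k)
print(f"enrichment p value: {p_value:.4g}")
```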
1.8 Further Important Considerations: Accuracy and Precision
It is important to appreciate the challenges of applying such statistical tools to metabolomics datasets, and data should be interpreted and described in an accurate manner. Bias can easily be introduced from both analytical and biological perspectives. Although analytical variances are addressed in previous sections of this chapter, it is important not to introduce bias from sample preparation techniques, whether this be via extraction, storage, or the procedures and analytical conditions employed.76 Such bias can lead to metabolite assignment and/or concentration errors. Biological variance also needs to be considered, and hence it is important to follow experimental design factors such as those noted above; for example, potential metabolic differences between ages, genders, body mass indices, races and so forth should be incorporated into experimental designs, although, unlike in univariate analysis approaches, this is often very difficult to achieve in multivariate analysis models. Moreover, the statistical power of metabolomics experimental models can be poor in view of low sample numbers; in a multivariate sense, power cannot yet be assessed a priori, and usually only a posteriori. Validation of metabolomics datasets is also required, for example by ensuring that biomarkers: (i) are reproducible at another laboratory site; and (ii) decrease or increase consistently upon treatment, if they are indeed influenced by such processes. In a univariate sense, the required sample size can be determined in advance, as indeed it can with pilot datasets in multivariate analysis.
Intriguingly, PCA can be employed to monitor ‘between-replicate’ analytical reproducibility, and Figure 1.14 shows results from an experiment involving the 1H NMR profiles of n=4 urine samples, which were analysed in duplicate on two separate days so that both the within- and between-assay effects could be evaluated. These results demonstrate a high level of discrimination between the individual samples analysed, together with an acceptable level of agreement between replicates both within and between assays.
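A hedged sketch of this replicate check, with simulated duplicate profiles standing in for the urine spectra of Figure 1.14, might look as follows: a small PC1 spread within each sample's replicates, relative to the separation between samples, indicates acceptable reproducibility.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
base = rng.normal(size=(4, 150))          # four distinct 'urine' profiles
# duplicate analyses on each of two days = four replicate spectra per sample
X = np.vstack([base + rng.normal(scale=0.05, size=base.shape)
               for _ in range(4)])
labels = np.tile(np.arange(4), 4)         # sample identity of each spectrum

scores = PCA(n_components=2).fit_transform(X - X.mean(axis=0))
for s in range(4):
    spread = np.ptp(scores[labels == s, 0])   # within-sample PC1 spread
    print(f"sample {s}: PC1 spread = {spread:.3f}")
```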
1.9 Multivariate Power Calculations
An example of a retrospective power calculation is shown in Figure 1.15, from the investigation performed by Quansah et al.77 This study monitored markers in murine brains following the administration of acute methylphenidate, used a total of n=36 samples (18 untreated and 18 treated), and featured the observation of 13 biomarker variables. Predicted statistical power values of 0.99 and 1.00 were achieved for 16 and 24 samples respectively; therefore, there was a justification for the sample size of n=18 selected for this particular study.
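For the univariate analogue of such a calculation, power for a two-sample t-test can be traced across candidate group sizes, as sketched below with statsmodels; the effect size and the Bonferroni adjustment across 13 biomarkers are assumptions chosen for illustration, and multivariate power estimates (as in Figure 1.15) instead require pilot datasets.

```python
# Univariate power curve of the kind used to justify group sizes; the effect
# size (Cohen's d) and the 13-biomarker Bonferroni adjustment are assumed.
from statsmodels.stats.power import TTestIndPower

effect_size = 1.2            # assumed standardised mean difference
alpha = 0.05 / 13            # Bonferroni adjustment across 13 biomarkers
for n_per_group in (8, 12, 16, 18, 24):
    power = TTestIndPower().power(effect_size=effect_size,
                                  nobs1=n_per_group, ratio=1.0, alpha=alpha)
    print(f"n = {n_per_group:2d} per group -> power = {power:.2f}")
```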
Power analysis also helps to keep studies within appropriate ethical boundaries, since it avoids the recruitment or sacrifice of more participants or animals than are required to answer the research question.
Some techniques perform better with a larger number of variables, as is the case for PLS-DA.78 However, overfitting is also possible if the number of variables exceeds the number of samples, a now increasingly common situation in metabolomics research.78
Type I and Type II errors both need to be considered: the former is the improper rejection of the null hypothesis, that is a false positive, whilst the latter is the failure to reject a null hypothesis that is in fact false, that is a false negative.76
Violation of the assumptions underlying statistical analysis approaches may sometimes impart a false significance to a variable, and vice versa. Common misconceptions regarding p values, confidence intervals and statistical power were explored by Greenland et al.79 Bonferroni, Bonferroni–Holm and Sidak corrections can be applied to control type I errors, and the Benjamini–Hochberg approach can be applied to control the false discovery rate in univariate analysis.
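These corrections are readily applied in practice; the sketch below runs each of the named procedures over an illustrative vector of univariate p values using statsmodels ('fdr_bh' denotes the Benjamini–Hochberg procedure).

```python
from statsmodels.stats.multitest import multipletests

pvals = [0.001, 0.008, 0.012, 0.040, 0.210, 0.660]   # illustrative only
for method in ("bonferroni", "holm", "sidak", "fdr_bh"):
    reject, p_adj, _, _ = multipletests(pvals, alpha=0.05, method=method)
    print(f"{method:10s}", p_adj.round(3), reject)   # adjusted p, rejections
```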
1.10 Future Scope of Metabolomics Research
Metabolomics is now becoming a more integral part of diagnostics, serving to strengthen and more rapidly predict disease conditions and their progression. Several diseases present challenging diagnostic problems, and markers within disease metabolomics can be combined with other data for enhanced discrimination. An example of this is provided by Glaab et al.,80 in which metabolomics data and positron emission tomography brain neuroimaging data were combined in order to increase the discriminatory power of support vector machine (SVM) and random forest (RF) analysis strategies, employing LOOCV and ROC analysis, in the diagnosis of Parkinson's disease. In addition, it is evident from the publications consulted throughout this chapter that metabolomics is used in a variety of fields, and that statistical techniques and machine learning technologies are useful in combination. Future applications of spectral analysis are becoming more interdisciplinary, enabling more robust models and more accurate statistical analysis.
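A minimal sketch of this feature-level fusion strategy, assuming scikit-learn and simulated metabolomics and imaging blocks (all names and values illustrative, not those of the cited study), is given below.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import LeaveOneOut, cross_val_predict
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)
metab = rng.normal(size=(50, 30))        # metabolite concentration block
imaging = rng.normal(size=(50, 10))      # neuroimaging feature block
y = np.repeat([0, 1], 25)                # 0 = control, 1 = disease
metab[y == 1, :3] += 1.0                 # inject a weak disease signal
X = np.hstack([metab, imaging])          # simple feature-level fusion

# Compare SVM and RF classifiers under LOOCV using the ROC AUC
models = {"SVM": make_pipeline(StandardScaler(), SVC(probability=True)),
          "RF": RandomForestClassifier(n_estimators=200, random_state=0)}
for name, model in models.items():
    proba = cross_val_predict(model, X, y, cv=LeaveOneOut(),
                              method="predict_proba")[:, 1]
    print(name, "LOOCV AUROC:", round(roc_auc_score(y, proba), 2))
```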
Within the NMR-based metabolomics field, one major drawback is analyte sensitivity. However, methodologies such as hyperpolarisation, together with enhancing technologies, for example the use of cryoprobes, are increasing the sensitivity of biomarker analyte detection. It should also be noted that statistical techniques and machine learning strategies are evolving, and are often used in combination to cope effectively with the high dimensionality of datasets acquired in NMR-linked metabolomics. Indeed, enhancements in computing power are promoting faster turnaround times for data acquisition and analysis.
1.11 Conclusions
Statistical approaches can be successfully applied to spectral and chromatographic datasets acquired in numerous fields, whether this be the diagnosis of diseases, the stratification of disease developmental stages and prognostics, or the creation of pseudo-two-dimensional spectra, as in STOCSY-type approaches. Sound relationships can be established from NMR datasets if correct standard operating procedures, incorporating careful experimental design, are followed. Statistical methods can serve to distinguish between the metabolic patterns of different disease classifications and disease stages in both multivariate and univariate senses. Machine learning complements statistical techniques, and aids further understanding of the clustering of metabolites.
List of Abbreviations
- ANCOVA
Analysis of Covariance
- ANOVA
Analysis of variance
- ASCA
Analysis of variance simultaneous component analysis
- AUC
Area Under Curve
- BLIP
Backward Linear Prediction
- BML-NMR
Birmingham Metabolite Laboratory-Nuclear Magnetic Resonance
- CCorA
Canonical Correlation Analysis
- CV
Cross Validation
- CW
Continuous Wave
- DFA
Discriminant Function Analysis
- ECVA
Extended Canonical Variate Analysis
- EDNN
Ensemble Deep Neural Networks
- EDTA
Ethylenediaminetetraacetic acid
- FLIP
Forward Linear Prediction
- HCA
Hierarchical Clustering Analysis
- HMDB
Human Metabolome Database
- HPLC
High Performance Liquid Chromatography
- KNN
K-Nearest Neighbour
- LDA
Linear Discriminant Analysis
- LOOCV
Leave one out cross validation
- LS
Least Squares
- MANOVA
Multivariate Analysis of Variance
- MLR
Multiple Linear Regression
- MMCD
Madison Metabolomics Consortium Database
- MS
Mass Spectrometry
- NMR
Nuclear Magnetic Resonance
- O-PLS-DA
Orthogonal-Partial Least Square-Discriminant Analysis
- PC
Principal Component
- PCA
Principal Component Analysis
- PLS-DA
Partial Least Squares Discriminant Analysis
- RBF
Radial Basis Function
- RF
Random Forest
- ROC
Receiver Operating Characteristic
- SDBS
Spectral Database for Organic Compounds
- SHY
Statistical Heterospectroscopy
- STOCSY
Statistical Total Correlation Spectroscopy
- SOM
Self-Organising Maps
- SVM
Support Vector Machine
- VIP
Variable Importance Plot
The authors are grateful to Katy Woodason for useful discussions. BCP would like to acknowledge De Montfort University for her fee waiver for her PhD studies.