Chapter 9: Statistical Methods for Continuous Measured Endpoints in In Silico Toxicology
Published: 28 Oct 2010
P. H. Rowe, in In Silico Toxicology, ed. M. Cronin and J. Madden, The Royal Society of Chemistry, 2010, ch. 9, pp. 228-251.
This chapter covers three regression techniques: multiple linear, principal components and partial least squares. Toxicity data recorded as continuously varying endpoints can be predicted by these techniques, all of which combine suitably weighted values of one or more chemical descriptors. Throughout the chapter, the need to produce models that are statistically stable, demonstrably predictive and capable of interpretation in biological and chemical terms is emphasised. The identification of non-linearity, interaction and heteroscedasticity is discussed, and methods to overcome them (use of quadratic terms, cross-products and weighted least squares fitting, respectively) are described. Regression diagnostics are explained, with coverage of the (often neglected) inspection of residuals and a warning against over-optimistic interpretation of P values when a small number of descriptors have been trawled from a large data set. The problem of testing the true predictive power of regression models is explored, the inadequacy of some cross-validation methods is described and the need for test or evaluation data sets is emphasised. The greatest modelling problem is identified as collinearity among potential predictors. One solution to this problem is the selection of a set of descriptors that includes one representative of each collinear group, rejecting other related descriptors. This can be achieved by best sets, stepwise or genetic algorithms. The alternative is to combine collinear descriptors into principal component or partial least squares scores. It is shown that these techniques can produce stable, interpretable models if used rationally, but in incompetent hands they could easily lead to non-interpretable ‘black box’ models.
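As an illustration of the second approach to collinearity (combining collinear descriptors into principal component scores) and of evaluation on a held-out test set, the following is a minimal sketch using only NumPy. It is not taken from the chapter: the data are synthetic, and the function names `fit_pcr`, `predict_pcr` and `r_squared`, the number of components and the 40/20 train/test split are all assumptions made for the example.

```python
import numpy as np


def fit_pcr(X, y, n_components):
    """Principal components regression: regress y on the leading PC scores of X."""
    x_mean = X.mean(axis=0)
    x_std = X.std(axis=0, ddof=1)
    Z = (X - x_mean) / x_std  # standardise descriptors
    # SVD of the standardised descriptors; rows of Vt are the PC loadings
    _, _, Vt = np.linalg.svd(Z, full_matrices=False)
    loadings = Vt[:n_components].T
    scores = Z @ loadings
    # Ordinary least squares of y on an intercept plus the retained scores
    design = np.column_stack([np.ones(len(y)), scores])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return {"mean": x_mean, "std": x_std, "loadings": loadings, "coef": coef}


def predict_pcr(model, X):
    """Project new descriptors onto the stored loadings and apply the fit."""
    Z = (X - model["mean"]) / model["std"]
    scores = Z @ model["loadings"]
    return model["coef"][0] + scores @ model["coef"][1:]


def r_squared(y, y_hat):
    """Coefficient of determination for observed y and predictions y_hat."""
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1.0 - ss_res / ss_tot


# Synthetic data (assumption): three nearly collinear descriptors that share
# one latent property, plus one independent descriptor.
rng = np.random.default_rng(0)
n = 60
latent = rng.normal(size=n)
X = np.column_stack([
    latent + 0.05 * rng.normal(size=n),
    latent + 0.05 * rng.normal(size=n),
    latent + 0.05 * rng.normal(size=n),
    rng.normal(size=n),
])
y = 2.0 * latent + 0.5 * X[:, 3] + 0.3 * rng.normal(size=n)

# Fit on a training set and judge predictive power on compounds not used in fitting
train, test = slice(0, 40), slice(40, n)
model = fit_pcr(X[train], y[train], n_components=2)
r2_test = r_squared(y[test], predict_pcr(model, X[test]))
print(f"test-set R^2: {r2_test:.3f}")
```

Inspecting `model["loadings"]` shows which descriptors contribute to each retained component, which is one way to keep such a model interpretable rather than treating it as a black box.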