This chapter covers three regression techniques: multiple linear, principal components and partial least squares. Toxicity data recorded as continuously varying endpoints can be predicted by these techniques, all of which combine suitably weighted values of one or more chemical descriptors. Throughout the chapter, the need to produce models that are statistically stable, demonstrably predictive and capable of interpretation in biological and chemical terms is emphasised. The identification of non-linearity, interaction and heteroscedasticity is discussed, and methods to overcome them (quadratic terms, cross-products and weighted least squares fitting, respectively) are described. Regression diagnostics are explained, with coverage of the (often neglected) inspection of residuals and a warning against over-optimistic interpretation of P values when a small number of descriptors has been trawled from a large data set. The problem of testing the true predictive power of regression models is explored, the inadequacy of some cross-validation methods is described and the need for test or evaluation data sets is emphasised. The greatest modelling problem is identified as collinearity among potential predictors. One solution to this problem is the selection of a set of descriptors that includes one representative of each collinear group, rejecting other related descriptors. This can be achieved by best sets, stepwise or genetic algorithms. The alternative is to combine collinear descriptors into principal component or partial least squares scores. It is shown that these techniques can produce stable, interpretable models if used rationally, but that in incompetent hands they could easily lead to non-interpretable ‘black box’ models.
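
The idea of combining collinear descriptors into principal component scores can be illustrated with a minimal sketch. It is not the chapter's own procedure: the data are synthetic, the function name `pcr_fit` and the variables `x1`, `x2`, `x3` are hypothetical, and the descriptors are centred but not scaled, which is an assumption made only to keep the example short. Two nearly collinear descriptors are generated, the response is regressed on the two leading component scores, and the fit is mapped back to coefficients on the original descriptors.

```python
import numpy as np

def pcr_fit(X, y, n_components):
    """Principal components regression (hypothetical helper).

    Regresses y on the leading principal component scores of the centred
    descriptors and returns an intercept and descriptor-level coefficients.
    """
    x_mean = X.mean(axis=0)
    y_mean = y.mean()
    Xc = X - x_mean
    # SVD of the centred descriptors; rows of Vt are the component loadings,
    # ordered by decreasing variance.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    V = Vt[:n_components].T              # (p, k) loadings
    scores = Xc @ V                      # (n, k) mutually orthogonal scores
    # Least squares on orthogonal scores is well conditioned.
    gamma, *_ = np.linalg.lstsq(scores, y - y_mean, rcond=None)
    beta = V @ gamma                     # back to the original descriptors
    intercept = y_mean - x_mean @ beta
    return intercept, beta

# Synthetic data (an assumption for illustration only).
rng = np.random.default_rng(0)
n = 60
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.01, size=n)   # nearly collinear with x1
x3 = rng.normal(size=n)
X = np.column_stack([x1, x2, x3])
y = 1.0 + 2.0 * x1 + 2.0 * x2 - 1.5 * x3 + rng.normal(scale=0.1, size=n)

# Keep two components: the low-variance direction separating x1 from x2
# is discarded, so the collinear pair shares its weight almost equally.
intercept, beta = pcr_fit(X, y, n_components=2)
pred = intercept + X @ beta
r2 = 1 - np.sum((y - pred) ** 2) / np.sum((y - y.mean()) ** 2)
print("coefficients:", beta)
print("R squared (fit, not predictive power):", r2)
```

The R squared printed here measures fit to the training data only; as the abstract stresses, predictive power would have to be assessed on a separate test or evaluation set.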
