Chapter 10: Statistical Methods for Categorised Endpoints in In Silico Toxicology
Published: 28 Oct 2010
P. H. Rowe, in In Silico Toxicology, ed. M. Cronin and J. Madden, The Royal Society of Chemistry, 2010, ch. 10, pp. 252-274.
In this chapter, different statistical approaches to analysing categorical data (discriminant analysis, k-nearest neighbours and logistic regression) are discussed, along with recommendations as to the most appropriate method for a given query. Discriminant functions are equations that use chemical descriptors to generate scores such that different categories of substance (e.g. toxic and non-toxic) yield maximally separated values. A cut-off point for the scores is then chosen in one of three ways. It may be selected so that compounds are simply separated according to chemical similarity (each compound is assigned to the group it more closely resembles). Alternatively, account may be taken of the relative frequency of the two classes ('prior likelihood') to minimise the rate of misclassification. Finally, account may be taken of the relative costs of false positives and false negatives to minimise misclassification costs. For dichotomised data, binary logistic regression uses chemical descriptors to predict the probability that a substance will belong to a particular class. The technique can be extended to predict toxicity recorded on a multi-point ordinal scale, and this ordinal form may be considerably more powerful than the binary technique.
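The three ways of placing a discriminant cut-off can be illustrated with a minimal sketch. It assumes something the abstract does not state: that the discriminant scores of the two classes are normally distributed with a common variance. Under that assumption the cut-off has a closed form. The function name `cutoff`, its parameters and the numerical values are hypothetical and chosen only for illustration.

```python
import math


def cutoff(mean_neg, mean_pos, variance, prior_pos=0.5, cost_fp=1.0, cost_fn=1.0):
    """Cut-off on a one-dimensional discriminant score.

    Assumes (for illustration only) equal-variance normal score
    distributions for the non-toxic (neg) and toxic (pos) classes.
    A compound scoring above the cut-off is classified as toxic.
    """
    if mean_pos <= mean_neg:
        raise ValueError("mean_pos must exceed mean_neg")
    if not 0.0 < prior_pos < 1.0:
        raise ValueError("prior_pos must lie strictly between 0 and 1")
    prior_neg = 1.0 - prior_pos
    midpoint = (mean_neg + mean_pos) / 2.0
    # Shift the midpoint towards the rarer or cheaper-to-miss class.
    shift = (variance / (mean_pos - mean_neg)) * math.log(
        (prior_neg * cost_fp) / (prior_pos * cost_fn)
    )
    return midpoint + shift


# 1. Similarity only: equal priors and equal costs give the midpoint.
similarity_cut = cutoff(-1.0, 1.0, 1.0)

# 2. Priors: if only 20% of compounds are toxic, the cut-off moves up,
#    so fewer compounds are called toxic.
prior_cut = cutoff(-1.0, 1.0, 1.0, prior_pos=0.2)

# 3. Costs: if a false negative is four times as costly as a false
#    positive, the cut-off moves back down again.
cost_cut = cutoff(-1.0, 1.0, 1.0, prior_pos=0.2, cost_fn=4.0)

print(similarity_cut, prior_cut, cost_cut)
```

In this made-up case the fourfold cost of a false negative exactly offsets the fourfold prior odds in favour of the non-toxic class, so the third cut-off returns to the midpoint.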
k-Nearest Neighbour (k-NN) analysis predicts toxicity on the basis of closeness to other substances of known toxicity, in multi-dimensional space defined by chemical descriptors. Discriminant and k-NN analyses produce visually intuitive presentations when only two descriptors are used. k-NN analysis can tackle difficult patterns of data such as embedded classes that would otherwise be intractable. Used sensibly, both discriminant analysis and logistic regression can produce models capable of chemical and biological interpretation. Logistic regression is powerful, flexible, reasonably free of onerous statistical requirements and probably has the most attractive combination of characteristics, making it a good first choice.
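The embedded-class situation mentioned above can be sketched with a small k-NN classifier. The data are invented for illustration: a cluster of "toxic" compounds sits inside a ring of "non-toxic" ones in a space of two hypothetical descriptors, so no single straight-line cut-off separates them. The function name `knn_predict`, the choice of Euclidean distance and k = 3 are assumptions, not details taken from the chapter.

```python
import math
from collections import Counter


def knn_predict(training, query, k=3):
    """Majority vote among the k training compounds nearest to `query`.

    `training` is a list of ((descriptor_1, descriptor_2), label) pairs.
    Euclidean distance is assumed here; descriptors would normally be
    scaled to comparable ranges first.
    """
    nearest = sorted(training, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical embedded class: toxic compounds near the origin ...
toxic = [
    ((0.0, 0.0), "toxic"),
    ((0.3, 0.1), "toxic"),
    ((-0.2, 0.3), "toxic"),
    ((0.1, -0.3), "toxic"),
]
# ... surrounded by non-toxic compounds on a ring of radius 2.
ring = [
    ((2.0 * math.cos(i * math.pi / 4), 2.0 * math.sin(i * math.pi / 4)), "non-toxic")
    for i in range(8)
]
training = toxic + ring

centre_call = knn_predict(training, (0.1, 0.1))  # query inside the cluster
outer_call = knn_predict(training, (2.1, 0.0))   # query out on the ring

print(centre_call, outer_call)
```

With an odd k and two classes the vote cannot tie. In practice k would be tuned, for example by cross-validation.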