Chapter 1: Fragment Descriptors in SAR/QSAR/QSPR Studies, Molecular Similarity Analysis and in Virtual Screening
Published: 29 Sep 2008
Chemoinformatics1–5 is an emerging science that concerns the mixing of chemical information resources to transform data into information, and information into knowledge. It is a branch of theoretical chemistry based on its own molecular model, with its own basic concepts, learning approaches and areas of application. Unlike quantum chemistry, which considers molecules as ensembles of electrons and nuclei, or force field molecular mechanics or dynamics simulations based on a classical molecular model (“atoms” and “bonds”), chemoinformatics represents molecules as objects in a chemical space defined by molecular descriptors. Among thousands of descriptors, fragment descriptors occupy a special place. Fragment descriptors represent selected subgraphs of a 2D molecular graph; structure–property approaches use either their occurrence counts in molecules or binary values (0, 1) indicating their presence or absence in the given graph.
The unique properties of fragment descriptors are related to the fact that (i) any molecular graph invariant (i.e., any molecular descriptor or property) can be uniquely represented as a linear combination of fragment descriptors;7–9 (ii) any symmetric similarity measure can be uniquely expressed in terms of fragment descriptors;10,11 and (iii) any regression or classification structure–property model can be represented as a linear equation involving fragment descriptors.12,13
An important advantage of fragment descriptors is related to the simplicity of their calculation, storage and interpretation (see review articles14–18). They belong to information-based descriptors,19 which tend to code the information stored in molecular structures. This contrasts with knowledge-based (or semi-empirical) descriptors derived from consideration of the mechanism of action. Owing to their versatility, fragment descriptors can be used efficiently to build structure–property models and to perform similarity searching, virtual screening and in silico design of chemical compounds with desired properties.
This chapter reviews fragment descriptors with respect to their use in structure–property studies, similarity search and virtual screening. After a short historical survey, different types of fragment descriptors are considered thoroughly. This is followed by a brief review of the application of fragment descriptors in virtual screening, focusing mostly on filtering, similarity search and direct activity/property assessment using quantitative structure–property models.
1.2 Historical Survey
Among a multitude of descriptors currently used in Structure–Activity Relationships/Quantitative Structure–Activity Relationships/Quantitative Structure–Property Relationships (SAR/QSAR/QSPR) studies,20 fragment descriptors occupy a special place. Their application as atom and bond increments in the framework of additive schemes can be traced back to the 1930s–1950s; Vogel,21 Zahn,22 Souders,23,24 Franklin,25,26 Tatevskii,27,28 Bernstein,29 Laidler,30 Benson and Buss31 and Allen32 pioneered this field. In 1964, Smolenskii was one of the first to apply graph theory to the prediction of the physicochemical properties of organic compounds.33 Later on, these first additive schemes gradually evolved into group contribution methods. The latter are closely linked with thermodynamic approaches and are therefore applicable only to a limited number of properties.
The epoch of QSAR (Quantitative Structure–Activity Relationships) studies began in 1963–1964 with two seminal approaches: the σ-ρ-π analysis of Hansch and Fujita34,35 and the Free–Wilson method.36 The former approach involves three types of descriptors related to electronic, steric and hydrophobic characteristics of substituents, whereas the latter considers the substituents themselves as descriptors. Both approaches are confined to strictly congeneric series of compounds. The Free–Wilson method additionally requires all types of substituents to be sufficiently present in the training set. A combination of these two approaches has led to QSAR models involving indicator variables, which indicate the presence of some structural fragments in molecules.
The non-quantitative SAR (Structure–Activity Relationships) models developed in the 1970s by Hiller,37,38 Golender and Rosenblit,39,40 Piruzyan, Avidon et al.,41 Cramer,42 Brugger, Stuper and Jurs,43,44 and Hodes et al.45 were inspired by the then-popular artificial intelligence, expert systems, machine learning and pattern recognition paradigms. In those approaches, chemical structures were described by means of indicators of the presence of structural fragments interpreted as topological (or 2D) pharmacophores (biophores, toxophores, etc.) or topological pharmacophobes (biophobes, toxophobes, etc.). Chemical compounds were then classified as active or inactive with respect to certain types of biological activity.
Methodologies based on fragment descriptors in QSAR/QSPR studies are not strictly confined to particular types of properties or compounds. In the 1970s Adamson and coworkers46,47 were the first to apply fragment descriptors in multiple linear regression analysis to find correlations with some biological activities,48,49 physicochemical properties,50 and reactivity.51
An important class of fragment descriptors, the so-called screens (or structural keys, fingerprints), was also developed in the 1970s.52–56 As a rule, they are bit strings that can be stored and processed effectively by computers. Although their primary role is to provide efficient substructure searching in large chemical structure databases, they can also be used efficiently for similarity searching,57,58 clustering large chemical databases,59,60 assessing their diversity,61 as well as for SAR62 and QSAR63 modeling.
Another important contribution was made in 1980 by Cramer who invented BC(DEF) parameters obtained by means of factor analysis of the physical properties of 114 organic liquids. These parameters correlate strongly with various physical properties of diverse liquid organic compounds.64 On the other hand, they could be estimated by linear additive-constitutive models involving fragment descriptors.65 Thus, a set of QSPR models encompassing numerous physical properties of diverse organic compounds has been developed using only fragment descriptors.
One of the most important developments of the 1980s was the CASE (Computer-Automated Structure Evaluation) program by Klopman et al.66–69 This “self-learning artificial intelligent system”69 can recognize activating and deactivating fragments (biophores and biophobes) with respect to the given biological activity and use this information to determine the probability that a test chemical is active. This methodology has been successfully applied to predict various types of biological activity: mutagenicity,67,70,71 carcinogenicity,66,69,71–73 hallucinogenic activity,74 anticonvulsant activity,75 inhibitory activity with respect to sparteine monooxygenase,76 β-adrenergic activity,77 μ-receptor binding (opiate) activity,78 antibacterial activity,79 antileukemic activity,80 etc. Using the multivariate regression technique, CASE can also build quantitative models involving fragment descriptors.72,77
Starting in the early 1990s, various approaches and related software tools based on fragment descriptors have been developed and are listed in several conceptual and mini-review papers.14–18 Because of the wide scope and large variety of different approaches and applications in this field, many important ideas were reinvented many times and continue to be reinvented. In this review we try to present a clear state-of-the-art picture in this area.
1.3 Main Characteristics of Fragment Descriptors
In this section different types of fragments are classified with respect to their topology and the level of abstraction of molecular graphs.
1.3.1 Types of Fragments
A tremendous number of various fragments are used in structure–property studies: atoms, bonds, “topological torsions”, chains, cycles, atom- and bond-centered fragments, maximum common substructures, line notation (WLN and SMILES) fragments, atom pairs and topological multiplets, substituents and molecular frameworks, basic subgraphs, etc. Their detailed description is given below.
Depending on the application area, two types of values taken by fragment descriptors are considered: binary and integer. Binary values indicate the presence (true, yes, 1) or the absence (false, no, 0) of a given fragment in a structure. They are usually used as screens and elements of fingerprints for chemical database management and virtual screening using similarity-based approaches as well as in SAR studies. Integer values corresponding to the occurrences of fragments in structures are used in QSAR/QSPR modeling.
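The distinction between integer and binary descriptor values can be illustrated with a minimal sketch. The fragment vocabulary and the hand-written fragment list for ethanol below are hypothetical and serve only to show how the two kinds of vectors relate; no particular program's fragmentation scheme is implied.

```python
from collections import Counter

def descriptor_vectors(fragments_in_molecule, fragment_vocabulary):
    """Return (integer, binary) descriptor vectors over a fixed vocabulary."""
    counts = Counter(fragments_in_molecule)
    # Integer values: occurrence counts, as used in QSAR/QSPR modeling.
    integer_vector = [counts[f] for f in fragment_vocabulary]
    # Binary values: presence/absence, as used in screens and fingerprints.
    binary_vector = [1 if c > 0 else 0 for c in integer_vector]
    return integer_vector, binary_vector

# Hypothetical fragments of ethanol (heavy atoms and bonds only).
ethanol_fragments = ["C", "C", "O", "C-C", "C-O"]
vocabulary = ["C", "N", "O", "C-C", "C-O", "C=O"]

counts, bits = descriptor_vectors(ethanol_fragments, vocabulary)
print(counts)  # occurrence counts per vocabulary fragment
print(bits)    # presence/absence per vocabulary fragment
```

The binary vector is obtained from the integer one by thresholding, so it discards the information about how many times a fragment occurs.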
1.3.1.1 Simple Fixed Types
Disconnected atoms represent the simplest type of fragments. They are used to assess a chemical or biological property P in the framework of an additive scheme based on atomic contributions:
$$P = \sum_i n_i A_i \tag{1.1}$$
where ni is the number of atoms of type i and Ai is the corresponding atomic contribution. Usually, the atom types account for not only the type of chemical element but also hybridization, the number of attached hydrogen atoms (for heavy elements), occurrence in some groups or aromatic systems, etc. Nowadays, atom-based methods are used to predict some physicochemical properties and biological activities. Thus, several works have been devoted to assessing the octanol–water partition coefficient log P: the ALOGP method by Ghose and Crippen,81–83 later modified by Ghose and co-workers,84,85 and by Wildman and Crippen,86 the CHEMICALC-2 method by Suzuki and Kudo,87 the SMILOGP program by Convard and co-authors,88 and the XLOGP method by Wang and co-authors.89,90 Hou and co-authors91 used Equation (1.1) to calculate aqueous solubility. The ability of this approach to assess biological activities was demonstrated by Winkler et al.92
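As an illustration of the atom-based additive scheme, the following sketch sums atomic contributions weighted by atom-type counts. The atom-type names and the numerical contributions are invented for the example and are not taken from ALOGP, XLOGP or any other published parameter set.

```python
def additive_property(atom_type_counts, contributions):
    """Additive scheme: P = sum over atom types i of n_i * A_i."""
    return sum(n * contributions[atom_type]
               for atom_type, n in atom_type_counts.items())

# Hypothetical atomic contributions A_i (illustrative values only).
contributions = {"C_sp3": 0.5, "O_hydroxyl": -1.0, "H": 0.1}

# Atom-type counts n_i for ethanol under this hypothetical typing.
ethanol = {"C_sp3": 2, "O_hydroxyl": 1, "H": 6}

p_ethanol = additive_property(ethanol, contributions)
print(round(p_ethanol, 3))
```

In practice the contributions Ai would be fitted to experimental data, for example by multiple linear regression on the atom-type counts.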
Chemical bonds are another type of simple fragment. The first bond-based additive schemes, such as those of Zahn,22 Bernstein29,93 and Allen,32,94 appeared almost simultaneously with the atom-based ones and dealt mainly with predictions of some thermodynamic properties.
“Topological torsions”, invented by Nilakantan et al.,95 are defined as linear sequences of four consecutively bonded non-hydrogen atoms. Each atom is described by the type of the corresponding chemical element, the number of attached non-hydrogen atoms and the number of π-electron pairs. Molecular descriptors indicating the presence or absence of topological torsions in chemical structures have been used to perform qualitative predictions of biological activity in structure–activity (SAR) studies.95 Later on, Kearsley et al.96 recognized that characterizing atoms by element types can be too specific for similarity searching and, therefore, does not provide sufficient flexibility for large-scale virtual screening. To solve this problem, they suggested assigning atoms in Carhart's atom pairs and Nilakantan's topological torsions to one of seven classes: cations, anions, neutral hydrogen bond donors, neutral hydrogen bond acceptors, polar atoms, hydrophobic atoms and other.
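The definition of a topological torsion as a linear sequence of four consecutively bonded non-hydrogen atoms can be sketched as a simple path enumeration. The graph encoding (an adjacency dictionary over heavy atoms) and the label tuples (element, number of heavy neighbors, number of π-electron pairs) are assumptions made for this example; it is not the original implementation.

```python
def topological_torsions(adjacency, labels):
    """Enumerate each four-atom path a-b-c-d once, as a tuple of atom labels."""
    torsions = []
    for b in adjacency:
        for c in adjacency[b]:
            if b >= c:
                continue  # visit each central bond b-c only once
            for a in adjacency[b]:
                if a == c:
                    continue
                for d in adjacency[c]:
                    if d == b or d == a:
                        continue  # the four atoms must be distinct
                    path = (labels[a], labels[b], labels[c], labels[d])
                    # A path and its reverse denote the same torsion.
                    torsions.append(min(path, path[::-1]))
    return torsions

# n-Pentane, hydrogen-suppressed: C0-C1-C2-C3-C4.
pentane_adjacency = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
# Label: (element, number of heavy-atom neighbors, number of pi-electron pairs).
pentane_labels = {0: ("C", 1, 0), 1: ("C", 2, 0), 2: ("C", 2, 0),
                  3: ("C", 2, 0), 4: ("C", 1, 0)}

pentane_torsions = topological_torsions(pentane_adjacency, pentane_labels)
print(len(pentane_torsions), set(pentane_torsions))
```

For n-pentane the two paths C0–C3 and C1–C4 receive the same canonical label, so a binary descriptor would record one torsion type, while a count-based descriptor would record two occurrences.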
The above-mentioned structural fragments – atoms, bonds and topological torsions – can be regarded as chains of different lengths. Smolenskii33 suggested using the occurrences of chains in an additive scheme to predict the formation enthalpy of alkanes. For the last four decades, chain fragments have proved to be one of the most popular and useful types of fragment descriptors in QSPR/QSAR/SAR studies. Fragment descriptors based on enumerating chains in molecular graphs are efficiently used in many popular structure–property and structure–activity programs: CASE66–69 and MULTICASE (MultiCASE, MCASE) by Klopman,97,98 NASAWIN99 by Baskin et al., BIBIGON100 by Kumskov, TRAIL101,102 and ISIDA18 by Solov’ev and Varnek. “Molecular pathways” by Gakh and co-authors103 and “molecular walks” by Rücker104 also represent chains of atoms.
In contrast to chains, cyclic and polycyclic fragments are relatively rarely applied as descriptors in QSAR/QSPR studies. Nevertheless, cyclicity is implicitly accounted for by: (i) introducing special “cyclic” and “aromatic” types of atoms and bonds, (ii) “collapsing” whole cycles and even polycyclic systems into “pharmacophoric” pseudo-atoms and (iii) generating cyclic fragments as a part of large fragments [Maximum Common Substructure (MCS), molecular framework, substituents]. Besides, cyclic fragments are widely used as screens for chemical database processing.105,106
1.3.1.2 WLN and SMILES Fragments
WLN and SMILES fragments correspond respectively to substrings of the Wiswesser Line Notation107 or Simplified Molecular Input Line Entry System108,109 strings used for encoding the chemical structures. Since simple string operations are much faster than processing of information in connection tables, the use of WLN descriptors was justified in the 1970s when computers were still very slow. At that time Adamson and Bawden published some linear QSAR models based on WLN fragments.48,50,51,110,111 They have also applied this kind of descriptor for hierarchical cluster analysis and automatic classification of chemical structures.112 Qu et al.113,114 have developed AES (Advanced Encoding System), a new WLN-based notation encoding chemical information for group contribution methods. Interest in line notation descriptors has not disappeared completely with the advent of powerful computers. Thus, SMILES fragment descriptors are used in the SMILOGP program to predict log P,88 whereas the recently developed LINGO system for assessing some biophysical properties and intermolecular similarities uses holographic representations of canonical SMILES strings.115
1.3.1.3 Atom-centered Fragments
Atom-Centered Fragments (ACF) consist of a single central atom surrounded by one or several shells of atoms separated from the central one by the same topological distance. This type of structural fragment was introduced in the early 1950s by Tatevskii,27,28,116–119 and then by Benson31 to predict some physicochemical properties of organic compounds in the framework of additive schemes.
ACF fragments containing only one shell of atoms around the central one (i.e., atom-centered neighborhoods of radius 1) were introduced into chemoinformatics practice in 1971 under the names “atom-centered fragments” and “augmented atoms” by Adamson,120,121 who studied their distribution in large chemical databases with the intention of using them as screens in chemical database searching. Hodes used, in SAR studies, both “augmented atoms”45 and “ganglia augmented atoms”325 representing ACF fragments with radius 2 and generalized second-shell atoms. Subsequently, ACF fragments with radius 1 were implemented in NASAWIN,122–124 TRAIL101,102,125 and ISIDA18 programs. ACF fragments with arbitrary radius were implemented by Filimonov, Poroikov and co-authors in the PASS126 program under the name Multilevel Neighborhoods of Atoms (MNA),127 by Xing and Glen as “tree structured fingerprints”,128 by Bender and Glen as “atom environments”129,130 and “circular fingerprints”131–133 (Figure 1.1), and by Faulon as “molecular signatures”.134–136
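The shell structure of an atom-centered fragment can be sketched with a breadth-first traversal from the central atom. The string encoding used below (shells separated by "|", atoms within a shell sorted and separated by ",") is a simplification invented for this example: it records which atoms lie at each topological distance but not how the shells are connected, so it is not equivalent to MNA descriptors, circular fingerprints or molecular signatures.

```python
def atom_centered_fragment(adjacency, elements, center, radius):
    """Encode the shells of atoms at topological distance 0..radius from center."""
    shells = [[elements[center]]]
    visited = {center}
    frontier = [center]
    for _ in range(radius):
        next_frontier = []
        for atom in frontier:
            for neighbor in adjacency[atom]:
                if neighbor not in visited:
                    visited.add(neighbor)
                    next_frontier.append(neighbor)
        if not next_frontier:
            break  # the molecule has no atoms further away
        shells.append(sorted(elements[a] for a in next_frontier))
        frontier = next_frontier
    return "|".join(",".join(shell) for shell in shells)

# Ethanol, hydrogen-suppressed: C0-C1-O2.
ethanol_adjacency = {0: [1], 1: [0, 2], 2: [1]}
ethanol_elements = {0: "C", 1: "C", 2: "O"}

for atom in ethanol_adjacency:
    print(atom, atom_centered_fragment(ethanol_adjacency, ethanol_elements, atom, 1))
```

With radius 1 this corresponds to the “augmented atom” idea; larger radii give progressively more specific fragments.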
Several types of ACF fragments were designed to store local spectral parameters (chemical shifts) in spectroscopy databases. Thus, Bremser has developed Hierarchically Ordered Spherical Environment (HOSE), a system of substructure codes aimed at characterizing the spherical environment of single atoms and complete ring systems.137 The codes are generated automatically from 2D graphs and describe structural entities corresponding to chemical shifts. A very similar idea has also been implemented by Dubois et al. in the DARC system based on FREL (Fragment Réduit à un Environnement Limité) fragments.138,139 Xiao et al. have applied Atom-Centered Multilayer Code (ACMC) fragments for structural and substructural searching in large databases of compounds and reactions.140 An important recent application of ACF fragments concerns target prediction (“target fishing”) in chemogenomic data analysis.126,141,142
1.3.1.4 Bond-centered Fragments
Bond-centered fragments (BCF) consist of two atoms linked by a bond and surrounded by one or several shells of atoms separated by the same topological distance from this bond. Although these fragments are rather rarely used in structure–property studies, they can be efficiently used as screens for chemical database processing.143 BCF have been used as a part of MDL keys144,145 for substructure search in chemical databases, database clustering60 and for SAR studies of 17 different types of biological activity.62 Bond-centered fragments have also been used in the DARC system.138,139
1.3.1.5 Maximum Common Substructures
For a set of molecular graphs, a Maximum Common Substructure (MCS) is defined as the largest substructure common to all graphs belonging to the given set. In most practical applications, only MCS for graph pairs are considered, i.e., for sets containing only two graphs. MCS can be found by intersecting molecular graphs using several different algorithms (for a review see ref. 146), the best known of which involve clique detection in so-called compatibility graphs. Notably, a pair of graphs can have more than one MCS. The main advantage of MCS fragments is that their complexity is not limited and therefore they can be used to detect property-relevant features that could not be detected by fragments (subgraphs) of limited complexity.
MCSs were first applied to SAR studies in the early 1980s by Rosenblit and Golender in the framework of their logical-combinatorial approach.40,41,147 Since at that time computer power was limited, the authors suggested the use of reduced graphs (Section 1.3.5) built on pharmacophoric centers. The MCS fragments were subsequently applied to perform a similarity search,148 to cluster chemical databases149,150 as well as to assess biological activities of organic compounds.99,151,152
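The clique-detection route to an MCS can be sketched for a pair of small graphs. In the compatibility graph built below, each node is a pair of atoms (one from each molecule) with the same label, and two nodes are joined when the two atom pairs agree on bonding. A maximum clique then gives a largest common induced subgraph, which may be disconnected. The data layout, the function names and the plain Bron–Kerbosch search are choices made for this illustration; the search is exponential in the worst case and is only practical for very small graphs.

```python
def compatibility_graph(labels_a, bonds_a, labels_b, bonds_b):
    """Nodes: label-matched atom pairs; edges: pairs that agree on bonding."""
    bonded_a = {frozenset(bond) for bond in bonds_a}
    bonded_b = {frozenset(bond) for bond in bonds_b}
    nodes = [(i, j) for i in labels_a for j in labels_b
             if labels_a[i] == labels_b[j]]
    neighbors = {node: set() for node in nodes}
    for (i1, j1) in nodes:
        for (i2, j2) in nodes:
            if i1 == i2 or j1 == j2:
                continue  # an atom may be matched only once
            same_bonding = ((frozenset((i1, i2)) in bonded_a)
                            == (frozenset((j1, j2)) in bonded_b))
            if same_bonding:
                neighbors[(i1, j1)].add((i2, j2))
    return neighbors

def maximum_clique(neighbors):
    """Bron-Kerbosch enumeration of maximal cliques, keeping the largest one."""
    best = [set()]

    def expand(clique, candidates, excluded):
        if not candidates and not excluded:
            if len(clique) > len(best[0]):
                best[0] = set(clique)
            return
        for node in list(candidates):
            expand(clique | {node},
                   candidates & neighbors[node],
                   excluded & neighbors[node])
            candidates = candidates - {node}
            excluded = excluded | {node}

    expand(set(), set(neighbors), set())
    return best[0]

# Ethanol (C0-C1-O2) versus dimethyl ether (C0-O1-C2), hydrogen-suppressed.
ethanol_labels, ethanol_bonds = {0: "C", 1: "C", 2: "O"}, [(0, 1), (1, 2)]
ether_labels, ether_bonds = {0: "C", 1: "O", 2: "C"}, [(0, 1), (1, 2)]

graph = compatibility_graph(ethanol_labels, ethanol_bonds, ether_labels, ether_bonds)
mcs_mapping = maximum_clique(graph)
print(sorted(mcs_mapping))  # atom-to-atom mapping of one MCS
```

For this pair the common substructure is a C–O bond, and it can be mapped onto dimethyl ether in two equivalent ways, which illustrates that the MCS of a graph pair need not be unique.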
1.3.1.6 Atom Pairs and Topological Multiplets
Characterizing atoms only by element types is too specific for similarity searching and, therefore, does not provide sufficient flexibility for large-scale virtual screening. For that reason, numerous studies have been devoted to increasing the informational content of fragment descriptors by adding some useful empirical information and/or by representing a part of the molecular graph implicitly. The simplest representatives of such descriptors are “atom pairs” and “topological multiplets”, based on the notion of a “descriptor center” representing an atom or a group of atoms that could serve as a center of intermolecular interactions. Usually, descriptor centers include heteroatoms, unsaturated bonds and aromatic cycles. An atom pair is defined as a pair of atoms (AT) or descriptor centers separated by a fixed topological distance: ATi-Distij-ATj, where Distij is the length of the shortest path (the number of bonds) between ATi and ATj. Analogously, a topological multiplet is defined as a multiplet (usually triplet) of descriptor centers and the topological distances between each pair of them. In most cases, these descriptors are used in binary form to indicate the presence or absence of the corresponding features in the studied chemical structures.
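The ATi-Distij-ATj encoding can be sketched by computing shortest-path distances with a breadth-first search and labeling every atom pair. Plain element symbols are used as atom types here for brevity; this is an assumption of the example, and real implementations use richer atom types or descriptor centers.

```python
from collections import Counter, deque

def shortest_path_lengths(adjacency, source):
    """Topological distances (numbers of bonds) from source to reachable atoms."""
    distance = {source: 0}
    queue = deque([source])
    while queue:
        atom = queue.popleft()
        for neighbor in adjacency[atom]:
            if neighbor not in distance:
                distance[neighbor] = distance[atom] + 1
                queue.append(neighbor)
    return distance

def atom_pairs(adjacency, atom_types):
    """Count atom-pair descriptors of the form 'TypeI-Distance-TypeJ'."""
    pairs = Counter()
    atoms = sorted(adjacency)
    for i in atoms:
        distance = shortest_path_lengths(adjacency, i)
        for j in atoms:
            if j > i and j in distance:
                # Sort the two types so that A-d-B and B-d-A coincide.
                first, second = sorted((atom_types[i], atom_types[j]))
                pairs[f"{first}-{distance[j]}-{second}"] += 1
    return pairs

# Ethanol, hydrogen-suppressed: C0-C1-O2.
ethanol_adjacency = {0: [1], 1: [0, 2], 2: [1]}
ethanol_types = {0: "C", 1: "C", 2: "O"}

ethanol_pairs = atom_pairs(ethanol_adjacency, ethanol_types)
print(dict(ethanol_pairs))
```

Replacing the element symbols by pharmacophoric classes, and exact distances by distance ranges, leads toward the fuzzy pharmacophore descriptors discussed in this section.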
Atom pairs were first suggested for SAR studies by Avidon as Substructure Superposition Fragment Notation (SSFN).41,153 They were then independently reinvented by Carhart and co-authors154 for similarity and trend vector analysis. In contrast to SSFN, Carhart's atom pairs are not necessarily composed only of descriptor centers but account for the information about element type, the number of bonded non-hydrogen neighbors and the number of π electrons. Nowadays, Carhart's atom pairs are popular in virtual screening. Topological Fuzzy Bipolar Pharmacophore Autocorrelograms (TFBPA)155 by Horvath are based on atom pairs in which real atoms are replaced by pharmacophore sites (hydrophobic, aromatic, hydrogen bond acceptor, hydrogen bond donor, cation, anion), while Distij corresponds to different ranges of topological distances between pharmacophores. These descriptors were successfully applied in virtual screening against a panel of 42 biological targets using a similarity search based on several fuzzy and non-fuzzy metrics,156 performing only slightly less well than their 3D counterparts.155 Fuzzy Pharmacophore Triplets (FPT) by Horvath157 are an extension of FBPF156 to three-site pharmacophores. An important innovation in FPT is that they account for protolytic equilibrium as a function of pH.157 Owing to this feature, even small structural modifications leading to a pKa shift may have a profound effect on the fuzzy pharmacophore triplets. As a result, these descriptors efficiently discriminate structurally similar compounds exhibiting significantly different activities.157
Some other topological triplets should be mentioned. Similog pharmacophoric keys by Schuffenhauer et al.158 represent triplets of binary coded types of atoms (pharmacophoric centers) and topological distances between them (Figure 1.2). Atomic types are generalized by four features (represented as four bits per atom): potential hydrogen bond donor, potential hydrogen bond acceptor, bulkiness and electropositivity. The “topological pharmacophore-point triangles” implemented in the MOE software159 represent triplets of MOE atom types separated by binned topological distances. Structure–property models obtained by a support vector machine method with these descriptors have been successfully used for virtual screening of COX-2 inhibitors160 and D3 dopamine receptor ligands.161
1.3.1.7 Substituents and Molecular Frameworks
In organic chemistry, decomposition of molecules into substituents and molecular frameworks is a natural way to characterize molecular structures. In QSAR, both the Hansch–Fujita34,35 and the Free–Wilson36 classical approaches are based on this decomposition, but only the second one explicitly accounts for the presence or the absence of substituent(s) attached to the molecular framework at a certain position. While the Free–Wilson method was originally associated with the multiple linear regression technique, recent modifications of this approach involve more sophisticated statistical and machine-learning approaches, such as principal component analysis162 and neural networks.163
In contrast to substituents, molecular frameworks are rarely used in SAR/QSAR/QSPR studies. In most cases, they are implicitly involved as indicator variables discriminating different types of molecular motifs (see, for example, ref. 164). The distributions of different molecular frameworks and substituents (side chains) in databases of known drug molecules have been thoroughly studied by Bemis and Murcko.165,166
1.3.1.8 Basic Subgraphs
Regarding fragment descriptors, one could imagine a huge number of possibilities to split a molecular graph into constituent fragments. Making a parallel with the decomposition of vectors into a limited number of basis functions, Randić326 suggested the existence of a small set of basic subgraphs representing any structure and which could be used to calculate any molecular property. In particular, for small alkanes a set of disconnected graphs representing paths (chains) of different length has been proposed (Figure 1.3).
However, it was later found that this set is not sufficient to differentiate any two structures. Skvortsova et al. have extended the set of Randić basic subgraphs by including cyclic fragments and more complex subgraphs consisting of a single node attached to a cyclic fragment.167 This set exhibits good coding uniqueness (i.e., different vectors of descriptors correspond to different structures) and coding completeness (i.e., they can approximate numerous structure–property functions). Basic fragment descriptors of this kind were used in several QSPR studies.168
In fact, a rigorous solution of the problem of finding a set of basic graph invariants was obtained by Mnukhin169 for simple graphs and then extended to molecular graphs by Baskin, Skvortsova et al.7–9 (Figure 1.4). It has been shown that the complete set of basic graph invariants can be built only on all possible subgraphs, and hence cannot be confined to any subset of limited size. Nonetheless, for many practical tasks the application of a limited number of basic subgraphs and the corresponding fragment descriptors could be useful.
Another application of basic subgraphs arises from the possibility8,169 of relating the invariants of molecular graphs to the occurrence numbers of some basic subgraphs. Estrada has developed this methodology for spectral moments of the edge-adjacency matrix of molecular graphs – defined as the traces of the different powers of this matrix:170–172
$$\mu_k = \mathrm{tr}\left(\mathbf{E}^k\right)$$
where μk is the k-th spectral moment of the edge-adjacency matrix E (a symmetric matrix whose elements eij are 1 if edge i is adjacent to edge j and 0 otherwise) and tr is the trace, i.e. the sum of the diagonal elements of the matrix. On the other hand, spectral moments can be expressed as linear combinations of the occurrence numbers of certain structural fragments in the molecular graph. These linear combinations for simple molecular graphs not containing heteroatoms have been reported for acyclic170 and cyclic172 chemical structures.
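The definition of spectral moments can be checked numerically on small alkane graphs. The sketch below builds the edge-adjacency matrix from a bond list and computes the trace of its k-th power with plain Python lists; the function names and the bond-list encoding are assumptions of this example, not part of the TOPS-MODE software.

```python
def edge_adjacency_matrix(bonds):
    """E[i][j] = 1 if distinct bonds i and j share an atom, else 0."""
    n = len(bonds)
    return [[1 if i != j and set(bonds[i]) & set(bonds[j]) else 0
             for j in range(n)] for i in range(n)]

def spectral_moment(matrix, k):
    """k-th spectral moment: trace of the k-th power of the matrix."""
    n = len(matrix)
    # Start from the identity matrix and multiply by the matrix k times.
    power = [[1 if i == j else 0 for j in range(n)] for i in range(n)]
    for _ in range(k):
        power = [[sum(power[i][m] * matrix[m][j] for m in range(n))
                  for j in range(n)] for i in range(n)]
    return sum(power[i][i] for i in range(n))

# Hydrogen-suppressed carbon skeletons given as bond lists.
butane_bonds = [(0, 1), (1, 2), (2, 3)]     # n-butane
isobutane_bonds = [(0, 1), (1, 2), (1, 3)]  # 2-methylpropane

for name, bonds in [("n-butane", butane_bonds), ("isobutane", isobutane_bonds)]:
    E = edge_adjacency_matrix(bonds)
    print(name, [spectral_moment(E, k) for k in range(4)])
```

In both molecules the zeroth moment equals the number of bonds and the second moment equals twice the number of pairs of adjacent bonds, which is a simple instance of a spectral moment expressed through fragment occurrence numbers. The third moment is non-zero only for isobutane, where three bonds meet at one atom.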
To illustrate these notions, consider a correlation between the boiling points of alkanes and their spectral moments reported in ref. 170:
where |Fi| denotes the occurrence number of subgraph Fi in the molecular graph.
Thus, any spectral moment and hence the activities/properties of chemical compounds can be represented by contributions of corresponding fragments. This approach was further extended to molecular graphs containing heteroatoms by weighting the diagonal elements of the bond adjacency matrix.171
This methodology has been implemented in the TOSS-MODE (TOpological SubStructural MOlecular Design) and TOPS-MODE (TOPological Substructural MOlecular DEsign) methods,173 which were successfully used to assess various physicochemical properties of chemical compounds (retention indices in chromatography,174 diamagnetic and magnetooptic properties,175 dipole moments,176 permeability coefficients through low-density polyethylene,177 etc.), 3D parameters178 and different types of biological activity (sedative/hypnotic activity,173 anti-cancer activity,179 anti-HIV activity,180 skin sensitization,181 herbicide activity,182 affinity to A1 adenosine receptor,183 inhibition of cyclooxygenase,184 antibacterial activity,185 toxicity in Tetrahymena pyriformis,186 mutagenicity,187–189 etc.).
1.3.1.9 Mined Subgraphs
The notion of mined subgraphs is closely linked to graph mining (or subgraph mining), the search for graphs (subgraphs) specifically related to some properties or activities.190–195 The advantage of this approach is that the relevant fragments are made available for analysis without the need to consider the almost infinite number of all possible subgraphs, which allows one to select the most “useful” fragments. This methodology196,197 is based on efficient algorithms for mining the most frequent fragments occurring in sets of molecular graphs, such as the AGM (Apriori-based Graph Mining) algorithm by Inokuchi et al.,198 the FSG (Frequent Sub-Graphs) algorithm by Kuramochi and Karypis,199 the chemical sub-structure discovery algorithm by Borgelt and Berthold,200 the gSpan (graph-based Substructure pattern mining) algorithm by Yan and Han,194 the TreeMiner algorithm by Zaki201 and the HybridTreeMiner and CMTreeMiner algorithms by Chi, Yang and Muntz,202,203 etc. The mined subgraphs approach was originally used to classify chemical structures.204,205 Weighted substructure mining, in conjunction with linear programming boosting,206 allows one to build QSAR regression models involving mined fragment descriptors.195
1.3.1.10 Random Subgraphs
The success of different fragmentation schemes in SAR/QSAR studies strongly depends on the initial choice of relevant fragment types. Since it is unrealistic to consider all possible fragments because of their enormous number, one always has to select a small subset of them. However, any attempt to apply a limited subtype (e.g., only chains of a user-specified length) risks being inefficient because important fragments may be missed. One possible solution is to generate substructural fragments using stochastic techniques. Such an approach has been used by Graham et al., who generated “tape recordings” of chemical structures from atom-bond-atom fragments extracted from molecular graphs by random walks.207 In the MolBlaster method by Batista, Godden and Bajorath, the program generates for each molecule a “random fragment profile” representing a population of fragments generated by randomly deleting bonds in the hydrogen-suppressed molecular graph.208 This method was successfully applied in similarity-based virtual screening.209
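The idea of generating fragments by random bond deletion can be sketched as follows. This is only loosely modeled on the description of random fragment profiles given above and is not the MolBlaster implementation: the number of deleted bonds, the number of iterations, the seed and the representation of a fragment as a set of atom indices are all assumptions of this example.

```python
import random
from collections import Counter

def connected_components(atoms, bonds):
    """Split a graph into connected components, returned as frozensets of atoms."""
    adjacency = {atom: set() for atom in atoms}
    for a, b in bonds:
        adjacency[a].add(b)
        adjacency[b].add(a)
    seen, components = set(), []
    for start in atoms:
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            atom = stack.pop()
            if atom in component:
                continue
            component.add(atom)
            stack.extend(adjacency[atom] - component)
        seen |= component
        components.append(frozenset(component))
    return components

def random_fragment_profile(atoms, bonds, n_deletions, n_iterations, seed=0):
    """Count the fragments obtained by repeatedly deleting random bonds."""
    rng = random.Random(seed)  # fixed seed for reproducibility
    profile = Counter()
    for _ in range(n_iterations):
        deleted = set(rng.sample(range(len(bonds)), n_deletions))
        kept = [bond for index, bond in enumerate(bonds) if index not in deleted]
        for fragment in connected_components(atoms, kept):
            profile[fragment] += 1
    return profile

# n-Butane skeleton: deleting one of its three bonds always gives two fragments.
butane_atoms = [0, 1, 2, 3]
butane_bonds = [(0, 1), (1, 2), (2, 3)]
profile = random_fragment_profile(butane_atoms, butane_bonds,
                                  n_deletions=1, n_iterations=10)
print(profile.most_common(3))
```

A practical implementation would canonicalize each fragment (for example as a canonical line-notation string) so that profiles of different molecules can be compared.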
1.3.1.11 Library Subgraphs
Many studies employ fixed sets of fragments taken from libraries of preselected fragments. Thus, most additive schemes and group contribution methods have been derived using fixed sets of fragments. Some SAR/QSAR/QSPR expert systems also employ fixed sets of selected fragments and often apply an internal language specifically designed for handling the descriptor lists. For example, to describe fragments, the DEREK expert system for assessing toxicity uses the PATRAN language,210 whereas the ALogP method86 for predicting the octanol–water partition coefficient log P is based on the SMARTS line notation [as implemented in the MOE (Molecular Operating Environment) software suite159].
1.3.2 Fragments Describing Supramolecular Systems and Chemical Reactions
Using “special” bond types, molecular graphs can represent not only individual molecules but also more complex species: supramolecular systems, chemical reactions and polymers with periodic structure. For example, the ISIDA program can recognize a “coordination bond” between the central metal atom and the donor atoms of the ligand in metal complexes and a “hydrogen bond” in supramolecular assemblies.32 Varnek et al. used fragment descriptors derived from “supramolecular” graphs in QSPR modeling of free energy and enthalpy of formation of 1 : 1 hydrogen bonded complexes.18
The concept of molecular graphs can also be expanded to describe chemical reactions by introducing special types of “dynamical” bonds corresponding to formation, modification and breaking of chemical bonds (for a review see ref. 211). The resulting reaction graph contains all necessary information to reconstruct both reactants and products in the corresponding reaction equation. Partial reaction graphs containing only “dynamical” bonds were used to classify and enumerate organic reactions in the framework of Ugi–Dugundji matrix formalism212 and the Zefirov–Tratch formal-logical approach.213,214 Vladutz condensed reactants and products of a chemical reaction into a single Superimposed Reaction Skeleton Graph (SRSG)215 containing both dynamical and conventional (not modified in the reaction) bonds. Similar reaction graphs under the name “imaginary transition state” were also suggested by Fujita216,217 for classification and enumeration of organic reactions. This approach has been extended recently by Varnek et al.18 in Condensed Graphs of Reactions (CGRs) containing both “dynamical” and conventional bonds (Figure 1.6). Fragment descriptors derived from CGRs were used in similarity search of reactions, in reaction classification and in the development of QSPR models of the rate constant of SN2 reactions in water.218
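The construction of a reaction graph with “dynamical” bonds can be sketched by superimposing the bond tables of reactants and products. The sketch assumes that an atom-to-atom mapping is already known (the same index denotes the same atom on both sides) and labels each bond by a pair (order before, order after), with 0 meaning no bond. This labeling is a convention chosen for the example and is not the encoding used in ISIDA or in any other program mentioned above.

```python
def condensed_graph_of_reaction(reactant_bonds, product_bonds):
    """Label every bond by its (order in reactants, order in products)."""
    cgr = {}
    for bond in set(reactant_bonds) | set(product_bonds):
        before = reactant_bonds.get(bond, 0)  # 0 means the bond is absent
        after = product_bonds.get(bond, 0)
        cgr[bond] = (before, after)
    return cgr

def dynamical_bonds(cgr):
    """Bonds that are formed, broken or change order in the reaction."""
    return {bond: orders for bond, orders in cgr.items()
            if orders[0] != orders[1]}

# Substitution of chloride by hydroxide in chloroethane (heavy atoms only):
# atoms 1 = C (methyl), 2 = C (methylene), 3 = Cl, 4 = O.
reactants = {frozenset((1, 2)): 1, frozenset((2, 3)): 1}
products = {frozenset((1, 2)): 1, frozenset((2, 4)): 1}

cgr = condensed_graph_of_reaction(reactants, products)
changed = dynamical_bonds(cgr)
for bond, (before, after) in sorted(changed.items(), key=lambda item: sorted(item[0])):
    print(sorted(bond), before, "->", after)
```

Keeping only the dynamical bonds corresponds to a partial reaction graph, whereas the full dictionary also retains the conventional C–C bond, as in a condensed graph of reaction.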
To encode reaction transformations Borodina et al. have developed Reacting Multilevel Neighborhood of Atom (RMNA)219 descriptors representing an extended version of the MNA descriptors. Unlike CGRs, where reaction information is condensed, in the RMNA approach the information about modified, created or broken bonds is added to the list of the MNA descriptors generated for all products and reactants. The RMNA descriptors were applied to predict metabolic P450-mediated aromatic hydroxylation.219
1.3.3 Storage of Fragment Information
This section discusses different techniques for storing information about molecular fragments. The most common way is to present a given chemical structure as a fixed-size array (vector), in which each element corresponds to the occurrence of a given molecular fragment. Structural keys are descriptor vectors containing binary values indicating the presence or absence of fragments. Since structural keys can be kept in computer memory as bit strings, they are processed very rapidly, which explains their popularity in chemical database management, similarity search, SAR/QSAR studies and virtual screening (Figure 1.7).
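A minimal sketch of a structural key is given here, assuming a small hand-written fragment dictionary in which bit position i corresponds to the i-th fragment. The dictionary, the fragment codes and the function names are hypothetical and do not reproduce any published key set.

```python
# Hypothetical fragment dictionary: bit i corresponds to FRAGMENT_KEYS[i].
FRAGMENT_KEYS = ["C-C", "C=O", "C-N", "C-O", "c:c", "C-Cl"]


def structural_key(fragments):
    """Pack the presence/absence of each dictionary fragment into an integer bit string."""
    key = 0
    for position, name in enumerate(FRAGMENT_KEYS):
        if name in fragments:
            key |= 1 << position
    return key


def contains_all(key, query_key):
    """Pre-screen: every bit set in the query key must also be set in the key."""
    return (key & query_key) == query_key


# Fragment sets would normally come from a fragmentation program;
# here they are typed by hand.
acetamide = structural_key({"C-C", "C=O", "C-N"})
query = structural_key({"C=O", "C-N"})
```

Because the comparison reduces to a single bitwise AND, such a pre-screen is cheap even for large collections, which is the property the text refers to.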
The composition and length of structural keys always depend on the choice of constituent fragments. Often, structural keys become very sparse, i.e., they contain very few non-zero values. Such a highly imbalanced data representation is rather inefficient for computer processing. As a partial solution to this problem, fragment descriptors can be stored as a list containing only the codes (names) of the fragments that are “ON”. Although the use of lists reduces the storage size, substructure searching of large databases with them is still time consuming.
Search efficiency can be improved significantly by using hash tables, which directly link the name of a descriptor to the location of its value. This technology is used in hashed molecular fingerprints operating with binary values (Figure 1.8). In contrast to structural keys, in molecular fingerprints each fragment is mapped onto several cells, the positions of which are computed from the fragment code. The advantage of hashed fingerprints is the possibility of including a large number of fragments in a bit string of reasonable length. Their drawback is the existence of collisions, when two or more fragments are mapped onto the same bit. Nonetheless, this problem can be alleviated by a trade-off between the length of the bit string, the number of fragment types and the number of bits allocated to each fragment.
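The mapping of fragments onto several bits can be illustrated as follows. This is only a sketch: SHA-256 is used as a convenient stand-in hash, and the bit-string length, the number of bits per fragment and the fragment codes are assumed values, not those of any commercial fingerprint.

```python
import hashlib


def bit_positions(fragment, n_bits, bits_per_fragment):
    """Derive several pseudo-random bit positions from the fragment code."""
    positions = set()
    for k in range(bits_per_fragment):
        digest = hashlib.sha256(f"{fragment}|{k}".encode("utf-8")).digest()
        positions.add(int.from_bytes(digest[:4], "big") % n_bits)
    return positions


def hashed_fingerprint(fragments, n_bits=64, bits_per_fragment=2):
    """Return the set of bits switched on by all fragments of a molecule."""
    on_bits = set()
    for fragment in fragments:
        on_bits |= bit_positions(fragment, n_bits, bits_per_fragment)
    return on_bits


fp = hashed_fingerprint({"C-C", "C=O", "C-N"})
```

No fragment dictionary is needed, so any number of fragment types can be folded into the same bit string. Shortening `n_bits` or raising `bits_per_fragment` increases the chance that different fragments share a bit.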
An interesting way of encoding structural information is realized in molecular holograms, which represent an integer array of bins of predetermined length (hologram length) that contains information about the occurrences of fragments. In the course of generating a molecular hologram, each fragment is coded using the SLN (SYBYL Line Notation).220 Using the cyclic redundancy check (CRC) algorithm,221 this code is transformed into a fragment integer ID, indicating the location of the particular bin in the molecular hologram (Figure 1.9). The occupancy of a bin is then incremented by one each time the corresponding fragment occurs. Since the hologram length is always smaller than the number of fragments, several different fragments map to the same bin in the molecular hologram. The resulting bin occupancy is equal to the sum of the occurrence numbers of all these fragments. Molecular holograms were specially designed to be used in the Holographic QSAR (HQSAR) approach.63
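The binning step can be sketched as below. Python's `zlib.crc32` serves as a stand-in for the CRC algorithm, and hand-written strings replace real SLN codes; both are assumptions made for illustration, so the bin indices will not match those of HQSAR.

```python
import zlib
from collections import Counter


def molecular_hologram(fragment_counts, length):
    """Accumulate fragment occurrences into `length` integer bins.

    The bin index is the CRC32 of the fragment code modulo the hologram
    length, so different fragments may share a bin and their counts add up.
    """
    bins = [0] * length
    for code, count in fragment_counts.items():
        bins[zlib.crc32(code.encode("utf-8")) % length] += count
    return bins


# Hand-written fragment codes stand in for SLN strings.
counts = Counter({"CH3CH2": 2, "CH2OH": 1, "C=O": 1, "CH2CH2": 3})
hologram = molecular_hologram(counts, length=3)
```

With four fragment types and only three bins, at least two fragments necessarily fall into the same bin, yet the total occupancy still equals the total number of fragment occurrences.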
1.3.4 Fragment Connectivity
Fragments used for building fragment descriptors can be connected and disconnected. Most applications are based on connected fragments. The point is that the indicators of presence or occurrences of disconnected fragments can always be expressed through the corresponding values obtained for connected fragments.8 Hence, descriptors based on disconnected fragments are redundant, since they do not carry any additional information compared to their connected counterparts.
Nonetheless, in some cases disconnected fragment descriptors could simplify QSAR/QSPR equations. In particular, nonlinear models involving connected fragments can be replaced with linear models built on disconnected fragments, because the occurrences of disconnected and connected fragments are nonlinearly related. Thus, the use of disconnected fragments may be viewed as an implicit way of introducing nonlinearity into QSARs/QSPRs. If binary descriptor values are used, disconnected fragments implicitly introduce conjunctions (logical .AND.) into logical expressions instead of nonlinear terms for connected fragments. Tarasov et al.222 have shown that compound structural descriptors defined as combinations of unrelated fragments significantly improve the efficiency of mutagenicity predictions. Implicitly, disconnected fragments, as conjunctions of binary (logical) connected fragment descriptors, were used to build probabilistic SAR models for some biological activities (see ref. 223 and references therein).
1.3.5 Generic Graphs
In contrast to QSPR studies based on complete (containing all atoms) or hydrogen-suppressed molecular graphs, assessment of biological activity, especially at the qualitative level, often requires greater generalization. In that case, it is convenient to describe chemical structures by reduced graphs, in which each vertex – descriptor center or pharmacophoric center – represents an atom or a group of atoms capable of interacting with biological targets, whereas each edge measures the number of bonds between them. Such a biology-oriented representation of chemical structures was invented in 1982 by Avidon et al. under the name Descriptor Center Connection Graphs (DCCG)41 as a generalization of the SSFN descriptors described elsewhere in this chapter.
Figure 1.10(b) shows the DCCG for phenothiazine. In this case, the reduced graph consists of 16 edges and 10 vertices corresponding to descriptor centers shown in Figure 1.10(a). Descriptor centers involve four heteroatoms (1–4; see numbering in Figure 1.10a), which can take part in donor–acceptor interaction with biomolecules and in the formation of hydrogen bonds, three methyl groups (5–7), which can take part in hydrophobic interaction with biomolecules, two benzene rings (8, 9) and one heterocycle (10), which can take part in π–π and π–cation interactions with biomolecules. Eleven edges in the DCCG labeled with positive numbers indicate the topological distances (counted as the number of bonds) between the atoms included in the corresponding descriptor centers, while the negative labels denote relations between rings within a polycyclic system. Such graphs are very useful not only as a source of biology-oriented fragment descriptors but also for pharmacophore based virtual screening.
The atom pairs proposed by Carhart et al.154 are rather similar to the SSFN descriptors. They can be considered as two-vertex connected fragments of reduced graphs, in which edges correspond to paths between certain atoms. Modifications introduced to the atom-pair descriptors by Kearsley et al.96 through encoding physicochemical properties of atoms render these fragments even more generic. In 2003 Gillet, Willett and Bradshaw (GWB) introduced another type of reduced graph and proved its high efficiency in similarity searching.224 A GWB reduced graph consisting of six vertices and five edges is shown in Figure 1.11. Its three vertices R correspond to rings, its two vertices L to linkers, while the vertex F corresponds to a feature – an oxygen atom in this case, which can form hydrogen bonds. In contrast to DCCG, the edges of GWB reduced graphs are not labeled and correspond to ordinary chemical bonds.
An important feature of the GWB reduced graphs is a hierarchical organization of vertex labels. For example, the label Arn (non-hydrogen-bonding aromatic cycle) is less general than the label Ar (any aromatic cycle), which, in turn, is less general than R (any ring). Due to this feature, GWB reduced graphs can also be organized hierarchically, and the level of their generalization can be controlled (Figure 1.12). Besides similarity searching, fragment descriptors based on GWB reduced graphs have been applied to derive SAR models using decision trees.225
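The hierarchy of vertex labels lends itself to a simple lookup table. In the sketch below only the chain Arn → Ar → R is taken from the text; the remaining entries and the function names are hypothetical placeholders.

```python
# Child -> parent label. Only Arn -> Ar -> R comes from the text;
# the other entries are hypothetical placeholders.
PARENT = {"Arn": "Ar", "Ard": "Ar", "Ar": "R", "Aln": "Al", "Al": "R"}


def generalize(label, levels=1):
    """Climb `levels` steps up the label hierarchy, stopping at the root."""
    for _ in range(levels):
        if label not in PARENT:
            break
        label = PARENT[label]
    return label


def generalize_graph(vertex_labels, levels=1):
    """Apply the same level of generalization to every vertex of a reduced graph."""
    return [generalize(label, levels) for label in vertex_labels]


graph = ["Arn", "L", "Ar", "F"]
coarse = generalize_graph(graph, levels=2)
```

Two reduced graphs that differ at a specific level (for instance, Arn versus Ar) become identical once both are generalized far enough, which is how the level of generalization controls the fuzziness of a comparison.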
1.3.6 Labeling Atoms
In some cases selected atoms in molecules can be marked with special labels indicating their particular role in a modeled property. Some examples are (i) local properties, such as atomic charges or NMR chemical shifts, which should always be attributed to a given atom(s), (ii) anchor atoms in the given scaffold to which substituents are attached (Figure 1.13), (iii) atoms forming a main chain in polymers and (iv) reaction centers in a set of reactions. Zefirov et al. have applied labeling in QSPR studies of pKa,226,227 NMR chemical shifts and the reaction rate constant for the acid hydrolysis of esters.226,228 Varnek et al.18 labeled hydrogen bond donor and acceptor centers to model free energies and enthalpies of formation of 1 : 1 hydrogen-bonded complexes.
1.4 Application in Virtual Screening and In Silico Design
This section considers the application of fragment descriptors at different stages of virtual screening and in silico design.
1.4.1 Filtering
Filtering is a rule-based approach aimed at fast assessment of the usefulness of molecules in a given context. In drug design, filtering is used to eliminate compounds with unfavorable pharmacodynamic or pharmacokinetic properties as well as toxic compounds. Pharmacodynamics considers the binding of drug-like organic molecules (ligands) to a chosen biological target. Since the efficiency of ligand–target interactions depends on the spatial complementarity of their binding sites, the filtering is usually performed with 3D-pharmacophores, representing “optimal” spatial arrangements of steric and electronic features of ligands.229,230 Pharmacokinetics is mostly concerned with absorption, distribution, metabolism and excretion (ADME) related properties: octanol–water partition coefficients (log P), solubility in water (log S), blood–brain partition coefficient (log BB), partition coefficients between different tissues, skin penetration coefficient, etc.
Fragment descriptors are widely used for early ADME/Tox prediction both explicitly and implicitly. The easiest way to filter large databases is to detect undesirable molecular fragments (structural alerts). Appropriate lists of structural alerts have been published for toxicity,231 mutagenicity,232 and carcinogenicity.233 Klopman et al. were the first to recognize the potential of fragment descriptors for this purpose.66,67,69 Their programs CASE,66 MultiCASE,97,234 as well as more recent MCASE QSAR expert systems,235 proved to be effective tools to assess the mutagenicity67,234,235 and carcinogenicity69,234 of organic compounds. In these programs, sets of biophores (analogs of structural alerts) were identified and used for activity predictions. Several more sophisticated fragment-based expert systems of toxicity assessment – DEREK,210 TopKat236 and Rex237 – have been developed. DEREK is a knowledge-based system operating with human-coded or automatically generated238 rules concerning toxicophores. Fragments in the DEREK knowledge base are defined by means of the linear notation language PATRAN, which codes the information about atoms, bonds and stereochemistry. TopKat uses a large predefined set of fragment descriptors, whereas Rex implements a special kind of atom-pair descriptors (links). For more information about fragment-based computational assessment of toxicity, including mutagenicity and carcinogenicity, see ref. 239 and references therein.
The most popular filter used in the drug design area is the Lipinski “rule of five”,240 which takes into account the molecular weight, the number of hydrogen bond donors and acceptors, along with the octanol–water partition coefficient log P, to assess the bioavailability of oral drugs. Similar rules of “drug-likeness” or “lead-likeness” were later proposed by Oprea,241 Veber242 and Hann.243 Formally, fragment descriptors are not explicitly involved there. However, most computational approaches that assess log P are fragment-based,244–246 whereas H-donor and H-acceptor sites are the simplest molecular fragments.
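As an illustration of such a rule-based filter, the sketch below counts rule-of-five violations using the commonly quoted thresholds (molecular weight above 500, log P above 5, more than 5 hydrogen bond donors, more than 10 acceptors). The property values are assumed to have been computed beforehand; the class name, the tolerance of one violation and the example numbers are illustrative choices.

```python
from dataclasses import dataclass


@dataclass
class Properties:
    mol_weight: float   # g/mol
    log_p: float        # octanol-water partition coefficient
    h_donors: int
    h_acceptors: int


def rule_of_five_violations(p):
    """Count how many of the four rule-of-five criteria are violated."""
    return sum([
        p.mol_weight > 500,
        p.log_p > 5,
        p.h_donors > 5,
        p.h_acceptors > 10,
    ])


def passes_filter(p, max_violations=1):
    """Accept a compound if it violates at most `max_violations` criteria."""
    return rule_of_five_violations(p) <= max_violations


# Illustrative property values, not measured data.
small = Properties(mol_weight=180.2, log_p=1.2, h_donors=1, h_acceptors=4)
large = Properties(mol_weight=812.0, log_p=6.3, h_donors=7, h_acceptors=12)
```

In practice the log P value and the donor/acceptor counts fed into such a filter are themselves obtained from fragment-based calculations, which is the implicit involvement of fragment descriptors mentioned above.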
1.4.2 Similarity Search
The notion of molecular similarity (or chemical similarity) is one of the most useful and at the same time one of the most contradictory concepts in chemoinformatics.247,248 The concept of molecular similarity plays an important role in many modern approaches to predicting the properties of chemical compounds, designing chemicals with a predefined set of properties and, especially, in conducting drug design studies by screening large databases containing structures of available (or potentially available) chemicals. These studies are based on the similar property principle of Johnson and Maggiora, which states: similar compounds have similar properties.247 The similarity-based virtual screening assumes that all compounds in a database that are similar to a query compound have similar biological activity. Although this hypothesis is not always valid (see discussion in ref. 249), quite often the set of retrieved compounds is considerably enriched with actives.250
To achieve high efficiency in similarity-based screening of databases containing millions of compounds, molecular structures are usually represented by screens (structural keys) or by fixed-size or variable-size fingerprints. Screens and fingerprints can contain both 2D- and 3D-information. However, the 2D-fingerprints, which are a kind of binary fragment descriptors, dominate in this area. Fragment-based structural keys, like MDL keys,62 are sufficiently good for handling small and medium-sized chemical databases, whereas processing of large databases is performed with fingerprints having much higher information density. Fragment-based Daylight,251 BCI,252 and UNITY 2D253 fingerprints are the best known examples.
The most popular similarity measure for comparing chemical structures represented by means of fingerprints is the Tanimoto (or Jaccard) coefficient T.254 Two structures are usually considered similar if T > 0.85 (ref. 250; for Daylight fingerprints251 ). Using this threshold, Taylor estimated the probability of retrieving actives as 0.012–0.50 (ref. 255), whereas according to Delaney this probability is even higher, i.e., 0.40–0.60 (ref. 256) (using Daylight fingerprints251 ). These computer experiments confirm the usefulness of the similarity approach as an instrument of virtual screening.
Schneider et al. have developed a special technique for performing virtual screening referred to as Chemically Advanced Template Search (CATS).257 Within its framework, chemical structures are described by means of so-called correlation vectors, each component of which is equal to the occurrence of a given atom pair divided by the total number of non-hydrogen atoms in the molecule. Each atom in the atom pair is specified as belonging to one of five classes (hydrogen-bond donor, hydrogen-bond acceptor, positively charged, negatively charged, and lipophilic), while topological distances of up to ten bonds are also considered in the atom-pair specification. In ref. 257, the similarity is assessed by the Euclidean distance between the corresponding correlation vectors. CATS has been shown to outperform the MERLIN program with Daylight fingerprints251 for retrieving thrombin inhibitors in a virtual screening experiment.257
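For binary fingerprints the Tanimoto coefficient is T = c / (a + b − c), where a and b are the numbers of bits set in the two fingerprints and c is the number of bits they share. A minimal sketch, with fingerprints represented as sets of “on” bit positions and made-up bit values:

```python
def tanimoto(a, b):
    """Tanimoto coefficient of two fingerprints given as sets of 'on' bit positions."""
    if not a and not b:
        return 1.0  # convention chosen here for two empty fingerprints
    common = len(a & b)
    return common / (len(a) + len(b) - common)


query = {1, 4, 7, 9, 12, 15}
candidate = {1, 4, 7, 9, 12, 20}

score = tanimoto(query, candidate)   # 5 shared bits out of 7 distinct bits
is_similar = score > 0.85            # threshold quoted in the text
```

Here the two fingerprints differ in a single bit each, yet T = 5/7 ≈ 0.71 falls below the 0.85 threshold, which shows how demanding that cut-off is for short fingerprints.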
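The idea of a correlation vector can be sketched as follows for a hydrogen-suppressed graph. This is a simplified illustration rather than the CATS implementation: each atom is given a single hand-assigned class, pairs are counted once per unordered atom pair, and the toy molecule and class letters are assumptions.

```python
import math
from collections import Counter, deque


def topological_distances(adjacency, start):
    """Breadth-first search: number of bonds from `start` to every reachable atom."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        atom = queue.popleft()
        for neighbour in adjacency[atom]:
            if neighbour not in dist:
                dist[neighbour] = dist[atom] + 1
                queue.append(neighbour)
    return dist


def correlation_vector(adjacency, atom_class, max_distance=10):
    """Count typed atom pairs per topological distance, scaled by the heavy-atom count."""
    counts = Counter()
    atoms = sorted(adjacency)
    for i in atoms:
        dist = topological_distances(adjacency, i)
        for j in atoms:
            if j <= i or atom_class.get(i) is None or atom_class.get(j) is None:
                continue
            d = dist.get(j)
            if d is not None and d <= max_distance:
                pair = tuple(sorted((atom_class[i], atom_class[j])))
                counts[(pair, d)] += 1
    return {key: n / len(atoms) for key, n in counts.items()}


def euclidean(u, v):
    """Euclidean distance between two sparse vectors stored as dictionaries."""
    keys = set(u) | set(v)
    return math.sqrt(sum((u.get(k, 0.0) - v.get(k, 0.0)) ** 2 for k in keys))


# Toy hydrogen-suppressed chain N-C-C-O with hand-assigned classes
# (D = donor, A = acceptor, L = lipophilic); the assignment is illustrative only.
adjacency = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
atom_class = {0: "D", 1: "L", 2: "L", 3: "A"}
vector = correlation_vector(adjacency, atom_class)
```

Storing the vector as a dictionary keeps only the non-zero components; a database search would rank molecules by increasing `euclidean` distance to the query vector.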
Hull et al. have developed the Latent Semantic Structure Indexing (LaSSI) approach to perform similarity search in low-dimensional chemical space.258,259 To reduce the dimension of the initial chemical space, the singular value decomposition method is applied to the descriptor-molecule matrix. Ranking molecules by similarity to a query molecule was performed in the reduced space using the cosine similarity measure,260 whereas Carhart's atom pairs154 and Nilakantan's topological torsions95 were used as descriptors. The authors claim that this approach “has several advantages over analogous ranking in the original descriptor space: matching latent structures is more robust than matching discrete descriptors, choosing the number of singular values provides a rational way to vary the ‘fuzziness’ of the search”.258
The issue of “fuzzification” of similarity search has been addressed by Horvath et al.155–157 The first fuzzy similarity metric suggested155 relies on partial similarity scores calculated with respect to the inter-atomic distance distributions for each pharmacophore pair. In this case the “fuzziness” enables comparison of pairs of pharmacophores with different topological or 3D distances. Similar results156 were achieved using a fuzzy and weighted modified Dice similarity metric.260 Fuzzy pharmacophore triplets (FPT, described elsewhere in this chapter) can be gradually mapped onto related basis triplets, thus minimizing binary classification artifacts.157 In a new similarity scoring index introduced in ref. 157, the simultaneous absence of a pharmacophore triplet in two molecules is taken into account. However, this is a less-constraining indicator of similarity than the simultaneous presence of triplets.
Most similarity search approaches require only a single reference structure. However, in practice several lead compounds are often available. This motivated Hert et al.261 to develop the data fusion method, which allows one to screen a database using all available reference structures. Then, the similarity scores are combined for all retrieved structures using selected fusion rules. Searches conducted on the MDL Drug Data Report database using fragment-based UNITY 2D,253 BCI,252 and Daylight251 fingerprints have proved the effectiveness of this approach.
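One simple fusion rule is to score every database compound by its highest similarity to any of the reference structures (a MAX rule). The sketch below assumes this rule and set-based fingerprints with made-up bits; other fusion rules would only change the aggregation step.

```python
def tanimoto(a, b):
    """Tanimoto coefficient of two fingerprints given as sets of 'on' bits."""
    common = len(a & b)
    union = len(a) + len(b) - common
    return common / union if union else 1.0


def fused_ranking(references, database):
    """Rank database entries by their best similarity to any reference (MAX rule)."""
    scores = {
        name: max(tanimoto(fp, ref) for ref in references)
        for name, fp in database.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# All fingerprints are made up for illustration.
references = [{1, 2, 3, 4}, {10, 11, 12}]
database = {
    "mol_a": {1, 2, 3, 5},
    "mol_b": {10, 11, 12},
    "mol_c": {20, 21},
}
ranking = fused_ranking(references, database)
```

With two chemically different references, both `mol_a` and `mol_b` are retrieved near the top, whereas a search with either reference alone would score one of them as dissimilar.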
The main drawback of the conventional similarity search concerns an inability to use experimental information on biological activity to adjust similarity measures. This results in an inability to discriminate relevant and non-relevant fragment descriptors used for computing similarity measures. To tackle this problem, Cramer et al. 42 developed substructural analysis, in which each fragment (represented as a bit in a fingerprint) is weighted by taking into account its occurrence in active and in inactive compounds. Subsequently, many similar approaches have been described in the literature.262
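A minimal sketch of fragment weighting is shown below. It assumes one simple weighting scheme, the fraction of actives among the training molecules that contain a given bit; this particular formula, the function names and the training data are illustrative assumptions, and many alternative weights have been proposed.

```python
def fragment_weights(fingerprints, is_active):
    """Weight each bit by the fraction of actives among molecules that contain it."""
    with_bit, active_with_bit = {}, {}
    for fp, active in zip(fingerprints, is_active):
        for bit in fp:
            with_bit[bit] = with_bit.get(bit, 0) + 1
            active_with_bit[bit] = active_with_bit.get(bit, 0) + int(active)
    return {bit: active_with_bit[bit] / with_bit[bit] for bit in with_bit}


def score(fp, weights):
    """Sum the weights of the bits present; unseen bits contribute nothing."""
    return sum(weights.get(bit, 0.0) for bit in fp)


# Made-up training data: bit 1 occurs only in actives, bit 3 only in inactives.
training = [{1, 2}, {1, 4}, {2, 3}, {3, 4}]
labels = [True, True, False, False]
weights = fragment_weights(training, labels)
```

Ranking a database by `score` favours molecules carrying fragments that were enriched among known actives, which is exactly the activity information a plain similarity measure ignores.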
One more way to conduct a similarity-based virtual screening is to retrieve the structures containing a user-defined set of “pharmacophoric” features. In the Dynamic Mapping of Consensus positions (DMC) algorithm263 those features are selected by finding common positions in bit strings for all active compounds. The potency-scaled DMC algorithm (POT-DMC)264 is a modification of DMC in which compound activities are taken into account. The latter two methods may be considered as intermediate between conventional similarity search and probabilistic SAR approaches.
Batista, Godden and Bajorath have developed the MolBlaster method,208 in which molecular similarity is assessed by Differential Shannon Entropy265 computed from populations of randomly generated fragments. For the range 0.64 < T < 0.99, this similarity measure provides the same ranking as the Tanimoto index T. However, for smaller values of T the entropy-based index is more sensitive, since it distinguishes between pairs of molecules having almost identical T. To adapt this methodology for large-scale virtual screening, Proportional Shannon Entropy (PSE) metrics were introduced.209 A key feature of this approach is that class-specific PSE of random fragment distributions enables the identification of molecules that share a significant number of signature substructures with known active compounds.
Similarity search methods developed for individual compounds are difficult to apply directly to chemical reactions, which involve many species subdivided into two types: reactants and products. To overcome this problem, Varnek et al.18 suggested condensing all participating reaction species into one molecular graph [Condensed Graphs of Reactions (CGR),18 see Section 1.3.2], followed by its fragmentation and application of the resulting fingerprints in “classical” similarity search. Besides conventional chemical bonds (single, double, aromatic, etc.), a CGR contains dynamical bonds corresponding to created, broken or transformed bonds. This approach could be efficiently used for screening of large reaction databases.
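The entropy of a fragment population can be computed as SE = −Σ p_i log2 p_i over the relative frequencies p_i of distinct fragments. The sketch below additionally assumes that the differential entropy of two populations is the entropy of the pooled population minus the mean of the individual entropies; this form, like the hand-written fragment lists, is stated as an assumption and should be checked against ref. 265.

```python
import math
from collections import Counter


def shannon_entropy(fragments):
    """SE = -sum(p_i * log2(p_i)) over the relative frequencies of distinct fragments."""
    counts = Counter(fragments)
    total = sum(counts.values())
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def differential_entropy(frags_a, frags_b):
    """Entropy of the pooled population minus the mean of the individual entropies."""
    pooled = shannon_entropy(list(frags_a) + list(frags_b))
    return pooled - (shannon_entropy(frags_a) + shannon_entropy(frags_b)) / 2


# Hand-written fragment populations; in practice they would come
# from random fragmentation of each molecule.
mol_a = ["CC", "CO", "CC", "CN"]
mol_b = ["CC", "CO", "CC", "CN"]
mol_c = ["c1ccccc1", "cCl", "cc", "cCl"]
```

Identical fragment populations give a differential entropy of zero, whereas populations with no fragment in common, such as `mol_a` and `mol_c`, give a clearly positive value.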
1.4.3 SAR Classification (Probabilistic) Models
Simplistic and heuristic similarity-based approaches can hardly produce predictive models as good as those obtained with modern statistical and machine learning methods, which are able to assess biological or physicochemical properties quantitatively. QSAR-based virtual screening consists of direct assessment of activity values (numerical or binary) of all compounds in the database, followed by selection of hits possessing the desired activity. Mathematical methods used for model building can be subdivided into classification and regression approaches. The former decide whether a given compound is active, whereas the latter numerically evaluate the activity values. Classification approaches that assess the probability of decisions are called probabilistic.
Various classification approaches have been reported to be used successfully in conjunction with fragment descriptors for building classification SAR models: the Linear Discriminant Analysis (LDA),266,267 the Partial Least Square Discriminant Analysis (PLS-DA),268 Soft Independent Modeling by Class Analogy (SIMCA),269 Artificial Neural Networks (ANN),270 Support Vector Machines (SVM),271 Decision Trees (DT), 269,272,273 Spline Fitting with Genetic Algorithm (SFGA),269 etc. Probabilistic methods usually used with fragment descriptors are: Naïve Bayes (NB)142 and its modification implemented in PASS,126 Binary Kernel Discrimination,6 Inductive Logic Programming (ILP),274 Support Vector Inductive Logic Programming (SVILP),133 etc.
Numerous studies have been devoted to classification (probabilistic) approaches used in conjunction with fragment descriptors for virtual screening. Here we present several examples.
Harper et al. 6 have demonstrated that the probabilistic “binary kernel discrimination” method performs much better in screening large databases than backpropagation neural networks or conventional similarity search. Carhart's atom pairs154 and Nilakantan's topological torsions95 were used as descriptors.
Aiming to discover new cognition enhancers, Geronikaki et al.275 applied the PASS program,126 which implements a probabilistic Bayesian-based approach, and the DEREK rule-based system210 to screen a database of highly diverse chemical compounds. Eight compounds with the highest probability of cognition-enhancing effect were selected. Experimental tests showed that all of them possess a pronounced antiamnesic effect.
Bender, Glen et al. have applied129–133 several probabilistic machine learning methods (naïve Bayesian classifier, inductive logic programming, and support vector inductive logic programming) in conjunction with circular fingerprints to classify bioactive chemical compounds and to perform virtual screening on several biological targets. The last of these three methods (i.e., support vector inductive logic programming) performed significantly better than the other two methods.133 The advantages of using circular fingerprints were pointed out.131
1.4.4 QSAR/QSPR Regression Models
The Multiple Linear Regression (MLR) method was historically the first and is to date the most popular method used to develop QSAR/QSPR models with fragment descriptors (Figure 1.14). Linear models involving fragments are built in several program packages: CASE,66–69 MULTICASE,97,98 TRAIL,101,102 ISIDA,18 EMMA,276 QSAR Builder from Pharma Algorithms277 and some others. The Partial Least Squares (PLS) regression,278,279 an alternative technique for building linear quantitative models, has also been successfully coupled with fragment descriptors.63,128,280–282 This approach is efficiently used in the Holographic QSAR (HQSAR)63 (implemented in the Sybyl software253 ) and in the “Generalized Fragment-Substructure Based Property Prediction Method”.282 The success of PLS with fragment descriptors is explained by its efficient handling of multicollinearity, which is a typical problem of fragment descriptors. Two other methods, the Group Method of Data Handling (GMDH)283 and the more recent Maximal Margin Linear Programming Method (MMLPM),284,285 have also proved efficient in building linear models from an initial pool of highly correlated fragment descriptors.
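Once fitted, a linear fragment model is simply a weighted sum of fragment occurrences, y = a0 + Σ a_i n_i. The sketch below applies such a model; the fragment names and coefficient values are hypothetical and do not come from any published regression.

```python
def predict(fragment_counts, contributions, intercept=0.0):
    """Linear fragment model: y = a0 + sum_i a_i * n_i.

    Raises KeyError if the molecule contains a fragment without a coefficient.
    """
    return intercept + sum(
        contributions[name] * count for name, count in fragment_counts.items()
    )


# Hypothetical coefficients, as they might come out of a regression fit.
contributions = {"CH3": 0.5, "CH2": 0.4, "OH": -1.2}

ethanol = {"CH3": 1, "CH2": 1, "OH": 1}
butanol = {"CH3": 1, "CH2": 3, "OH": 1}

y_ethanol = predict(ethanol, contributions)
y_butanol = predict(butanol, contributions)
```

Each coefficient can be read directly as the contribution of one fragment occurrence to the property, which is the interpretability advantage of linear fragment models. The deliberate `KeyError` for an unknown fragment anticipates the “missing fragments” problem discussed in Section 1.5.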
Among nonlinear regression methods used in conjunction with fragment descriptors, the Back-Propagation Neural Networks (BPNN)286–289 occupy a special place. It has been proved7,8 that any molecular graph invariant can be approximated by an output of a BPNN using fragment descriptors as an input. Indeed, numerous studies have shown that the BPNN models based on fragment descriptors efficiently predict various physicochemical properties16,290–294 and some biological activities16,163,295 of organic compounds. A popular ASNN (Associative Neural Networks) approach consists of an ensemble of BPNN coupled with kNN correction in the space of models.296 This technique, together with fragment descriptors, has been successfully used to model the thermodynamic parameters of metal complexation285 and melting point of ionic liquids.297 Besides, the Radial Basis Function Neural Networks298 (RBFNNs) have also been used with fragment descriptors for predicting the properties of organic compounds.285,299 The Support Vector Regression (SVR) technique300–303 is a serious “competitor” of neural networks, as has been demonstrated in QSAR/QSPR studies285,304 involving fragment descriptors.
In drug design, regression QSAR/QSPR models are often used to assess ADME/Tox properties or to detect “hit” molecules capable of binding a certain biological target. Thus, one could mention fragment-based QSAR models for blood–brain barrier permeation,305 skin permeation rate,306 blood–air307 and tissue–air partition coefficients.307 Many theoretical approaches to calculating the octanol–water partition coefficient log P involve fragment descriptors. In particular, this concerns the methods by Rekker,308,309 Leo and Hansch (CLOGP),245,310 Ghose-Crippen (ALOGP),81–83 Wildman and Crippen,86 Suzuki and Kudo (CHEMICALC-2),87 Convard (SMILOGP)88 and by Wang (XLOGP).89,90 Fragment-based predictive models for estimation of solubility in water311 and DMSO311 are also available.
Benchmarking studies on various biological and physicochemical properties305–307,312 show that QSAR/QSPR models involving fragment descriptors in many cases outperform those built on topological, quantum, electrostatic and other types of descriptors.
1.4.5 In Silico Design
In this section we consider several examples of virtual screening performed on a database containing only virtual (not yet synthesized or unavailable) compounds. Virtual libraries are usually generated using combinatorial chemistry approaches.313–315 One of the simplest ways is to attach systematically user-defined substituents R1, R2,…, RN to a given scaffold. If the list for the substituent Ri contains ni candidates, the total number of generated structures is:
$$n_{\mathrm{total}} = \prod_{i=1}^{N} n_i$$
although taking symmetry into account could reduce the library's size. The number of candidates ni for each substituent Ri should be carefully selected to avoid generating too large a set of structures (combinatorial explosion). The “optimal” substituents could be prepared using fragments selected at the QSAR stage, since their contributions to activity (for linear models) allow one to estimate the impact of combining the fragments into larger species (Ri). In such a way, a focused combinatorial library could be generated.
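The size of such a library is the product of the lengths of the substituent lists, which can be checked by explicit enumeration. In the sketch below the scaffold name and the substituent lists are hypothetical, and symmetry is ignored.

```python
import math
from itertools import product


def library_size(substituent_lists):
    """Number of decorated scaffolds: the product of the list lengths n_i."""
    return math.prod(len(candidates) for candidates in substituent_lists)


def enumerate_library(scaffold, substituent_lists):
    """Yield (scaffold, R1, ..., RN) tuples for every combination of substituents."""
    for combo in product(*substituent_lists):
        yield (scaffold, *combo)


# Hypothetical scaffold name and substituent lists.
r_groups = [["H", "CH3", "Cl"], ["OH", "NH2"], ["H", "F", "Br", "I"]]
library = list(enumerate_library("scaffold-1", r_groups))
```

Three attachment points with 3, 2 and 4 candidates already give 24 structures; adding a fourth point with ten candidates would give 240, which shows how quickly the combinatorial explosion sets in. Calling `library_size` before enumerating is a cheap way to guard against it.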
The technology based on combining QSAR, generation of virtual libraries and screening stages has been implemented in the ISIDA program and applied to computer-aided design of new uranyl binders belonging to two different families of organic molecules: phosphoryl-containing podands316 and monoamides.317 QSAR models have been developed using different machine-learning methods (multi-linear regression analysis, associative neural networks296 and support vector machines301 ) and fragment descriptors (atom/bond sequences and augmented atoms). These models were then used to screen virtual combinatorial libraries containing up to 11 000 compounds. Selected hits were synthesized and tested experimentally. Predicted uranyl binding affinity was shown to agree well with the experimental data. Thus, initial data sets were significantly enriched with new efficient uranyl binders, and one of the new molecules was found to be more efficient than previously studied compounds. A similar study was conducted for the development of new 1-[(2-hydroxyethoxy)methyl]-6-(phenylthio)thymine (HEPT) derivatives potentially possessing high anti-HIV activity.318 This demonstrates the universality of fragment descriptors and the broad perspectives of their use in virtual screening and in silico design.
1.5 Limitations of Fragment Descriptors
Despite the many advantages of fragment descriptors they are not devoid of certain drawbacks, which deserve serious attention. Two main problems should be mentioned: (i) “missing fragments”;319 and (ii) modeling of stereochemically dependent properties.
The term “missing fragments” concerns comparison of the lists of fragments generated for the training and test sets. A test set molecule may contain fragments that, on the one hand, belong to the same family of descriptors used for the modeling and, on the other hand, are different from those in the initial pool calculated for the training set. The question arises whether the model built from that initial pool can be applied to those test set molecules. This is a difficult problem because a priori it is not clear whether the “missing fragments” are important for the property being predicted. Several possible strategies to treat this problem have been reported. The ALOGPS program,320 which predicts lipophilicity and aqueous solubility of chemical compounds, flags calculations as unreliable if the analyzed molecule contains one or more E-state atom or bond types missing from the training set. In such a way, the program detects about 90% of large prediction errors.319 The ISIDA program18 calculates a consensus model as an average over the “best” models developed with different sets of fragment descriptors. Each model corresponds to its “own” initial pool of descriptors. If a new molecule contains fragments different from those in that pool, the corresponding model is ignored. As demonstrated by benchmarking studies,285 this improves the predictive performance of the method. For each model, the NASAWIN software99 creates a list of “important” fragments including cycles and all one-atom fragments. The test molecule is rejected if its list of “important” fragments contains those absent in the training set.321 The LOGP program for lipophilicity predictions322 uses a set of empirical rules to calculate the contribution of missing fragments.
The second problem of using fragment descriptors deals with accounting for stereochemical information. In fact, its adequate treatment is not possible at the graph-theoretical level and requires explicit consideration of hypergraphs.323 However, in practice, it is sufficient to introduce special labels indicating the stereochemical configuration of chiral centers or of (E/Z)-isomers around a double bond, and then to use them in the specification of molecular fragments. Such an approach has been used in hologram fragment descriptors324 as well as in the PATRAN language.238
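The strategy of ignoring models whose descriptor pool does not cover a new molecule can be sketched as follows. This is a simplified illustration of the idea, not the ISIDA code: each model is represented as a pair of a training fragment pool and a prediction function, and the pools and constant predictions are hypothetical.

```python
def missing_fragments(molecule_fragments, training_pool):
    """Fragments of the query molecule that never occurred in the training pool."""
    return set(molecule_fragments) - set(training_pool)


def consensus_prediction(molecule_fragments, models):
    """Average the predictions of the models whose descriptor pool covers the molecule.

    Each model is a (training_pool, predict_function) pair. Returns None
    when every model has to be ignored.
    """
    values = [
        predict(molecule_fragments)
        for pool, predict in models
        if not missing_fragments(molecule_fragments, pool)
    ]
    return sum(values) / len(values) if values else None


# Toy models with hypothetical pools and constant predictions.
models = [
    ({"C-C", "C-O", "C=O"}, lambda frags: 2.0),
    ({"C-C", "C-O"}, lambda frags: 4.0),
    ({"C-C", "C-N"}, lambda frags: 9.0),
]

covered = consensus_prediction({"C-C", "C-O"}, models)
uncovered = consensus_prediction({"C-S"}, models)
```

For the first query the third model is skipped because its pool lacks the C-O fragment, so the consensus is the mean of the remaining two predictions. For the second query no model applies and the function returns `None`, which corresponds to flagging the molecule as outside the applicability domain.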
Fragment descriptors constitute one of the most universal types of molecular descriptors. The scope of their application encompasses almost all existing areas of SAR/QSAR/QSPR studies. Their universality stems from the basic character of structural theory in chemistry as well as from the fundamental possibility of expressing molecular graph invariants in terms of subgraph occurrence numbers.8 The main advantages of fragment descriptors lie in the simplicity of their computation, the ease of their interpretation and the efficiency of their application in similarity searches and SAR/QSAR/QSPR modeling. Further progress in their use in virtual screening is likely to come from the development of new types of fragments and of new mathematical approaches to their processing.
The authors thank GDRE SupraChem and the ARCUS “Alsace–Russia/Ukraine” project for support and also Dr V. Solov’ev for fruitful discussions.