Chemoinformatics1–5  is an emerging science that concerns the mixing of chemical information resources to transform data into information, and information into knowledge. It is a branch of theoretical chemistry based on its own molecular model, with its own basic concepts, learning approaches and areas of application. Unlike quantum chemistry, which considers molecules as ensembles of electrons and nuclei, or force field molecular mechanics and dynamics simulations, which are based on a classical molecular model (“atoms” and “bonds”), chemoinformatics represents molecules as objects in a chemical space defined by molecular descriptors. Among thousands of descriptors, fragment descriptors occupy a special place. Fragment descriptors represent selected subgraphs of a 2D molecular graph; structure–property approaches use either their occurrences in molecules or binary values (0, 1) that indicate their presence or absence in the given graph.

The unique properties of fragment descriptors are related to the fact that (i) any molecular graph invariant (i.e., any molecular descriptor or property) can be uniquely represented as a linear combination of fragment descriptors;7–9  (ii) any symmetric similarity measure can be uniquely expressed in terms of fragment descriptors;10,11  and (iii) any regression or classification structure–property model can be represented as a linear equation involving fragment descriptors.12,13 

An important advantage of fragment descriptors is related to the simplicity of their calculation, storage and interpretation (see review articles14–18 ). They belong to information-based descriptors,19  which tend to code the information stored in molecular structures. This contrasts with knowledge-based (or semi-empirical) descriptors derived from consideration of the mechanism of action. Owing to their versatility, fragment descriptors can efficiently be used to build structure–property models, perform similarity search, virtual screening and in silico design of chemical compounds with desired properties.

This chapter reviews fragment descriptors with respect to their use in structure–property studies, similarity search and virtual screening. After a short historical survey, different types of fragment descriptors are considered thoroughly. This is followed by a brief review of the application of fragment descriptors in virtual screening, focusing mostly on filtering, similarity search and direct activity/property assessment using quantitative structure–property models.

Among a multitude of descriptors currently used in Structure–Activity Relationships/Quantitative Structure–Activity Relationships/Quantitative Structure–Property Relationships (SAR/QSAR/QSPR) studies,20  fragment descriptors occupy a special place. Their application as atom and bond increments in the framework of additive schemes can be traced back to the 1930–1950s; Vogel,21  Zahn,22  Souders,23,24  Franklin,25,26  Tatevskii,27,28  Bernstein,29  Laidler,30  Benson and Buss31  and Allen32  pioneered this field. Smolenskii was one of the first, in 1964, to apply graph theory to the problem of predicting the physicochemical properties of organic compounds.33  Later on, these first additive-scheme approaches gradually evolved into group contribution methods. The latter are closely linked with thermodynamic approaches and, therefore, they are applicable only to a limited number of properties.

The epoch of QSAR (Quantitative Structure–Activity Relationships) studies began in 1963–1964 with two seminal approaches: the σ-ρ-π analysis of Hansch and Fujita34,35  and the Free–Wilson method.36  The former approach involves three types of descriptors related to electronic, steric and hydrophobic characteristics of substituents, whereas the latter considers the substituents themselves as descriptors. Both approaches are confined to strictly congeneric series of compounds. The Free–Wilson method additionally requires all types of substituents to be sufficiently present in the training set. A combination of these two approaches has led to QSAR models involving indicator variables, which indicate the presence of some structural fragments in molecules.

The non-quantitative SAR (Structure–Activity Relationships) models developed in the 1970s by Hiller,37,38  Golender and Rosenblit,39,40  Piruzyan, Avidon et al.,41  Cramer,42  Brugger, Stuper and Jurs,43,44  and Hodes et al.45  were inspired by the then-popular artificial intelligence, expert systems, machine learning and pattern recognition paradigms. In those approaches, chemical structures were described by means of indicators of the presence of structural fragments interpreted as topological (or 2D) pharmacophores (biophores, toxophores, etc.) or topological pharmacophobes (biophobes, toxophobes, etc.). Chemical compounds were then classified as active or inactive with respect to certain types of biological activity.

Methodologies based on fragment descriptors in QSAR/QSPR studies are not strictly confined to particular types of properties or compounds. In the 1970s Adamson and coworkers46,47  were the first to apply fragment descriptors in multiple linear regression analysis to find correlations with some biological activities,48,49  physicochemical properties,50  and reactivity.51 

An important class of fragment descriptors, the so-called screens (or structural keys, fingerprints), was also developed in the 1970s.52–56  As a rule, they are bit strings that can be stored and processed efficiently by computers. Although their primary role is to provide efficient substructure searching in large chemical structure databases, they can also be used efficiently for similarity searching,57,58  clustering large chemical databases,59,60  assessing their diversity,61  as well as for SAR62  and QSAR63  modeling.

Another important contribution was made in 1980 by Cramer who invented BC(DEF) parameters obtained by means of factor analysis of the physical properties of 114 organic liquids. These parameters correlate strongly with various physical properties of diverse liquid organic compounds.64  On the other hand, they could be estimated by linear additive-constitutive models involving fragment descriptors.65  Thus, a set of QSPR models encompassing numerous physical properties of diverse organic compounds has been developed using only fragment descriptors.

One of the most important developments of the 1980s was the CASE (Computer-Automated Structure Evaluation) program by Klopman et al.66–69  This “self-learning artificial intelligent system”69  can recognize activating and deactivating fragments (biophores and biophobes) with respect to the given biological activity and use this information to determine the probability that a test chemical is active. This methodology has been successfully applied to predict various types of biological activity: mutagenicity,67,70,71  carcinogenicity,66,69,71–73  hallucinogenic activity,74  anticonvulsant activity,75  inhibitory activity with respect to sparteine monooxygenase,76  β-adrenergic activity,77  μ-receptor binding (opiate) activity,78  antibacterial activity,79  antileukemic activity,80 etc. Using the multivariate regression technique, CASE can also build quantitative models involving fragment descriptors.72,77 

Starting in the early 1990s, various approaches and related software tools based on fragment descriptors have been developed and are listed in several conceptual and mini-review papers.14–18  Because of the wide scope and large variety of different approaches and applications in this field, many important ideas were reinvented many times and continue to be reinvented. In this review we try to present a clear state-of-the-art picture in this area.

In this section different types of fragments are classified with respect to their topology and the level of abstraction of molecular graphs.

A tremendous number of various fragments are used in structure–property studies: atoms, bonds, “topological torsions”, chains, cycles, atom- and bond-centered fragments, maximum common substructures, line notation (WLN and SMILES) fragments, atom pairs and topological multiplets, substituents and molecular frameworks, basic subgraphs, etc. Their detailed description is given below.

Depending on the application area, two types of values taken by fragment descriptors are considered: binary and integer. Binary values indicate the presence (true, yes, 1) or the absence (false, no, 0) of a given fragment in a structure. They are usually used as screens and elements of fingerprints for chemical database management and virtual screening using similarity-based approaches as well as in SAR studies. Integer values corresponding to the occurrences of fragments in structures are used in QSAR/QSPR modeling.
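The distinction between the two value types can be illustrated with a minimal Python sketch. The fragment labels, the vocabulary and the example molecule (ethanol, heavy atoms only) are assumptions chosen for illustration; they do not correspond to any particular fragmentation scheme cited here.

```python
from collections import Counter

# Hypothetical list of fragments found in one molecule (ethanol, heavy atoms
# and bonds only); the labels are illustrative, not a standard notation.
fragments_found = ["C", "C", "O", "C-C", "C-O"]

# Fixed, ordered fragment vocabulary that defines the descriptor vector.
vocabulary = ["C", "N", "O", "C-C", "C-O", "C-N"]


def count_vector(found, vocab):
    """Integer descriptors: occurrence number of each fragment (QSAR/QSPR use)."""
    counts = Counter(found)
    return [counts[f] for f in vocab]


def binary_vector(found, vocab):
    """Binary descriptors: 1 if the fragment is present, 0 otherwise (screens)."""
    present = set(found)
    return [1 if f in present else 0 for f in vocab]


counts = count_vector(fragments_found, vocabulary)
bits = binary_vector(fragments_found, vocabulary)
print(counts)  # [2, 0, 1, 1, 1, 0]
print(bits)    # [1, 0, 1, 1, 1, 0]
```

The binary vector is simply the count vector thresholded at one occurrence, which is why the same fragment vocabulary can serve both database screening and regression modeling.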

Disconnected atoms represent the simplest type of fragments. They are used to assess a chemical or biological property P in the framework of an additive scheme based on atomic contributions:

Equation 1.1: $P = \sum_{i} n_i A_i$

where ni is the number of atoms of type i and Ai is the corresponding atomic contribution. Usually, the atom types account not only for the type of chemical element but also for hybridization, the number of attached hydrogen atoms (for heavy elements), occurrence in some groups or aromatic systems, etc. Nowadays, atom-based methods are used to predict some physicochemical properties and biological activities. Thus, several works have been devoted to assessing the octanol–water partition coefficient log P: the ALOGP method by Ghose and Crippen,81–83  later modified by Ghose and co-workers,84,85  and by Wildman and Crippen,86  the CHEMICALC-2 method by Suzuki and Kudo,87  the SMILOGP program by Convard and co-authors,88  and the XLOGP method by Wang and co-authors.89,90  Hou and co-authors91  used Equation (1.1) to calculate aqueous solubility. The ability of this approach to assess biological activities was demonstrated by Winkler et al.92 
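As a minimal sketch of Equation (1.1), the following Python snippet sums atomic contributions over atom types. The atom-type names and the contribution values are hypothetical placeholders chosen for illustration; they are not taken from ALOGP, XLOGP or any other published scheme.

```python
# Hypothetical atom-type contributions A_i in arbitrary units. These numbers
# are placeholders and are NOT parameters of any published additive scheme.
contributions = {"C.sp3": 0.5, "O.hydroxyl": -1.0, "H": 0.1}


def additive_property(atom_counts, contributions):
    """Evaluate P = sum_i n_i * A_i over the atom types of one molecule."""
    return sum(n * contributions[atom_type] for atom_type, n in atom_counts.items())


# Ethanol, CH3-CH2-OH: two sp3 carbons, one hydroxyl oxygen, six hydrogens.
ethanol = {"C.sp3": 2, "O.hydroxyl": 1, "H": 6}
p_ethanol = additive_property(ethanol, contributions)
print(round(p_ethanol, 3))  # 0.6
```

In practice the contributions Ai are fitted by regression against experimental values for a training set, with the counts ni serving as the descriptors.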

Chemical bonds are another type of simple fragment. The first bond-based additive schemes, such as those of Zahn,22  Bernstein29,93  and Allen,32,94  appeared almost simultaneously with the atom-based ones and dealt, presumably, with predictions of some thermodynamic properties.

“Topological torsions” invented by Nilakantan et al.95  are defined as a linear sequence of four consecutively bonded non-hydrogen atoms. Each atom is described by the type of the corresponding chemical element, the number of attached non-hydrogen atoms and the number of π-electron pairs. Molecular descriptors indicating the presence or absence of topological torsions in chemical structures have been used to perform qualitative predictions of biological activity in structure–activity (SAR) studies.95  Later on, Kearsley et al.96  recognized that characterizing atoms by element types can be too specific for similarity searching and, therefore, does not provide sufficient flexibility for large-scale virtual screening. To solve this problem, they suggested assigning atoms in Carhart's atom pairs and Nilakantan's topological torsions to one of seven classes: cations, anions, neutral hydrogen bond donors, neutral hydrogen bond acceptors, polar atoms, hydrophobic atoms and other.
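The definition above can be sketched in a few lines of Python. This is an illustrative enumeration under stated assumptions, not the published implementation: the molecule is given as a hydrogen-suppressed graph, atom indices are integers, the number of π-electron pairs is supplied by the caller, and the function name `topological_torsions` is hypothetical.

```python
from collections import Counter


def topological_torsions(elements, bonds, n_pi=None):
    """Count linear sequences of four consecutively bonded heavy atoms.

    elements: {atom index: element symbol} for non-hydrogen atoms
    bonds:    list of (i, j) index pairs
    n_pi:     optional {atom index: number of pi-electron pairs}
    """
    n_pi = n_pi or {}
    neighbors = {a: set() for a in elements}
    for a, b in bonds:
        neighbors[a].add(b)
        neighbors[b].add(a)

    def atom_type(a):
        # (element, number of attached heavy atoms, number of pi-electron pairs)
        return (elements[a], len(neighbors[a]), n_pi.get(a, 0))

    torsions = Counter()
    for a in elements:
        for b in neighbors[a]:
            for c in neighbors[b] - {a}:
                for d in neighbors[c] - {a, b}:
                    if a < d:  # every path is reached from both ends; keep one
                        fwd = tuple(atom_type(x) for x in (a, b, c, d))
                        # A torsion and its reverse are the same fragment.
                        torsions[min(fwd, fwd[::-1])] += 1
    return torsions


# 1-Butanol, heavy atoms only: C0-C1-C2-C3-O4.
elements = {0: "C", 1: "C", 2: "C", 3: "C", 4: "O"}
bonds = [(0, 1), (1, 2), (2, 3), (3, 4)]
tt = topological_torsions(elements, bonds)
for key, n in tt.items():
    print(n, key)
```

For 1-butanol the sketch finds two torsions, C-C-C-C and C-C-C-O, each described by the three-component atom types introduced above.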

The above-mentioned structural fragments – atoms, bonds and topological torsions – can be regarded as chains of different lengths. Smolenskii33  suggested using the occurrences of chains in an additive scheme to predict the formation enthalpy of alkanes. For the last four decades, chain fragments have proved to be one of the most popular and useful types of fragment descriptors in QSPR/QSAR/SAR studies. Fragment descriptors based on enumerating chains in molecular graphs are efficiently used in many popular structure–property and structure–activity programs: CASE66–69  and MULTICASE (MultiCASE, MCASE) by Klopman,97,98  NASAWIN99  by Baskin et al., BIBIGON100  by Kumskov, and TRAIL101,102  and ISIDA18  by Solov’ev and Varnek. “Molecular pathways” by Gakh and co-authors,103  and “molecular walks” by Rücker,104  also represent chains of atoms.

In contrast to chains, cyclic and polycyclic fragments are relatively rarely applied as descriptors in QSAR/QSPR studies. Nevertheless, cyclicity is accounted for implicitly by: (i) introducing special “cyclic” and “aromatic” types of atoms and bonds, (ii) “collapsing” whole cycles and even polycyclic systems into “pharmacophoric” pseudo-atoms and (iii) generating cyclic fragments as a part of large fragments [Maximum Common Substructure (MCS), molecular framework, substituents]. Besides, cyclic fragments are widely used as screens for chemical database processing.105,106 

WLN and SMILES fragments correspond respectively to substrings of the Wiswesser Line Notation107  or Simplified Molecular Input Line Entry System108,109  strings used for encoding the chemical structures. Since simple string operations are much faster than processing of information in connection tables, the use of WLN descriptors was justified in the 1970s when computers were still very slow. At that time Adamson and Bawden published some linear QSAR models based on WLN fragments.48,50,51,110,111  They have also applied this kind of descriptor for hierarchical cluster analysis and automatic classification of chemical structures.112  Qu et al.113,114  have developed AES (Advanced Encoding System), a new WLN-based notation encoding chemical information for group contribution methods. Interest in line notation descriptors has not disappeared completely with the advent of powerful computers. Thus, SMILES fragment descriptors are used in the SMILOGP program to predict log P,88  whereas the recently developed LINGO system for assessing some biophysical properties and intermolecular similarities uses holographic representations of canonical SMILES strings.115 

Atom-Centered Fragments (ACF) consist of a single central atom surrounded by one or several shells of atoms, each shell containing the atoms at the same topological distance from the central one. This type of structural fragment was introduced in the early 1950s by Tatevskii,27,28,116–119  and then by Benson31  to predict some physicochemical properties of organic compounds in the framework of additive schemes.
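A shell-by-shell construction of such a fragment can be sketched with a breadth-first search. The encoding used below (element symbols of each shell, sorted and joined into a string) is a deliberate simplification assumed for illustration: it ignores bond orders and the connectivity inside and between shells, which real ACF implementations take into account. The function name is hypothetical.

```python
def atom_centered_fragment(elements, neighbors, center, radius):
    """Return a string code for the environment of `center` up to `radius`.

    Shell k lists the sorted element symbols of the atoms whose topological
    distance from the central atom is exactly k.
    """
    seen = {center}
    frontier = [center]
    shells = [elements[center]]
    for _ in range(radius):
        next_frontier = []
        for a in frontier:
            for b in neighbors[a]:
                if b not in seen:
                    seen.add(b)
                    next_frontier.append(b)
        shells.append(",".join(sorted(elements[b] for b in next_frontier)))
        frontier = next_frontier
    return "|".join(shells)


# Acetic acid, heavy atoms only: C0 (methyl), C1 (carboxyl), O2 (=O), O3 (OH).
elements = {0: "C", 1: "C", 2: "O", 3: "O"}
neighbors = {0: [1], 1: [0, 2, 3], 2: [1], 3: [1]}

acf_c1_r1 = atom_centered_fragment(elements, neighbors, center=1, radius=1)
acf_c0_r2 = atom_centered_fragment(elements, neighbors, center=0, radius=2)
print(acf_c1_r1)  # C|C,O,O
print(acf_c0_r2)  # C|C|O,O
```

Generating one such code per heavy atom and collecting the codes over the molecule gives a descriptor set analogous in spirit to the radius-based fragments discussed here.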

ACF fragments containing only one shell of atoms around the central one (i.e., atom-centered neighborhoods of radius 1) were introduced into chemoinformatics practice in 1971 under the names “atom-centered fragments” and “augmented atoms” by Adamson,120,121  who studied their distribution in large chemical databases with the intention of using them as screens in chemical database searching. Hodes used, in SAR studies, both “augmented atoms”45  and “ganglia augmented atoms”325  representing ACF fragments with radius 2 and generalized second-shell atoms. Subsequently, ACF fragments with radius 1 were implemented in NASAWIN,122–124  TRAIL101,102,125  and ISIDA18  programs. ACF fragments with arbitrary radius were implemented by Filimonov, Poroikov and co-authors in the PASS126  program under the name Multilevel Neighborhoods of Atoms (MNA),127  by Xing and Glen as “tree structured fingerprints”,128  by Bender and Glen as “atom environments”129,130  and “circular fingerprints”131–133  (Figure 1.1), and by Faulon as “molecular signatures”.134–136 

Figure 1.1

Circular fingerprints with Sybyl mol2 atom typing. An individual fingerprint is calculated for each atom in the molecule, considering those atoms up to two bonds from the central atom (level 2). The molecular fingerprint consists of the individual atom fingerprints of all the heavy atoms in the structure. (Adapted from ref. 132.)

Several types of ACF fragments were designed to store local spectral parameters (chemical shifts) in spectroscopy databases. Thus, Bremser has developed Hierarchically Ordered Spherical Environment (HOSE), a system of substructure codes aimed at characterizing the spherical environment of single atoms and complete ring systems.137  The codes are generated automatically from 2D graphs and describe structural entities corresponding to chemical shifts. A very similar idea has also been implemented by Dubois et al. in the DARC system based on FREL (Fragment Réduit à un Environnement Limité) fragments.138,139  Xiao et al. have applied Atom-Centered Multilayer Code (ACMC) fragments for structural and substructural searching in large databases of compounds and reactions.140  An important recent application of ACF fragments concerns target prediction (“target fishing”) in chemogenomic data analysis.126,141,142 

Bond-centered fragments (BCF) consist of two atoms linked by the bond and surrounded by one or several shells of atoms separated by the same topological distance from this bond. Although these fragments are rather rarely used in structure–property studies, they can be efficiently used as screens for chemical database processing.143  BCF have been used as a part of MDL keys144,145  for substructure search in chemical databases, database clustering60  and for SAR studies of 17 different types of biological activity.62  Bond-centered fragments have also been used in the DARC system.138,139 

For a set of molecular graphs, a Maximum Common Substructure (MCS) is defined as a largest substructure common to all graphs belonging to the given set. In most practical applications, only MCSs for graph pairs are considered, i.e., for sets containing only two graphs. An MCS can be found by intersecting molecular graphs using several different algorithms (for a review see ref. 146), the best known of which involve clique detection in so-called compatibility graphs. Notably, a pair of graphs can have more than one MCS. The main advantage of MCS fragments is that their complexity is not limited, and therefore they can be used to detect property-relevant features that could not be detected by fragments (subgraphs) of limited complexity.

MCSs were first applied to SAR studies in the early 1980s by Rozenblit and Golender in the framework of their logical-combinatorial approach.40,41,147  Since computer power was limited at that time, the authors suggested the use of reduced graphs (Section 1.3.5) built on pharmacophoric centers. The MCS fragments were subsequently applied to perform a similarity search,148  to cluster chemical databases149,150  as well as to assess biological activities of organic compounds.99,151,152 

Characterizing atoms only by element types is too specific for similarity searching and, therefore, does not provide sufficient flexibility for large-scale virtual screening. For that reason, numerous studies have been devoted to increasing the informational content of fragment descriptors by adding some useful empirical information and/or by representing a part of the molecular graph implicitly. The simplest representatives of such descriptors are “atom pairs and topological multiplets” based on the notion of a “descriptor center” representing an atom or a group of atoms that could serve as a center of intermolecular interactions. Usually, descriptor centers include heteroatoms, unsaturated bonds and aromatic cycles. An atom pair is defined as a pair of atoms (AT) or descriptor centers separated by a fixed topological distance: ATi-Distij-ATj, where Distij is the shortest path (the number of bonds) between ATi and ATj. Analogously, a topological multiplet is defined as a multiplet (usually a triplet) of descriptor centers and the topological distances between each pair of them. In most cases, these descriptors are used in binary form to indicate the presence or absence of the corresponding features in the studied chemical structures.
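The atom-pair definition translates directly into a shortest-path computation on the molecular graph. The sketch below is an assumption-laden illustration: atom types are plain element symbols rather than descriptor centers or pharmacophoric classes, and the function names are hypothetical.

```python
from collections import Counter, deque


def shortest_path_lengths(neighbors, source):
    """Breadth-first search: number of bonds from `source` to every reachable atom."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        a = queue.popleft()
        for b in neighbors[a]:
            if b not in dist:
                dist[b] = dist[a] + 1
                queue.append(b)
    return dist


def atom_pairs(types, neighbors):
    """Count descriptors of the form (type_i, topological distance, type_j)."""
    pairs = Counter()
    atoms = sorted(types)
    for i, a in enumerate(atoms):
        dist = shortest_path_lengths(neighbors, a)
        for b in atoms[i + 1:]:
            if b in dist:  # skip atoms in disconnected components
                t1, t2 = sorted((types[a], types[b]))
                pairs[(t1, dist[b], t2)] += 1
    return pairs


# 1-Propanol, heavy atoms only: C0-C1-C2-O3.
types = {0: "C", 1: "C", 2: "C", 3: "O"}
neighbors = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
ap = atom_pairs(types, neighbors)
print(sorted(ap.items()))
```

Replacing the element symbols in `types` with generalized classes (for example donor, acceptor, hydrophobic) would give pharmacophore-like pairs without changing the enumeration itself.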

Atom pairs were first suggested for SAR studies by Avidon as Substructure Superposition Fragment Notation (SSFN).41,153  They were then independently reinvented by Carhart and co-authors154  for similarity and trend vector analysis. In contrast to SSFN, Carhart's atom pairs are not necessarily composed only of descriptor centers but account for the information about element type, the number of bonded non-hydrogen neighbors and the number of π electrons. Nowadays, Carhart's atom pairs are popular in virtual screening. Topological Fuzzy Bipolar Pharmacophore Autocorrelograms (TFBPA)155  by Horvath are based on atom pairs, in which real atoms are replaced by pharmacophore sites (hydrophobic, aromatic, hydrogen bond acceptor, hydrogen bond donor, cation, anion), while Distij corresponds to different ranges of topological distances between pharmacophores. These descriptors were successfully applied in virtual screening against a panel of 42 biological targets using a similarity search based on several fuzzy and non-fuzzy metrics,156  performing only slightly less well than their 3D counterparts.155  Fuzzy Pharmacophore Triplets (FPT) by Horvath157  are an extension of FBPF156  to three-site pharmacophores. An important innovation in FPT concerns accounting for protolytic equilibrium as a function of pH.157  Owing to this feature, even small structural modifications leading to a pKa shift may have a profound effect on the fuzzy pharmacophore triplets. As a result, these descriptors efficiently discriminate structurally similar compounds exhibiting significantly different activities.157 

Some other topological triplets should be mentioned. Similog pharmacophoric keys by Schuffenhauer et al.158  represent triplets of binary coded types of atoms (pharmacophoric centers) and topological distances between them (Figure 1.2). Atomic types are generalized by four features (represented as four bits per atom): potential hydrogen bond donor, potential hydrogen bond acceptor, bulkiness and electropositivity. The “topological pharmacophore-point triangles” implemented in the MOE software159  represent triplets of MOE atom types separated by binned topological distances. Structure–property models obtained by a support vector machine method with these descriptors have been successfully used for virtual screening of COX-2 inhibitors160  and D3 dopamine receptor ligands.161 

Figure 1.2

Example of a Similog key. (Adapted from ref. 158.)

In organic chemistry, decomposition of molecules into substituents and molecular frameworks is a natural way to characterize molecular structures. In QSAR, both the Hansch–Fujita34,35  and the Free–Wilson36  classical approaches are based on this decomposition, but only the second one explicitly accounts for the presence or the absence of substituent(s) attached to the molecular framework at a certain position. While the Free–Wilson method was originally associated with the multiple linear regression technique, recent modifications of this approach involve more sophisticated statistical and machine-learning approaches, such as principal component analysis162  and neural networks.163 

In contrast to substituents, molecular frameworks are rarely used in SAR/QSAR/QSPR studies. In most cases, they are implicitly involved as indicator variables discriminating different types of molecular motifs (see, for example, ref. 164). The distributions of different molecular frameworks and substituents (side chains) in the databases of known drug molecules have been thoroughly studied by Bemis and Murcko.165,166 

Regarding fragment descriptors, one could imagine a huge number of possibilities to split a molecular graph into constituent fragments. Making a parallel with the decomposition of vectors into a limited number of basis functions, Randić326  suggested the existence of a small set of basic subgraphs that could represent any structure and be used to calculate any molecular property. In particular, for small alkanes a set of disconnected graphs representing paths (chains) of different length has been proposed (Figure 1.3).

Figure 1.3

Randić basic graphs for a maximum number of nodes of 7.

However, it was later found that this set is not sufficient to differentiate any two structures. Skvortsova et al. have extended the set of Randić basic subgraphs by including cyclic fragments and more complex subgraphs consisting of a single node attached to a cyclic fragment.167  This set exhibits good coding uniqueness (i.e., different vectors of descriptors correspond to different structures) and coding completeness (i.e., it can approximate numerous structure–property functions). Basic fragment descriptors of this kind were used in several QSPR studies.168 

In fact, a rigorous solution of the problem of finding a set of basic graph invariants was obtained by Mnukhin169  for simple graphs and then extended to molecular graphs by Baskin, Skvortsova et al.7–9  (Figure 1.4). It has been shown that the complete set of basic graph invariants must be built on all possible subgraphs, and hence it cannot be confined to any subset of limited size. Nonetheless, for many practical tasks the application of a limited number of basic subgraphs and the corresponding fragment descriptors could be useful.

Figure 1.4

Skvortsova's basic graphs for a maximum number of nodes of 5.

Another application of basic subgraphs arises from the possibility8,169  of relating the invariants of molecular graphs to the occurrence numbers of some basic subgraphs. Estrada has developed this methodology for spectral moments of the edge-adjacency matrix of molecular graphs, defined as the traces of the different powers of this matrix:170–172 

Equation 1.2: $\mu_k = \mathrm{tr}(\mathbf{E}^k)$

where μk is the k-th spectral moment of the edge-adjacency matrix E (a symmetric matrix whose elements eij are 1 if edge i is adjacent to edge j and 0 otherwise) and tr is the trace, i.e., the sum of the diagonal elements of the matrix. On the other hand, spectral moments can be expressed as linear combinations of the occurrence numbers of certain structural fragments in the molecular graph. These linear combinations for simple molecular graphs not containing heteroatoms have been reported for acyclic170  and cyclic172  chemical structures.
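The definition of the spectral moments can be checked numerically with a short, dependency-free Python sketch. It assumes a simple hydrogen-suppressed graph given as a list of bonds (so two distinct bonds share at most one atom); the function names are hypothetical and the matrix product is written out by hand only to keep the example self-contained.

```python
def edge_adjacency_matrix(bonds):
    """E[i][j] = 1 if bonds i and j share an atom (i != j), else 0."""
    n = len(bonds)
    E = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if set(bonds[i]) & set(bonds[j]):
                E[i][j] = E[j][i] = 1
    return E


def mat_mul(A, B):
    """Plain square-matrix product."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def spectral_moments(bonds, k_max):
    """Return [mu_0, ..., mu_k_max] with mu_k = tr(E^k)."""
    E = edge_adjacency_matrix(bonds)
    n = len(E)
    power = [[int(i == j) for j in range(n)] for i in range(n)]  # E^0 = identity
    moments = []
    for _ in range(k_max + 1):
        moments.append(sum(power[i][i] for i in range(n)))
        power = mat_mul(power, E)
    return moments


n_butane = [(0, 1), (1, 2), (2, 3)]    # C-C-C-C chain
isobutane = [(0, 1), (0, 2), (0, 3)]   # central carbon 0 with three methyls
mu_butane = spectral_moments(n_butane, 3)
mu_isobutane = spectral_moments(isobutane, 3)
print(mu_butane)     # [3, 0, 4, 0]
print(mu_isobutane)  # [3, 0, 6, 6]
```

The two isomers already illustrate the link to fragment counts: the zeroth moment equals the number of bonds, the second moment is twice the number of pairs of adjacent bonds (2 pairs in n-butane, 3 in isobutane), and the third moment is nonzero only for isobutane, where three bonds meet at one atom.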

To illustrate these notions, consider a correlation between the boiling points of alkanes and their spectral moments reported in ref. 170:

Equation 1.3

The first six spectral moments of the edge-adjacency matrix E are expressed as linear combinations of the occurrence numbers of fragments listed in Figure 1.5:

Equation 1.4
Equation 1.5
Equation 1.6
Equation 1.7
Equation 1.8
Equation 1.9

where |Fi| denotes the occurrence number of subgraph Fi in the molecular graph.

Figure 1.5

First ten structural fragments contained in molecular graphs of alkanes. (Adapted from ref. 170.)

Thus, by replacing the spectral moments in the QSPR Equation (1.3) with their expansions (Equations 1.4–1.9), one can obtain the following QSPR equation with fragment descriptors:

Equation 1.10

Thus, any spectral moment and hence the activities/properties of chemical compounds can be represented by contributions of corresponding fragments. This approach was further extended to molecular graphs containing heteroatoms by weighting the diagonal elements of the bond adjacency matrix.171 

This methodology has been implemented in the TOSS-MODE (TOpological SubStructural MOlecular Design) and TOPS-MODE (TOPological Substructural MOlecular DEsign) methods,173  which were successfully used to assess various physicochemical properties of chemical compounds (retention indices in chromatography,174  diamagnetic and magnetooptic properties,175  dipole moments,176  permeability coefficients through low-density polyethylene,177 etc.), 3D parameters178  and different types of biological activity (sedative/hypnotic activity,173  anti-cancer activity,179  anti-HIV activity,180  skin sensitization,181  herbicide activity,182  affinity to A1 adenosine receptor,183  inhibition of cyclooxygenase,184  antibacterial activity,185  toxicity in Tetrahymena pyriformis,186  mutagenicity,187–189 etc.).

The notion of mined subgraphs is closely linked to graph mining (or subgraph mining), a field concerned with searching for graphs (subgraphs) specifically related to some properties or activities.190–195  The advantage of this approach is that all relevant fragments are available for analysis without the need to consider an almost infinite number of all possible subgraphs, which allows one to select the most “useful” fragments. This methodology196,197  is based on efficient algorithms for mining the most frequent fragments occurring in sets of molecular graphs, such as the AGM (Apriori-based Graph Mining) algorithm by Inokuchi et al.,198  the FSG (Frequent Sub-Graphs) algorithm by Kuramochi and Karypis,199  the chemical sub-structure discovery algorithm by Borgelt and Berthold,200  the gSpan (graph-based Substructure pattern mining) algorithm by Yan and Han,194  the TreeMiner algorithm by Zaki201  and the HybridTreeMiner and CMTreeMiner algorithms by Chi, Yang and Muntz,202,203 etc. The mined subgraphs approach was originally used to classify chemical structures.204,205  Weighted substructure mining, in conjunction with linear programming boosting,206  allows one to build QSAR regression models involving mined fragment descriptors.195 

The success of different fragmentation schemes in SAR/QSAR studies strongly depends on the initial choice of relevant fragment types. Since it is unrealistic to consider all possible fragments because of their enormous number, one must always select a small subset of them. However, any attempt to apply a limited subtype (e.g., to use only chains of a user-specified length) risks being inefficient because important fragments may be missed. One possible solution is to generate substructural fragments using stochastic techniques. Such an approach has been used by Graham et al., who generated “tape recordings” of chemical structures from atom-bond-atom fragments extracted from molecular graphs by random walks.207  In the MolBlaster method by Batista, Godden and Bajorath, for each molecule the program generates a “random fragment profile” representing a population of fragments generated by randomly deleting bonds in a hydrogen-suppressed molecular graph.208  This method was successfully applied in similarity-based virtual screening.209 
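The idea of a random fragment profile can be sketched as follows. This is only an illustration in the spirit of random bond deletion, not the MolBlaster algorithm itself: the number of deleted bonds, the number of iterations, the seed and the labelling of each fragment by its sorted element composition are all assumptions made to keep the example short.

```python
import random
from collections import Counter


def connected_components(atoms, bonds):
    """Split a graph into connected components (lists of atom indices)."""
    neighbors = {a: [] for a in atoms}
    for a, b in bonds:
        neighbors[a].append(b)
        neighbors[b].append(a)
    seen, components = set(), []
    for start in atoms:
        if start in seen:
            continue
        seen.add(start)
        stack, comp = [start], []
        while stack:
            a = stack.pop()
            comp.append(a)
            for b in neighbors[a]:
                if b not in seen:
                    seen.add(b)
                    stack.append(b)
        components.append(comp)
    return components


def random_fragment_profile(elements, bonds, n_delete, n_iter, seed=0):
    """Population of fragments obtained by repeated random bond deletion."""
    rng = random.Random(seed)  # seeded so that profiles are reproducible
    profile = Counter()
    for _ in range(n_iter):
        deleted = set(rng.sample(range(len(bonds)), n_delete))
        kept = [b for i, b in enumerate(bonds) if i not in deleted]
        for comp in connected_components(list(elements), kept):
            # Simplified label: element composition, not a full subgraph code.
            profile[tuple(sorted(elements[a] for a in comp))] += 1
    return profile


# Ethanol, hydrogen-suppressed: C0-C1-O2.
elements = {0: "C", 1: "C", 2: "O"}
bonds = [(0, 1), (1, 2)]
profile_one = random_fragment_profile(elements, bonds, n_delete=1, n_iter=100)
profile_all = random_fragment_profile(elements, bonds, n_delete=2, n_iter=10)
print(profile_one)
print(profile_all)
```

Two molecules can then be compared through their profiles, for example by a similarity coefficient computed on the fragment counts.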

Many studies employ fixed sets of fragments taken from libraries of preselected fragments. Thus, most additive schemes and group contribution methods have been derived using fixed sets of fragments. Some SAR/QSAR/QSPR expert systems also employ fixed sets of selected fragments and often apply an internal language specifically designed for handling the descriptor lists. For example, to describe fragments, the DEREK expert system for assessing toxicity uses the PATRAN language,210  whereas the ALogP method86  for predicting the octanol–water partition coefficient log P is based on the SMARTS line notation [as implemented in the MOE (Molecular Operating Environment) software suite159 ].

Using “special” bond types, molecular graphs can represent not only individual molecules but also more complex species: supramolecular systems, chemical reactions and polymers with periodic structure. For example, the ISIDA program can recognize a “coordination bond” between the central metal atom and the donor atoms of the ligand in metal complexes, and a “hydrogen bond” in supramolecular assemblies.32  Varnek et al. used fragment descriptors derived from “supramolecular” graphs in QSPR modeling of the free energy and enthalpy of formation of 1 : 1 hydrogen bonded complexes.18 

The concept of molecular graphs can also be expanded to describe chemical reactions by introducing special types of “dynamical” bonds corresponding to formation, modification and breaking of chemical bonds (for a review see ref. 211). The resulting reaction graph contains all necessary information to reconstruct both reactants and products in the corresponding reaction equation. Partial reaction graphs containing only “dynamical” bonds were used to classify and enumerate organic reactions in the framework of Ugi–Dugundji matrix formalism212  and the Zefirov–Tratch formal-logical approach.213,214  Vladutz condensed reactants and products of a chemical reaction into a single Superimposed Reaction Skeleton Graph (SRSG)215  containing both dynamical and conventional (not modified in the reaction) bonds. Similar reaction graphs under the name “imaginary transition state” were also suggested by Fujita216,217  for classification and enumeration of organic reactions. This approach has been extended recently by Varnek et al.18  in Condensed Graphs of Reactions (CGRs) containing both “dynamical” and conventional bonds (Figure 1.6). Fragment descriptors derived from CGRs were used in similarity search of reactions, in reaction classification and in the development of QSPR models of the rate constant of SN2 reactions in water.218 
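As an illustration of how reactants and products can be condensed into one graph, the following minimal sketch merges two atom-mapped bond tables and flags the “dynamical” bonds. The data layout (bond orders keyed by pairs of mapped atom indices, with 0 meaning “no bond”) and the function names are assumptions and do not reproduce the ISIDA implementation.

```python
def condensed_graph(reactant_bonds, product_bonds):
    """Merge reactant and product bond tables into one condensed graph.

    Each table maps frozenset({i, j}) of atom-mapped indices to a bond order.
    The result maps each bond to (order_before, order_after); 0 means no bond.
    """
    condensed = {}
    for bond in set(reactant_bonds) | set(product_bonds):
        condensed[bond] = (reactant_bonds.get(bond, 0), product_bonds.get(bond, 0))
    return condensed


def dynamical_bonds(condensed):
    """Return only the bonds that are formed, broken or modified."""
    return {bond: orders for bond, orders in condensed.items()
            if orders[0] != orders[1]}


# Toy substitution: atom 1 attacks atom 2, atom 3 leaves, bond 2-4 is untouched.
reactants = {frozenset({2, 3}): 1, frozenset({2, 4}): 1}
products = {frozenset({1, 2}): 1, frozenset({2, 4}): 1}

cgr = condensed_graph(reactants, products)
changed = dynamical_bonds(cgr)
```

In this toy case `changed` holds one created bond, `(0, 1)`, and one broken bond, `(1, 0)`, while the conventional bond 2-4 is kept in `cgr` with the unchanged pair `(1, 1)`.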

Figure 1.6

Phenol acetylation and the related Condensed Graph of Reaction. “Dynamical” bonds marked in green and red correspond, respectively, to the formation and breaking of a single bond.

To encode reaction transformations Borodina et al. have developed Reacting Multilevel Neighborhood of Atom (RMNA)219  descriptors representing an extended version of the MNA descriptors. Unlike CGRs, where reaction information is condensed, in the RMNA approach the information about modified, created or broken bonds is added to the list of the MNA descriptors generated for all products and reactants. The RMNA descriptors were applied to predict metabolic P450-mediated aromatic hydroxylation.219 

This section discusses different techniques for storing information about molecular fragments. The most common way is to present a given chemical structure as a fixed-size array (vector), in which each element corresponds to the occurrence of a given molecular fragment. Structural keys are descriptor vectors containing binary values indicating the presence or absence of fragments. Since structural keys can be kept in computer memory as bit strings, they are processed very rapidly, which explains their popularity in chemical database management, similarity search, SAR/QSAR studies and virtual screening (Figure 1.7).
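A structural key can be sketched in a few lines, assuming a fixed, ordered fragment dictionary. The dictionary below and the fragment codes in it are hypothetical placeholders and do not correspond to any published key set; a Python integer is used as the bit string.

```python
# Hypothetical, ordered dictionary of predefined fragments (one bit each).
FRAGMENT_DICTIONARY = ["C=O", "O-H", "c:c", "C-N", "C-Cl"]


def structural_key(fragments_present):
    """Return an integer whose bit k is set if fragment k occurs in the molecule."""
    bits = 0
    for position, fragment in enumerate(FRAGMENT_DICTIONARY):
        if fragment in fragments_present:
            bits |= 1 << position
    return bits


# Fragments assumed to have been detected in some molecule.
molecule_fragments = {"C=O", "O-H", "c:c"}
key = structural_key(molecule_fragments)
bit_string = format(key, "05b")  # leftmost character is the last dictionary entry
```

Because each fragment owns exactly one position, the key length equals the dictionary size, which is why the choice of constituent fragments fixes both the content and the length of the key.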

Figure 1.7

Generation of structural keys for a molecule of aspirin.

The composition and length of structural keys always depend on the choice of constituent fragments. Often, structural keys become very sparse, i.e., they contain very few non-zero values. Such a highly imbalanced data representation is rather inefficient for computer processing. As a partial solution to this problem, fragment descriptors can be stored as a list containing only the codes (names) of the fragments that are “ON”. Although the use of lists reduces the storage size, substructure searching in large databases with such lists is still time consuming.

Search efficiency can be improved significantly by using hash tables, which directly link the name of a descriptor to the location of its value. This technology is used in hashed molecular fingerprints operating with binary values (Figure 1.8). In contrast to structural keys, in molecular fingerprints each fragment is mapped onto several cells, the positions of which are computed from the fragment code. The advantage of hashed fingerprints is the possibility of including a large number of fragments in a bit string of reasonable length. Their drawback is the existence of collisions, which occur when two or more fragments are mapped onto the same bit. Nonetheless, this problem can be mitigated by a trade-off between the length of the bit string, the number of fragment types and the number of bits allocated to each fragment.
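The mapping of fragment codes onto several bit positions can be sketched as follows. The use of SHA-256 as the hash function, the bit-string length and the number of bits per fragment are assumptions chosen for illustration; commercial fingerprints rely on their own hashing schemes.

```python
import hashlib


def hashed_fingerprint(fragments, n_bits=64, bits_per_fragment=2):
    """Fold fragment codes into a fixed-length bit string held in an integer.

    Each fragment switches on up to `bits_per_fragment` positions computed
    from its code; different fragments may collide on the same position.
    """
    fingerprint = 0
    for fragment in fragments:
        for k in range(bits_per_fragment):
            digest = hashlib.sha256(f"{fragment}|{k}".encode("utf-8")).digest()
            position = int.from_bytes(digest[:8], "big") % n_bits
            fingerprint |= 1 << position
    return fingerprint


fragment_codes = ["C-C", "C=O", "C-O-H"]  # placeholder fragment codes
fp = hashed_fingerprint(fragment_codes)
bits_on = bin(fp).count("1")  # at most len(fragment_codes) * bits_per_fragment
```

Shortening `n_bits` or raising `bits_per_fragment` makes collisions more likely, which is the trade-off described above.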

Figure 1.8

Generation of hashed fingerprints. Each fragment leads to “switching on” of several bits. A bit with collisions is underlined and shown in bold.

An interesting way of encoding structural information is realized in molecular holograms, which are integer arrays of bins of predetermined length (the hologram length) that contain information about the occurrences of fragments. In the course of generating a molecular hologram, each fragment is coded using SLN (SYBYL Line Notation).220  Using the cyclic redundancy check (CRC) algorithm,221  this code is transformed into a fragment integer ID, indicating the location of the particular bin in the molecular hologram (Figure 1.9). The occupancy of a bin is then incremented by one each time a corresponding fragment occurs. Since the hologram length is always smaller than the number of fragments, several different fragments map to the same bin in the molecular hologram. The resulting bin occupancy is equal to the sum of the occurrence numbers of all these fragments. Molecular holograms were specially designed to be used in the Holographic QSAR (HQSAR) approach.63 
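The binning step can be illustrated with a minimal sketch based on the CRC-32 function of the Python standard library. Reducing the fragment ID modulo the hologram length, the length value itself and the placeholder fragment codes (which are not valid SLN strings) are assumptions; the HQSAR implementation may differ in these details.

```python
import zlib


def molecular_hologram(fragment_codes, hologram_length=53):
    """Count fragment occurrences in a fixed number of bins.

    fragment_codes: one code per fragment occurrence (repeats allowed).
    """
    bins = [0] * hologram_length
    for code in fragment_codes:
        fragment_id = zlib.crc32(code.encode("utf-8"))  # fragment integer ID
        bins[fragment_id % hologram_length] += 1        # assumed ID-to-bin mapping
    return bins


# Placeholder codes; "CH3" occurs twice in this hypothetical molecule.
occurrences = ["CH3", "CH3", "C=O", "OH", "C6H4"]
hologram = molecular_hologram(occurrences)
```

The bin occupancies always sum to the number of fragment occurrences, and two different fragments that share a bin simply add their counts.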

Figure 1.9

Generation of a molecular hologram. A molecule is broken into several structural fragments that are assigned fragment integer identifications (IDs) using the CRC algorithm. Each fragment is then placed in a particular bin based on its fragment integer ID corresponding to the bin ID. The bin occupancy numbers are the molecular hologram descriptors that count structural fragments in each bin. (Adapted from ref. 63.)

Fragments used for building fragment descriptors can be either connected or disconnected. Most applications are based on connected fragments, because the indicators of presence or the occurrences of disconnected fragments can always be expressed through the corresponding values obtained for connected fragments.8  Hence, descriptors based on disconnected fragments are redundant, since they do not carry any additional information compared to their connected counterparts.

Nonetheless, in some cases disconnected-fragment descriptors can simplify QSAR/QSPR equations. In particular, nonlinear models involving connected fragments can be replaced with linear models built on disconnected fragments, because the occurrences of disconnected and connected fragments are nonlinearly related. Thus, the use of disconnected fragments may be viewed as an implicit way of introducing nonlinearity into QSARs/QSPRs. If binary descriptor values are used, disconnected fragments implicitly introduce conjunctions (logical .AND.) into logical expressions instead of nonlinear terms for connected fragments. Tarasov et al.222  have shown that compound structural descriptors defined as combinations of unrelated fragments significantly improve the efficiency of mutagenicity predictions. Implicitly, disconnected fragments, as conjunctions of binary (logical) connected fragment descriptors, were used to build probabilistic SAR models for some biological activities (see ref. 223 and references therein).

In contrast to QSPR studies based on complete (containing all atoms) or hydrogen-suppressed molecular graphs, assessment of biological activity, especially at the qualitative level, often requires greater generalization. In that case, it is convenient to describe chemical structures by reduced graphs, in which each vertex – descriptor center or pharmacophoric center – represents an atom or a group of atoms capable of interacting with biological targets, whereas each edge measures the number of bonds between them. Such a biology-oriented representation of chemical structures was invented in 1982 by Avidon et al. under the name Descriptor Center Connection Graphs (DCCG)41  as a generalization of SSFN descriptors (Section 1.3.1.6).

Figure 1.10(b) shows the DCCG for phenothiazine. In this case, the reduced graph consists of 16 edges and 10 vertices corresponding to descriptor centers shown in Figure 1.10(a). Descriptor centers involve four heteroatoms (1–4; see numbering in Figure 1.10a), which can take part in donor–acceptor interaction with biomolecules and in the formation of hydrogen bonds, three methyl groups (5–7), which can take part in hydrophobic interaction with biomolecules, two benzene rings (8, 9) and one heterocycle (10), which can take part in π–π and π–cation interactions with biomolecules. Eleven edges in the DCCG labeled with positive numbers indicate the topological distances (counted as the number of bonds) between the atoms included in the corresponding descriptor centers, while the negative labels denote relations between rings within a polycyclic system. Such graphs are very useful not only as a source of biology-oriented fragment descriptors but also for pharmacophore based virtual screening.

Figure 1.10

(a) Structure of phenothiazine with descriptor centers marked on it. (Adapted from ref. 41.) (b) Descriptor center connection graph for phenothiazine. (Adapted from ref. 41.)

The atom-pairs proposed by Carhart et al.154  are rather similar to the SSFN descriptors. They can be considered as two-vertex connected fragments of reduced graphs, in which edges correspond to paths between certain atoms. Modifications introduced to the atom-pairs descriptors by Kearsley et al.96  through encoding physicochemical properties of atoms render these fragments even more generic. In 2003 Gillet, Willett and Bradshaw (GWB) introduced another type of reduced graph and demonstrated its high efficiency in similarity searching.224  A GWB reduced graph consisting of six vertices and five edges is shown in Figure 1.11. Its three vertices R correspond to rings, its two vertices L to linkers, while the vertex F corresponds to a feature, in this case an oxygen atom that can form hydrogen bonds. In contrast to DCCG, the edges of GWB reduced graphs are not labeled and correspond to ordinary chemical bonds.

Figure 1.11

Examples of chemical structures corresponding to the same GWB reduced graph of type R/F (shown in center). (Adapted from ref. 224.)

An important feature of the GWB reduced graphs is a hierarchical organization of vertex labels. For example, the label Arn (non-hydrogen-bonding aromatic cycle) is less general than the label Ar (any aromatic cycle), which, in turn, is less general than R (any ring). Due to this feature, GWB reduced graphs can also be organized hierarchically, and the level of their generalization can be controlled (Figure 1.12). Besides similarity searching, fragment descriptors based on GWB reduced graphs have been applied to derive SAR models using decision trees.225 

Figure 1.12

A hierarchy of GWB reduced graphs. (Adapted from ref. 224.)

In some cases selected atoms in molecules can be marked with special labels indicating their particular role in a modeled property. Some examples are (i) local properties, such as atomic charges or NMR chemical shifts, which should always be attributed to a given atom(s), (ii) anchor atoms in a given scaffold to which substituents are attached (Figure 1.13), (iii) atoms forming the main chain in polymers and (iv) reaction centers in a set of reactions. Zefirov et al. have applied labeling in QSPR studies of pKa,226,227  NMR chemical shifts and the reaction rate constant for the acid hydrolysis of esters.226,228  Varnek et al.18  labeled hydrogen bond donor and acceptor centers to model free energies and enthalpies of formation of 1 : 1 hydrogen-bonded complexes.

Figure 1.13

Examples of fragments with marked atoms used for modeling inhibitor activity against HIV-I reverse transcriptase for a congeneric set of HEPT derivatives.

This section considers the application of fragment descriptors at different stages of virtual screening and in silico design.

Filtering is a rule-based approach aimed at a fast assessment of the usefulness of molecules in a given context. In drug design, filtering is used to eliminate compounds with unfavorable pharmacodynamic or pharmacokinetic properties as well as toxic compounds. Pharmacodynamics considers the binding of drug-like organic molecules (ligands) to a chosen biological target. Since the efficiency of ligand–target interactions depends on the spatial complementarity of their binding sites, the filtering is usually performed with 3D-pharmacophores, representing “optimal” spatial arrangements of steric and electronic features of ligands.229,230  Pharmacokinetics is mostly concerned with absorption, distribution, metabolism and excretion (ADME) related properties: octanol–water partition coefficients (log P), solubility in water (log S), blood–brain partition coefficient (log BB), partition coefficients between different tissues, skin penetration coefficient, etc.

Fragment descriptors are widely used for early ADME/Tox prediction, both explicitly and implicitly. The easiest way to filter large databases is to detect undesirable molecular fragments (structural alerts). Appropriate lists of structural alerts have been published for toxicity,231  mutagenicity,232  and carcinogenicity.233  Klopman et al. were the first to recognize the potential of fragment descriptors for this purpose.66,67,69  Their programs CASE,66  MultiCASE,97,234  as well as the more recent MCASE QSAR expert systems,235  proved to be effective tools to assess the mutagenicity67,234,235  and carcinogenicity69,234  of organic compounds. In these programs, sets of biophores (analogs of structural alerts) were identified and used for activity predictions. Several more sophisticated fragment-based expert systems for toxicity assessment – DEREK,210  TopKat236  and Rex237  – have been developed. DEREK is a knowledge-based system operating with human-coded or automatically generated238  rules concerning toxicophores. Fragments in the DEREK knowledge base are defined by means of the linear notation language PATRAN, which codes the information about atoms, bonds and stereochemistry. TopKat uses a large predefined set of fragment descriptors, whereas Rex implements a special kind of atom-pairs descriptors (links). For more information about fragment-based computational assessment of toxicity, including mutagenicity and carcinogenicity, see ref. 239 and references therein.

The most popular filter used in the drug design area is the Lipinski “rule of five”,240  which takes into account the molecular weight, the number of hydrogen bond donors and acceptors, along with the octanol–water partition coefficient log P, to assess the bioavailability of oral drugs. Similar rules of “drug-likeness” or “lead-likeness” were later proposed by Oprea,241  Veber242  and Hann.243  Formally, fragment descriptors are not explicitly involved there. However, most computational approaches that assess log P are fragment-based,244–246  whereas hydrogen bond donor and acceptor sites are the simplest molecular fragments.
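A filter of this kind can be sketched as below, assuming that the four properties have already been computed elsewhere. The thresholds are the commonly quoted ones (molecular weight 500, log P 5, five donors, ten acceptors); the data class, the function names and the tolerance of one violation are assumptions, and the property values in the example are approximate and purely illustrative.

```python
from dataclasses import dataclass


@dataclass
class Properties:
    mol_weight: float
    log_p: float
    h_donors: int
    h_acceptors: int


def rule_of_five_violations(p):
    """Count how many of the four commonly quoted limits are exceeded."""
    return sum([
        p.mol_weight > 500,
        p.log_p > 5,
        p.h_donors > 5,
        p.h_acceptors > 10,
    ])


def passes_filter(p, max_violations=1):
    """Assumed policy: tolerate at most `max_violations` exceeded limits."""
    return rule_of_five_violations(p) <= max_violations


small_molecule = Properties(mol_weight=180.2, log_p=1.2, h_donors=1, h_acceptors=4)
large_molecule = Properties(mol_weight=900.0, log_p=6.5, h_donors=3, h_acceptors=12)

kept = [m for m in (small_molecule, large_molecule) if passes_filter(m)]
```

Only the first, small molecule survives the filter; the second exceeds three of the four limits.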

The notion of molecular similarity (or chemical similarity) is one of the most useful and at the same time one of the most contradictory concepts in chemoinformatics.247,248  The concept of molecular similarity plays an important role in many modern approaches to predicting the properties of chemical compounds, designing chemicals with a predefined set of properties and, especially, in conducting drug design studies by screening large databases containing structures of available (or potentially available) chemicals. These studies are based on the similar property principle of Johnson and Maggiora, which states: similar compounds have similar properties.247  The similarity-based virtual screening assumes that all compounds in a database that are similar to a query compound have similar biological activity. Although this hypothesis is not always valid (see discussion in ref. 249), quite often the set of retrieved compounds is considerably enriched with actives.250 

To achieve high efficiency in similarity-based screening of databases containing millions of compounds, molecular structures are usually represented by screens (structural keys) or by fixed-size or variable-size fingerprints. Screens and fingerprints can contain both 2D- and 3D-information. However, the 2D-fingerprints, which are a kind of binary fragment descriptors, dominate in this area. Fragment-based structural keys, like MDL keys,62  are sufficiently good for handling small and medium-sized chemical databases, whereas processing of large databases is performed with fingerprints having much higher information density. Fragment-based Daylight,251  BCI,252  and UNITY 2D253  fingerprints are the best known examples.

The most popular similarity measure for comparing chemical structures represented by means of fingerprints is the Tanimoto (or Jaccard) coefficient T.254  Two structures are usually considered similar if T > 0.85 (ref. 250; for Daylight fingerprints251 ). Using this threshold, Taylor estimated the probability of retrieving actives as 0.012–0.50 (ref. 255), whereas according to Delaney this probability is even higher, i.e., 0.40–0.60 (ref. 256) (using Daylight fingerprints251 ). These computer experiments confirm the usefulness of the similarity approach as an instrument of virtual screening.
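For binary fingerprints the Tanimoto coefficient is the number of bits set in both fingerprints divided by the number of bits set in at least one of them. A minimal sketch, assuming fingerprints stored as Python integers and adopting the convention (an assumption) that two empty fingerprints have T = 1:

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto (Jaccard) coefficient of two bit strings stored as integers."""
    common = bin(fp_a & fp_b).count("1")   # bits set in both fingerprints
    either = bin(fp_a | fp_b).count("1")   # bits set in at least one of them
    return common / either if either else 1.0


# Two hypothetical 7-bit fingerprints differing in a single position.
fp_query = 0b1110110
fp_candidate = 0b1110100

t = tanimoto(fp_query, fp_candidate)   # 4 common bits / 5 bits in the union
is_similar = t > 0.85                  # threshold quoted in the text
```

Here T = 4/5 = 0.8, so the pair falls just below the 0.85 threshold even though the fingerprints differ in only one bit, which shows how strongly the outcome depends on fingerprint density.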

Schneider et al. have developed a special technique for performing virtual screening referred to as Chemically Advanced Template Search (CATS).257  Within its framework, chemical structures are described by means of so-called correlation vectors, each component of which is equal to the occurrence of a given atom pair divided by the total number of non-hydrogen atoms in the molecule. Each atom in the atom pair is specified as belonging to one of five classes (hydrogen-bond donor, hydrogen-bond acceptor, positively charged, negatively charged, and lipophilic), while topological distances of up to ten bonds are also considered in the atom-pair specification. In ref. 257, the similarity is assessed by the Euclidean distance between the corresponding correlation vectors. CATS has been shown to outperform the MERLIN program with Daylight fingerprints251  for retrieving thrombin inhibitors in a virtual screening experiment.257 
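The construction of such a vector can be sketched on a toy hydrogen-suppressed graph. The one-letter class labels, the sparse dictionary representation and the function names are assumptions, atom typing is taken as given, and the sketch is not the published CATS implementation.

```python
import math
from collections import Counter, deque


def topological_distances(n_atoms, bonds):
    """Shortest path lengths (in bonds) for all connected atom pairs i < j."""
    neighbours = {i: [] for i in range(n_atoms)}
    for i, j in bonds:
        neighbours[i].append(j)
        neighbours[j].append(i)
    distances = {}
    for source in range(n_atoms):
        seen = {source: 0}
        queue = deque([source])
        while queue:  # breadth-first search from `source`
            node = queue.popleft()
            for nxt in neighbours[node]:
                if nxt not in seen:
                    seen[nxt] = seen[node] + 1
                    queue.append(nxt)
        for target, d in seen.items():
            if source < target:
                distances[(source, target)] = d
    return distances


def correlation_vector(atom_classes, bonds, max_distance=10):
    """Typed atom-pair counts scaled by the number of non-hydrogen atoms.

    atom_classes: one label per heavy atom, e.g. "D", "A", "P", "N", "L",
    or None for an atom that belongs to no class.
    """
    n_atoms = len(atom_classes)
    counts = Counter()
    for (i, j), d in topological_distances(n_atoms, bonds).items():
        if d <= max_distance and atom_classes[i] and atom_classes[j]:
            pair = tuple(sorted((atom_classes[i], atom_classes[j])))
            counts[(pair, d)] += 1
    return {key: value / n_atoms for key, value in counts.items()}


def euclidean_distance(u, v):
    keys = set(u) | set(v)
    return math.sqrt(sum((u.get(k, 0.0) - v.get(k, 0.0)) ** 2 for k in keys))


# Toy four-atom chain: donor - lipophilic - lipophilic - acceptor.
vector = correlation_vector(["D", "L", "L", "A"], [(0, 1), (1, 2), (2, 3)])
```

For this chain the donor–acceptor pair at a distance of three bonds occurs once and therefore contributes 1/4 to the vector.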

Hull et al. have developed the Latent Semantic Structure Indexing (LaSSI) approach to perform similarity searching in a low-dimensional chemical space.258,259  To reduce the dimension of the initial chemical space, singular value decomposition is applied to the descriptor-molecule matrix. Molecules are ranked by similarity to a query molecule in the reduced space using the cosine similarity measure,260  with Carhart's atom pairs154  and Nilakantan's topological torsions95  used as descriptors. The authors claim that this approach “has several advantages over analogous ranking in the original descriptor space: matching latent structures is more robust than matching discrete descriptors, choosing the number of singular values provides a rational way to vary the ‘fuzziness’ of the search”.258 

The issue of “fuzzification” of similarity search has been addressed by Horvath et al.155–157  The first fuzzy similarity metric suggested155  relies on partial similarity scores calculated with respect to the inter-atomic distance distributions for each pharmacophore pair. In this case the “fuzziness” enables the comparison of pairs of pharmacophores with different topological or 3D distances. Similar results156  were achieved using a fuzzy and weighted modified Dice similarity metric.260  Fuzzy pharmacophore triplets (FPT, see Section 1.3.1.6) can be gradually mapped onto related basis triplets, thus minimizing binary classification artifacts.157  In a new similarity scoring index introduced in ref. 157, the simultaneous absence of a pharmacophore triplet in two molecules is taken into account. However, this is a less-constraining indicator of similarity than the simultaneous presence of triplets.

Most similarity search approaches require only a single reference structure. However, in practice several lead compounds are often available. This motivated Hert et al.261  to develop the data fusion method, which allows one to screen a database using all available reference structures. Then, the similarity scores are combined for all retrieved structures using selected fusion rules. Searches conducted on the MDL Drug Data Report database using fragment-based UNITY 2D,253  BCI,252  and Daylight251  fingerprints have proved the effectiveness of this approach.
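The fusion step can be sketched as follows, assuming that a similarity score to each reference structure has already been computed for every database compound. Taking the maximum score is only one possible fusion rule and is used here as an assumption; the compound identifiers and scores are invented for illustration.

```python
def fuse_scores(score_tables, rule=max):
    """Combine per-reference similarity scores into one score per compound.

    score_tables: one dict {compound_id: similarity} per reference structure.
    rule: function reducing a list of scores to a single value (e.g. max).
    """
    compounds = set().union(*score_tables)
    return {
        compound: rule([table.get(compound, 0.0) for table in score_tables])
        for compound in compounds
    }


# Hypothetical similarities of three database compounds to two references.
scores_ref_1 = {"cpd1": 0.9, "cpd2": 0.3}
scores_ref_2 = {"cpd1": 0.4, "cpd2": 0.7, "cpd3": 0.5}

fused = fuse_scores([scores_ref_1, scores_ref_2])
ranking = sorted(fused, key=fused.get, reverse=True)
```

Passing another callable as `rule`, for example one that averages the scores, changes the fusion rule without touching the rest of the procedure.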

The main drawback of the conventional similarity search concerns an inability to use experimental information on biological activity to adjust similarity measures. This results in an inability to discriminate relevant and non-relevant fragment descriptors used for computing similarity measures. To tackle this problem, Cramer et al.42  developed substructural analysis, in which each fragment (represented as a bit in a fingerprint) is weighted by taking into account its occurrence in active and in inactive compounds. Subsequently, many similar approaches have been described in the literature.262 
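The idea of weighting fragments by their occurrence in active and inactive compounds can be illustrated with a deliberately simple scheme, in which the weight of a bit is the fraction of the compounds containing it that are active. This particular weight and the summation used for scoring are assumptions made for illustration and are not necessarily the formulas used in the cited work.

```python
from collections import Counter


def fragment_weights(fingerprints, is_active):
    """Weight of each bit = fraction of compounds containing it that are active.

    fingerprints: list of sets of "on" bit positions, one per training compound.
    is_active: list of booleans of the same length.
    """
    active_counts, total_counts = Counter(), Counter()
    for bits, active in zip(fingerprints, is_active):
        for bit in bits:
            total_counts[bit] += 1
            if active:
                active_counts[bit] += 1
    return {bit: active_counts[bit] / total_counts[bit] for bit in total_counts}


def score(bits, weights):
    """Assumed scoring: sum of the weights of the bits set in a compound."""
    return sum(weights.get(bit, 0.0) for bit in bits)


# Hypothetical training set: two actives and one inactive compound.
training = [{1, 2}, {2, 3}, {3}]
labels = [True, True, False]

weights = fragment_weights(training, labels)
new_compound_score = score({1, 3}, weights)
```

Bit 3 occurs in one active and one inactive compound and therefore receives the weight 0.5, whereas bits 1 and 2 occur only in actives and receive the weight 1.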

One more way to conduct similarity-based virtual screening is to retrieve the structures containing a user-defined set of “pharmacophoric” features. In the Dynamic Mapping of Consensus positions (DMC) algorithm263  those features are selected by finding common positions in bit strings for all active compounds. The potency-scaled DMC algorithm (POT-DMC)264  is a modification of DMC in which compound activities are taken into account. The latter two methods may be considered as intermediate between conventional similarity search and probabilistic SAR approaches.

Batista, Godden and Bajorath have developed the MolBlaster method,208  in which molecular similarity is assessed by Differential Shannon Entropy265  computed from populations of randomly generated fragments. For the range 0.64 < T < 0.99, this similarity measure provides the same ranking as the Tanimoto index T. However, for smaller values of T the entropy-based index is more sensitive, since it distinguishes between pairs of molecules having almost identical T. To adapt this methodology for large-scale virtual screening, Proportional Shannon Entropy (PSE) metrics were introduced.209  A key feature of this approach is that the class-specific PSE of random fragment distributions enables the identification of molecules that share a significant number of signature substructures with known active compounds.

Similarity search methods developed for individual compounds are difficult to apply directly to chemical reactions, which involve many species of two types: reactants and products. To overcome this problem, Varnek et al.18  suggested condensing all participating reaction species into one molecular graph [Condensed Graphs of Reactions (CGR),18  see Section 1.3.2], followed by its fragmentation and the application of the resulting fingerprints in a “classical” similarity search. Besides conventional chemical bonds (single, double, aromatic, etc.), a CGR contains dynamical bonds corresponding to created, broken or transformed bonds. This approach could be efficiently used for screening large reaction databases.

Simplistic and heuristic similarity-based approaches can hardly produce predictive models as good as those of modern statistical and machine learning methods, which are able to assess biological or physicochemical properties quantitatively. QSAR-based virtual screening consists of a direct assessment of the activity values (numerical or binary) of all compounds in the database, followed by the selection of hits possessing the desired activity. Mathematical methods used for model building can be subdivided into classification and regression approaches. The former decide whether a given compound is active, whereas the latter numerically evaluate the activity values. Classification approaches that assess the probability of decisions are called probabilistic.

Various classification approaches have been used successfully in conjunction with fragment descriptors to build classification SAR models: Linear Discriminant Analysis (LDA),266,267  Partial Least Squares Discriminant Analysis (PLS-DA),268  Soft Independent Modeling by Class Analogy (SIMCA),269  Artificial Neural Networks (ANN),270  Support Vector Machines (SVM),271  Decision Trees (DT),269,272,273  Spline Fitting with Genetic Algorithm (SFGA),269  etc. Probabilistic methods commonly used with fragment descriptors include Naïve Bayes (NB)142  and its modification implemented in PASS,126  Binary Kernel Discrimination,6  Inductive Logic Programming (ILP),274  Support Vector Inductive Logic Programming (SVILP),133  etc.

Numerous studies have been devoted to classification (probabilistic) approaches used in conjunction with fragment descriptors for virtual screening. Here we present several examples.

Harper et al.6  have demonstrated that the probabilistic “binary kernel discrimination” method performs much better in screening large databases than backpropagation neural networks or conventional similarity search. Carhart's atom-pairs154  and Nilakantan's topological torsions95  were used as descriptors.

Aiming to discover new cognition enhancers, Geronikaki et al.275  applied the PASS program,126  which implements a probabilistic Bayesian-based approach, and the DEREK rule-based system210  to screen a database of highly diverse chemical compounds. Eight compounds with the highest probability of cognition-enhancing effect were selected. Experimental tests showed that all of them possess a pronounced antiamnesic effect.

Bender, Glen et al. have applied129–133  several probabilistic machine learning methods (naïve Bayesian classifier, inductive logic programming, and support vector inductive logic programming) in conjunction with circular fingerprints to classify bioactive chemical compounds and to perform virtual screening on several biological targets. The last of these three methods (i.e., support vector inductive logic programming) performed significantly better than the other two.133  The advantages of using circular fingerprints were pointed out.131 

The Multiple Linear Regression (MLR) method was historically the first and is to date the most popular method used to develop QSAR/QSPR models with fragment descriptors (Figure 1.14). Linear models involving fragments are built in several program packages: CASE,66–69  MULTICASE,97,98  TRAIL,101,102  ISIDA,18  EMMA,276  QSAR Builder from Pharma Algorithms277  and some others. Partial Least Squares (PLS) regression,278,279  an alternative technique for building linear quantitative models, has also been successfully coupled with fragment descriptors.63,128,280–282  This approach is efficiently used in Holographic QSAR (HQSAR)63  (implemented in the Sybyl software253 ) and in the “Generalized Fragment-Substructure Based Property Prediction Method”.282  The success of PLS with fragment descriptors is explained by its efficient handling of multicollinearity, which is a typical problem of fragment descriptors. Two other methods, the Group Method of Data Handling (GMDH)283  and the more recent Maximal Margin Linear Programming Method (MMLPM),284,285  have also proved efficient in building linear models from an initial pool of highly correlated fragment descriptors.

Figure 1.14

General scheme of constructing linear QSAR/QSPR models based on fragment descriptors.

Among nonlinear regression methods used in conjunction with fragment descriptors, the Back-Propagation Neural Networks (BPNN)286–289  occupy a special place. It has been proved7,8  that any molecular graph invariant can be approximated by an output of a BPNN using fragment descriptors as an input. Indeed, numerous studies have shown that the BPNN models based on fragment descriptors efficiently predict various physicochemical properties16,290–294  and some biological activities16,163,295  of organic compounds. A popular ASNN (Associative Neural Networks) approach consists of an ensemble of BPNN coupled with kNN correction in the space of models.296  This technique, together with fragment descriptors, has been successfully used to model the thermodynamic parameters of metal complexation285  and melting point of ionic liquids.297  Besides, the Radial Basis Function Neural Networks298  (RBFNNs) have also been used with fragment descriptors for predicting the properties of organic compounds.285,299  The Support Vector Regression (SVR) technique300–303  is a serious “competitor” of neural networks, as has been demonstrated in QSAR/QSPR studies285,304  involving fragment descriptors.

In drug design, regression QSAR/QSPR models are often used to assess ADME/Tox properties or to detect “hit” molecules capable of binding a certain biological target. Thus, one could mention fragment-based QSAR models for blood–brain barrier permeation,305  skin permeation rate,306  blood–air307  and tissue–air partition coefficients.307  Many theoretical approaches to calculating the octanol–water partition coefficient log P involve fragment descriptors. In particular, this concerns the methods by Rekker,308,309  Leo and Hansch (CLOGP),245,310  Ghose and Crippen (ALOGP),81–83  Wildman and Crippen,86  Suzuki and Kudo (CHEMICALC-2),87  Convard (SMILOGP)88  and Wang (XLOGP).89,90  Fragment-based predictive models for the estimation of solubility in water311  and DMSO311  are also available.

Benchmarking studies on various biological and physicochemical properties305–307,312  show that QSAR/QSPR models involving fragment descriptors in many cases outperform those built on topological, quantum, electrostatic and other types of descriptors.

In this section we consider several examples of virtual screening performed on a database containing only virtual (not yet synthesized or unavailable) compounds. Virtual libraries are usually generated using combinatorial chemistry approaches.313–315  One of the simplest ways is to systematically attach user-defined substituents R1, R2,…, RN to a given scaffold. If the list for the substituent Ri contains ni candidates, the total number of generated structures is:

$$n_{\mathrm{total}} = \prod_{i=1}^{N} n_i = n_1 \times n_2 \times \cdots \times n_N \qquad (1.11)$$

although taking symmetry into account could reduce the library's size. The number of candidates ni for each substituent Ri should be carefully selected to avoid the generation of too large a set of structures (combinatorial explosion). The “optimal” substituents could be prepared using fragments selected at the QSAR stage, since their contributions to activity (for linear models) allow one to estimate the impact of combining the fragments into larger species (Ri). In such a way, a focused combinatorial library could be generated.
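The enumeration and the growth of the library size can be sketched with a Cartesian product over the substituent lists. The scaffold template string and the substituent lists below are hypothetical placeholders, and the count assumes that every combination is generated once, with no symmetry reduction.

```python
from itertools import product
from math import prod


def enumerate_library(scaffold_template, substituent_lists):
    """Yield one structure label per combination of substituents."""
    for combination in product(*substituent_lists):
        yield scaffold_template.format(*combination)


# Hypothetical candidate lists for three attachment points R1, R2 and R3.
substituents = [
    ["H", "CH3", "Cl"],        # n1 = 3
    ["OH", "NH2"],             # n2 = 2
    ["H", "F", "Br", "I"],     # n3 = 4
]

library = list(enumerate_library("scaffold(R1={}, R2={}, R3={})", substituents))
expected_size = prod(len(candidates) for candidates in substituents)  # 3 * 2 * 4
```

Adding a single candidate to each of the three lists would already raise the library size from 24 to 60, which illustrates how quickly the combinatorial explosion sets in.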

The technology combining QSAR modeling, generation of virtual libraries and screening has been implemented in the ISIDA program and applied to the computer-aided design of new uranyl binders belonging to two different families of organic molecules: phosphoryl-containing podands316  and monoamides.317  QSAR models were developed using different machine-learning methods (multi-linear regression analysis, associative neural networks296  and support vector machines301 ) and fragment descriptors (atom/bond sequences and augmented atoms). These models were then used to screen virtual combinatorial libraries containing up to 11000 compounds. Selected hits were synthesized and tested experimentally, and the predicted uranyl binding affinity was shown to agree well with the experimental data. Thus, the initial data sets were significantly enriched with new efficient uranyl binders, and one of the new molecules was found to be more efficient than previously studied compounds. A similar study was conducted for the development of new 1-[(2-hydroxyethoxy)methyl]-6-(phenylthio)thymine (HEPT) derivatives potentially possessing high anti-HIV activity.318  This demonstrates the universality of fragment descriptors and the broad prospects for their use in virtual screening and in silico design.
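The screening stage of such a workflow can be illustrated with a linear fragment model, in which a predicted property is an intercept plus the sum of fragment contributions weighted by occurrence counts. The sketch below is not the ISIDA implementation: the function names, fragment labels, coefficients and threshold are all hypothetical values chosen for illustration, not fitted uranyl-binding data.

```python
def predict(counts, coefficients, intercept):
    """Linear fragment model: intercept + sum(contribution * occurrence count).

    Raises KeyError for a fragment without a coefficient instead of silently
    treating its contribution as zero.
    """
    return intercept + sum(coefficients[frag] * n for frag, n in counts.items())


def screen(library, coefficients, intercept, threshold):
    """Return (name, predicted value) pairs at or above `threshold`, best first."""
    scored = [
        (name, predict(counts, coefficients, intercept))
        for name, counts in library.items()
    ]
    hits = [item for item in scored if item[1] >= threshold]
    return sorted(hits, key=lambda item: item[1], reverse=True)


# Hypothetical model and virtual library (fragment -> occurrence count).
coefficients = {"P=O": 1.5, "C-O-C": 0.5, "C=O": 1.0}
intercept = 2.0
library = {
    "L1": {"P=O": 2, "C-O-C": 3},
    "L2": {"C=O": 1},
    "L3": {"P=O": 1, "C-O-C": 1},
}

for name, value in screen(library, coefficients, intercept, threshold=4.0):
    print(name, value)
```

The ranked list plays the role of the "selected hits" that would go on to synthesis and experimental testing.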

Despite their many advantages, fragment descriptors are not devoid of certain drawbacks, which deserve serious attention. Two main problems should be mentioned: (i) “missing fragments”;319  and (ii) modeling of stereochemically dependent properties.

The term “missing fragments” refers to a mismatch between the lists of fragments generated for the training and test sets. A test set molecule may contain fragments that belong to the same family of descriptors used for the modeling but are absent from the initial pool calculated for the training set. The question then arises whether the model built from that initial pool can be applied to such molecules. This is a difficult problem because it is not clear a priori whether the “missing fragments” are important for the property being predicted. Several strategies to treat this problem have been reported. The ALOGPS program,320  which predicts the lipophilicity and aqueous solubility of chemical compounds, flags calculations as unreliable if the analyzed molecule contains one or more E-state atom or bond types absent from the training set. In this way, the program detects about 90% of large prediction errors.319  The ISIDA program18  calculates a consensus model as an average over the “best” models developed with different sets of fragment descriptors. Each model corresponds to its “own” initial pool of descriptors. If a new molecule contains fragments different from those in that pool, the corresponding model is ignored. As demonstrated by benchmarking studies,285  this improves the predictive performance of the method. For each model, the NASAWIN software99  creates a list of “important” fragments including cycles and all one-atom fragments. The test molecule is rejected if its list of “important” fragments contains fragments absent from the training set.321  The LOGP program for lipophilicity predictions322  uses a set of empirical rules to calculate the contribution of missing fragments.
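The consensus strategy described above (averaging only over models whose descriptor pool covers the new molecule) can be sketched as follows. This is a simplified illustration under stated assumptions, not the actual ISIDA code: each model is taken to be linear, its training pool is stored explicitly as a set, and the class name, field names and numerical values are hypothetical.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class LinearFragmentModel:
    name: str
    intercept: float
    coefficients: dict   # fragment -> contribution (hypothetical values)
    pool: frozenset      # all fragments found in this model's training set

    def covers(self, counts):
        """True if every fragment of the molecule occurs in the training pool."""
        return set(counts) <= self.pool

    def predict(self, counts):
        # Pool fragments that were not selected into the model contribute zero.
        return self.intercept + sum(
            self.coefficients.get(frag, 0.0) * n for frag, n in counts.items()
        )


def consensus_predict(counts_by_model, models):
    """Average the predictions of all models that cover the molecule.

    `counts_by_model` maps a model name to the molecule's fragment counts in
    that model's descriptor family. Returns (prediction, number of models
    used); the prediction is None when every model meets a missing fragment.
    """
    values = [
        m.predict(counts_by_model[m.name])
        for m in models
        if m.covers(counts_by_model[m.name])
    ]
    if not values:
        return None, 0
    return mean(values), len(values)
```

Returning the number of contributing models alongside the prediction lets the caller treat a value backed by very few models as less reliable, in the spirit of the flagging strategy also mentioned above.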

The second problem with fragment descriptors concerns accounting for stereochemical information. Its rigorous treatment is not possible at the graph-theoretical level and requires explicit consideration of hypergraphs.323  In practice, however, it is sufficient to introduce special labels indicating the stereochemical configuration of chiral centers or of (E/Z)-isomers around a double bond, and then to use them in the specification of molecular fragments. Such an approach has been used in hologram fragment descriptors324  as well as in the PARTAN language.238 
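The effect of such labels can be shown with a deliberately tiny example. The sketch below is an assumption-laden illustration rather than the scheme of any cited program: a molecule is reduced to a list of (element, stereo label) pairs, the label format `C[R]` is invented here, and only one-atom fragments are counted.

```python
from collections import Counter


def atom_fragments(atoms, use_stereo=True):
    """Count one-atom fragments.

    `atoms` is a list of (element, stereo_label) pairs, where stereo_label is
    e.g. "R", "S" or None. With use_stereo=False the labels are ignored.
    """
    labels = []
    for element, stereo in atoms:
        if use_stereo and stereo:
            labels.append(f"{element}[{stereo}]")
        else:
            labels.append(element)
    return Counter(labels)


# Two hypothetical enantiomers differing only in the label of one center.
r_form = [("C", "R"), ("C", None), ("O", None), ("N", None)]
s_form = [("C", "S"), ("C", None), ("O", None), ("N", None)]

print(atom_fragments(r_form, use_stereo=False) == atom_fragments(s_form, use_stereo=False))
print(atom_fragments(r_form) == atom_fragments(s_form))
```

Without labels the two descriptor vectors coincide, so no model built on them could distinguish the enantiomers; with labels the vectors differ in the stereo-tagged fragment.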

Fragment descriptors constitute one of the most universal types of molecular descriptors. The scope of their application encompasses almost all existing areas of SAR/QSAR/QSPR studies. Their universality stems from the basic character of structural theory in chemistry as well as from the fundamental possibility of expressing molecular graph invariants in terms of subgraph occurrence numbers.8  The main advantages of fragment descriptors lie in the simplicity of their computation, the ease of their interpretation and the efficiency of their application in similarity searches and SAR/QSAR/QSPR modeling. Progress in their use in virtual screening could come from the development of new types of fragments and of new mathematical approaches for their processing.

The authors thank GDRE SupraChem and the ARCUS “Alsace–Russia/Ukraine” project for support, and Dr V. Solov’ev for fruitful discussions.

Figures & Tables

Figure 1.1

Circular fingerprints with Sybyl mol2 atom typing. An individual fingerprint is calculated for each atom in the molecule, considering those atoms up to two bonds from the central atom (level 2). The molecular fingerprint consists of the individual atom fingerprints of all the heavy atoms in the structure. (Adapted from ref. 132.)

Figure 1.2

Example of a Similog key. (Adapted from ref. 158.)

Figure 1.3

Randić basic graphs for a maximum number of nodes of 7.

Figure 1.4

Skvortsova's basic graphs for a maximum number of nodes of 5.

Figure 1.5

First ten structural fragments contained in molecular graphs of alkanes. (Adapted from ref. 170.)

Figure 1.6

Phenol acetylation and the related Condensed Graph of Reaction. “Dynamical” bonds marked in green and red correspond, respectively, to the formation and breaking of a single bond.

Figure 1.7

Generation of structural keys for a molecule of aspirin.

Figure 1.8

Generation of hashed fingerprints. Each fragment leads to “switching on” of several bits. A bit with collisions is underlined and shown in bold.

Figure 1.9

Generation of a molecular hologram. A molecule is broken into several structural fragments that are assigned fragment integer identifications (IDs) using the CRC algorithm. Each fragment is then placed in a particular bin based on its fragment integer ID corresponding to the bin ID. The bin occupancy numbers are the molecular hologram descriptors that count structural fragments in each bin. (Adapted from ref. 63.)

Figure 1.10

(a) Structure of phenothiazine with descriptor centers marked on it. (Adapted from ref. 41.) (b) Descriptor center connection graph for phenothiazine. (Adapted from ref. 41.)

Figure 1.11

Examples of chemical structures corresponding to the same GWB reduced graph of type R/F (shown in center). (Adapted from ref. 224.)

Figure 1.12

A hierarchy of GWB reduced graphs. (Adapted from ref. 224.)

Figure 1.13

Examples of fragments with marked atoms used for modeling inhibitory activity against HIV-1 reverse transcriptase for a congeneric set of HEPT derivatives.

Figure 1.14

General scheme of constructing linear QSAR/QSPR models based on fragment descriptors.

