 1 Introduction
 2 Strategies in inverse molecular design
 2.1 Genetic algorithms
 2.2 Monte Carlo methods
 2.3 Linear combination of atomic potentials
 2.4 LCAP Hamiltonian in extended Hückel tight-binding framework
 3 Recent applications
 3.1 Applying TB-LCAP approaches for materials discovery
 3.2 Inverse molecular design for nonlinear optical materials
 3.3 Inverse molecular design for dye-sensitized solar cells
 3.4 Experimental verification
 4 Conclusions
Chapter 1: Inverse molecular design for materials discovery

Published: 30 Oct 2013

D. Xiao, I. Warnke, J. Bedford, and V. S. Batista, in Chemical Modelling: Applications and Theory, ed. M. Springborg and J. Joswig, The Royal Society of Chemistry, 2013, ch. 1, pp. 1–31.
Inverse molecular design has emerged as an attractive computational approach to take on the challenges in materials discovery. We present a conceptual formalism for the idea of inverse molecular design as it is implemented to perform optimizations that search along molecular property hypersurfaces constructed directly from the Hamiltonian of molecular systems. We first outline the basic principles and procedures of the discrete optimization algorithms used in inverse design, namely genetic algorithms (GA) and Monte Carlo (MC) methods. Then, we bring the reader's attention to a newer inverse molecular design strategy based on the scheme of linear combination of atomic potentials (LCAP), developed by Beratan and Yang in 2006. Finally, we review the progress made in applying LCAP in the tight-binding framework to successfully design novel nonlinear optical materials and dye-sensitized solar cells. Due to the low computational cost of tight-binding electronic structure calculations, we envision TB-LCAP as a promising approach to inverse molecular design for applications in materials discovery such as catalyst design and solar fuels.
1 Introduction
Discovering materials with optimum properties is a long-term dream^{1–8} for both experimental and theoretical researchers. Historically, scientists used an approach of "trial and error" to find new materials that exhibit desired properties. Owing to developments in modern theoretical and computational chemistry (e.g., density functional theory^{9}), predicting molecular properties using accurate and efficient quantum chemistry methods has become increasingly practical. As a consequence, inverse molecular design^{10–12} has emerged as an attractive computational approach to take on the challenges in materials discovery.
Inverse molecular design is a general term describing strategies in molecular design that stand in contrast to direct design methods. In direct design, a new molecule is proposed first, and then its properties are computed or measured to assess its potential use. In contrast, inverse molecular design aims at searching for optimum points on hypersurfaces defining property–structure relationships, and then mapping out the molecular structures at those optimum points.^{13} Hence, the idea of inverse molecular design can significantly enhance the efficiency and success rate of molecular design and reduce the costs of materials discovery.
Inverse molecular design has been implemented as an optimization method in theory,^{10–12} assisting the search for optimum chemical structures using global optimization algorithms:

f_{inv}: O_{T} → {λ_{i}}, with {λ_{i}} = min_{{λ_{i}}} |O[H({λ_{i}})] − O_{T}|

Here, f_{inv} is a notation for the operation of inverse molecular design. O denotes a molecular property (an observable), which is a functional of the Hamiltonian operator H({λ_{i}}). {λ_{i}} are the set of user-defined variables for varying the Hamiltonian. For example, these variables could be the indices defining a molecule as a composition of molecular fragments, a set of nuclear coordinates, or even atomic numbers.

O_{T} denotes a given target value of a molecular property, e.g. a maximum of the molecular property. The minimization operation 'min' may be performed through a variety of different optimization algorithms that minimize the quantity |O[H({λ_{i}})] − O_{T}|. Thus, f_{inv} aims at finding a particular set {λ_{i}} (and thereby a molecular structure) that best matches the target molecular property. This is a general formulation of the idea of inverse molecular design. When applied to specific systems, the formulation may be adapted for the purpose of optimizing particular target molecular properties.
In this work, we focus on reviewing inverse molecular design based on hypersurfaces of molecular properties vs. molecular structures that are constructed through direct calculations of molecular properties from variable Hamiltonians (representing different chemical structures). Alternatively, analogous hypersurfaces relating property and structure have been constructed based on statistical models for molecular properties with respect to sets of chosen molecular descriptors (for molecular structures or properties). Such hypersurfaces are used extensively for inverse design based on quantitative structure–activity relationships (QSAR). We refer interested readers to the literature^{14–16} on inverse QSAR, which is not the focus of this review.
In molecular structure space, the Hamiltonian variables are associated with the atom types and their spatial arrangement.^{11,17 } Different stochastic and deterministic optimization algorithms^{10–12,18 } have been adapted to work in inverse molecular design methods. The choice of an optimization method depends on how the particular Hamiltonian, linking structure and property, is varied during a search (i.e. depends on the set of Hamiltonian parameters/variables that are varied).
We begin by reviewing optimization algorithms that operate on discrete molecular objects, such as genetic algorithms and Monte Carlo methods. Then, we describe an emerging approach named linear combination of atomic potentials (LCAP), developed by Beratan and Yang^{11} for inverse molecular design. This approach allows us to search for optimum molecules using continuous or discrete optimization algorithms. In particular, we review recent progress made in applying LCAP in the tight-binding (TB) framework, which provides an efficient way to search for molecules. Finally, we review applications of TB-LCAP for optimizing nonlinear optical materials and dye-sensitized solar cells. In particular, novel materials proposed by TB-LCAP have been verified by experiments. Due to the low computational cost of tight-binding electronic structure calculations, we envision that TB-LCAP will be a promising inverse molecular design method for taking on challenges in materials discovery such as catalyst design and solar fuels applications.
2 Strategies in inverse molecular design
Genetic algorithms and Monte Carlo methods are commonly used as optimizers for inverse molecular design in discrete molecular structure space of chemically representable candidates.
2.1 Genetic algorithms
Genetic algorithms^{19} are methods tailored to address complicated multidimensional optimization problems.^{20} They belong to a broader class of evolutionary algorithms^{20,21} which originated in the 1960s and 1970s, when scientists started to explore the possibility of using basic principles of evolution to develop adaptive, highly efficient, and general optimization schemes. Nowadays, GAs find extensive use in a large variety of scientific fields. Notably, over the past 30 years, they have matured into a major tool for computational disciplines in physics, chemistry, and materials science, where they are used for atomistic and electronic-structure-level optimizations of molecular geometries, energies, and properties.^{22}
In inverse molecular design, GAs may be applied as optimization algorithms to locate vectors X = {λ_{1}, λ_{2}, …, λ_{n}} such that |O(X) − O_{T}| becomes minimal. In electronic structure calculations there is generally no simple relationship between X and the property of interest O(X). The calculation of these properties is often complicated and comes at high computational cost.^{23} Efficient optimization algorithms such as GAs are necessary for locating optima on the hypersurface O(X).
Zunger et al.^{4 } adapted GAs for the purpose of inverse molecular design, and discovered structural motifs with optimal bandgaps in quaternary (In, Ga)(As, Sb) semiconductors. Hutchison et al.^{24 } used GAs for optimizing organic polymers for photovoltaics. To shed light on how GAs are used for inverse molecular design, we review selected basic principles and current developments.
GAs are population-based methods.^{25} In the GA context, a population consists of a set of trial solutions {X_{a} | a = 1, …, population size}. Basic selection and recombination principles inspired by Darwinian evolution^{26} drive the adaptation of the population towards a target property, i.e. the optimum of an objective function. Genetic operators^{26} are realizations of the selection and recombination rules. They determine which trial solutions (individuals) are selected from the population and specify how they are recombined to form new ones. The concept of fitness refers to the difference between the objective function value of a trial solution and the global optimum.
Figure 1 illustrates the basic steps common to all GA optimizations. (1) The first step in a GA optimization consists of generating an initial population which, in practice, will often be made of randomized trial solutions X. (2) Subsequently, GAs proceed to evaluate the individuals' fitness. (3) Based on the fitness, a selection operator chooses a number of individuals (parents) for recombination and formation of new individuals (mating). (4) During the mating process, crossover operators^{27} recombine characteristics of parent solutions to form new trial solutions, while mutation operators^{28} introduce small random changes with a certain likelihood. (5) The resulting new trial solutions are then pitched against their parents and form a new generation by replacing the weakest individuals of the previous one. Selection and mating are repeated until convergence is achieved.
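As an illustration of steps (1)–(5), a minimal real-coded GA can be sketched in a few lines of Python. The quadratic toy objective, population size, and mutation width below are illustrative assumptions, not values from the chapter:

```python
import random

def fitness(x):
    # Toy objective: minimize |O(X) - O_T| with O(X) = sum of squares, O_T = 0,
    # so higher fitness (less negative) means a better trial solution.
    return -sum(xi * xi for xi in x)

def evolve(pop_size=20, n_vars=3, generations=100, seed=1):
    rng = random.Random(seed)
    # (1) Random initial population of trial solutions X.
    pop = [[rng.uniform(-5, 5) for _ in range(n_vars)] for _ in range(pop_size)]
    for _ in range(generations):
        # (2) Evaluate fitness; (3) select the two fittest as parents.
        pop.sort(key=fitness, reverse=True)
        p1, p2 = pop[0], pop[1]
        # (4) Single-point crossover plus a Gaussian mutation on one variable.
        cut = rng.randrange(1, n_vars)
        child = p1[:cut] + p2[cut:]
        child[rng.randrange(n_vars)] += rng.gauss(0.0, 0.5)
        # (5) The child replaces the weakest individual if it is fitter.
        if fitness(child) > fitness(pop[-1]):
            pop[-1] = child
    return max(pop, key=fitness)

best = evolve()
```

Real applications replace the toy `fitness` with an electronic structure calculation, which is why the selection and replacement bookkeeping above is kept deliberately cheap.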
Fitness function. The criteria applied in the selection process are expressed in terms of a fitness function f_{fit}(X).^{29,30} To minimize a given objective function, f_{fit} can often be set equal to the objective function |O(X) − O_{T}| itself. However, appropriate scaling of the objective function may influence the roughness of the hypersurface and play an important role in the efficiency of the optimization. Approximations to the objective function can be used to estimate individuals' fitness. Such strategies can increase computational efficiency when the objective function evaluation comes at high computational cost.^{31} Especially when electronic structure methods are needed to calculate energies or locally optimize individuals, the evaluation of a single individual's fitness may take hours or even days. In such cases, an efficient and sufficiently accurate approximation can often be used to identify promising candidates which are in turn treated at a higher theoretical level.
Selection operators^{32} realize methods for choosing individuals from the existing population to generate new individuals (children) in the recombination process. Population size, the number of individuals chosen for recombination, and the number of new individuals to be generated from a given generation are parameters that may be chosen to suit the problem.^{33} In a straightforward selection, a certain number of the fittest individuals would be chosen for recombination. In practice, however, a variety of selection strategies exist which involve the drawing of random numbers.^{32} Such probabilistic schemes introduce a likelihood for weak individuals to be drawn for recombination while ensuring that the fittest individuals have the highest probability of being selected. Most commonly applied are fitness-proportionate selection schemes, where the likelihood of selection is proportional to the individual's fitness. In the alternative tournament selection method, the fittest individuals within randomly chosen subpopulations are drawn. Here, the number of subpopulations and the number of individuals to be drawn from each population are parameters that adjust the selection pressure, i.e. the likelihood of weak individuals to be drawn for recombination. Other selection schemes exist and are outlined in the literature.^{32}
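A minimal tournament selection operator might look as follows; the population labels and fitness values are made-up illustrations:

```python
import random

def tournament_select(population, fitnesses, k=3, rng=None):
    """Pick the fittest individual from a random subpopulation of size k.

    Larger k raises the selection pressure; k = 1 reduces to a uniform
    random draw with no pressure at all.
    """
    rng = rng or random.Random()
    contenders = rng.sample(range(len(population)), k)
    best = max(contenders, key=lambda i: fitnesses[i])
    return population[best]

# Usage: individuals are just labels here; fitness values are invented.
pop = ["A", "B", "C", "D"]
fit = [0.1, 0.9, 0.4, 0.2]
rng = random.Random(0)
parent = tournament_select(pop, fit, k=4, rng=rng)  # k = pop size -> fittest wins
```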
Crossover operators^{27} realize the set of rules according to which the selected parents are recombined to form new individuals that exhibit combined or completely new characteristics. As with all other GA operators, there is no unique way of defining such a procedure, and several variants are commonly used. The simplest version is the single-point crossover operator: it chooses a random crossing point between two parent vectors X and then interchanges the crossed sections. The number of crossing points can be increased to allow for more flexibility in the recombination process. A schematic of the procedure is presented in Fig. 2. The number of parents entering the recombination process, as well as the number of children resulting from the crossover, is sometimes increased. "Cut-and-splice" techniques^{34} differ slightly from the described crossover method; parents can have different crossover points, which allows for evolutionary variation of the dimensionality of the optimization problem, or size of the system. The uniform crossover method^{35} uses a fixed mixing ratio (e.g. 0.5) and allows each bit of information to be interchanged between the parents individually and independently. This strategy is equivalent to choosing a random number of crossing points while maintaining a given ratio of information that is exchanged between the parents.
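The single-point and uniform crossover variants described above can be sketched as follows (the parent vectors are toy binary strings, not molecular encodings):

```python
import random

def single_point_crossover(p1, p2, rng):
    """Swap the tails of two parent vectors at one random crossing point."""
    cut = rng.randrange(1, len(p1))
    return p1[:cut] + p2[cut:], p2[:cut] + p1[cut:]

def uniform_crossover(p1, p2, ratio=0.5, rng=None):
    """Exchange each gene independently with probability `ratio`."""
    rng = rng or random.Random()
    c1, c2 = list(p1), list(p2)
    for i in range(len(p1)):
        if rng.random() < ratio:
            c1[i], c2[i] = c2[i], c1[i]
    return c1, c2

rng = random.Random(42)
a, b = [0, 0, 0, 0], [1, 1, 1, 1]
c1, c2 = single_point_crossover(a, b, rng)
```

Note that both operators conserve the multiset of genes: whatever is removed from one child appears in its sibling.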
Depending on how specifically individuals are encoded in the GA, exchange of random regions during the crossover procedure may not always produce meaningful outcomes. Whenever trial solutions need to satisfy conditions such as minimum or maximum distances between nuclei, the ordering of variables may become important. A number of strategies exist to overcome this challenge. Some schemes involve reparation of a random trial solution, while others create solutions that satisfy existing conditions by design. For example, Ahlrichs and coworkers^{36} describe a crossing method designed for optimizing geometries of clusters. The structures are cut into fragments and recombined such that (i) only contiguous groups of atoms are interchanged and (ii) the minimum interatomic distance is controlled. Before pitching the created individuals against the rest of the current population, their structures are locally relaxed to the closest local minimum.
Mutation operators. Most GAs incorporate the biological concept of mutation to preserve a degree of diversity during evolution of the population. Mutation operators introduce random changes to trial solutions as they are generated in the crossover process. They are designed to help the GA avoid local trapping, which may occur whenever the population has lost diversity due to evolution towards a non-global optimum. Mutation creates random features to enable sampling of new regions in the space of possible variables. The motivation for mutation operators is similar to the reasoning for introducing randomized selection operators that avoid an exclusive drawing of only the fittest individuals. Different versions of mutation operators exist^{28} and are appropriate depending on the nature of the optimization and on whether the algorithm uses vectors of real numbers or concatenated binary strings to represent individuals. For real-coded^{37,38} GAs, a common procedure is to add a Gaussian-distributed random value to an arbitrary variable in the vector. If the population is encoded in the form of binary vectors,^{21} bit operations such as swapping, inverting, and scrambling the bits of variables can be used to introduce mutations. The likelihood of mutation and the number of mutations allowed per individual influence sampling and convergence of the optimization. In general, an excessive mutation probability will slow convergence, as the genetic information stored in the population is lost at an elevated rate and the method becomes equivalent to a random search algorithm.
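Both mutation flavors mentioned above can be sketched briefly; the mutation rates and noise width here are arbitrary illustrative choices:

```python
import random

def gaussian_mutate(x, sigma=0.1, rate=0.2, rng=None):
    """Real-coded mutation: add Gaussian noise to each variable with prob `rate`."""
    rng = rng or random.Random()
    return [xi + rng.gauss(0.0, sigma) if rng.random() < rate else xi for xi in x]

def bitflip_mutate(bits, rate=0.05, rng=None):
    """Binary-coded mutation: invert each bit with probability `rate`."""
    rng = rng or random.Random()
    return [b ^ 1 if rng.random() < rate else b for b in bits]

rng = random.Random(7)
child = gaussian_mutate([1.0, 2.0, 3.0], rng=rng)
genome = bitflip_mutate([0, 1, 0, 1, 0, 1], rate=0.5, rng=rng)
```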
Convergence. Ideally, the algorithm would stop producing new generations when the global optimum is found. However, in most GA applications, there is no clear indicator that the global optimum has been located, i.e., no single criterion exists that is sufficient for global convergence. In practice, different criteria are used to decide when to terminate an optimization.^{39} The time available on a computer system might limit the number of generations that can be produced. For some applications it might be sufficient to locate solutions that exhibit a certain target fitness, regardless of whether they correspond to the global optimum. Convergence is often assumed when the population's fittest individuals stop evolving and remain unchanged for a number of generations. Sometimes, manual inspection of the individuals can give a good indication of convergence. For example, chiral structures may be among the low-energy geometries of metal clusters; if this is the case, then the GA has to locate both isoenergetic enantiomers. The lack of a sufficient convergence criterion is a problem for all global optimization methods.
More aspects of GAs
Seeding. In certain instances, the performance of GA optimizations may be improved greatly by seeding.^{40} If common features or patterns in solutions are known prior to the GA optimization, they may be incorporated into the initial population. For instance, when searching for minimum-energy metal cluster structures of a given size N, the structural motifs that appear in the most stable clusters of size N − 1 may be used to generate initial structures of size N that may already be close to the optimum.
Isolated-island models^{41,42} realize the idea of subdividing the total population into smaller subpopulations which may then evolve independently and under different conditions. Such approaches offer high flexibility and may be advantageous whenever the GA parameters needed for efficient convergence are not known. Subpopulations can be of different sizes, and their evolution may be governed by different selection, crossover, and mutation operators. The islands' subpopulations may be further subdivided, and occasional exchange of individuals between different subpopulations can be allowed. While one population may be small and evolve through a set of GA operators that favor elitism and only small mutation rates, another might be large and allow for more variety. Populations may influence each other through exchange. Many variations are possible.^{43}
Parallel GAs. GAs are particularly well suited for parallelization^{44 }, where the computational load is distributed over a number of individual processors. Generally, the timelimiting step in GA optimizations is the fitness evaluation for each new individual^{31 }, typically an electronic structure calculation of the electronic energy of a molecule or a derived property. In a naive parallel scheme, the computation of each individual's fitness can be delegated to a different processor. Such a parallel scheme is especially beneficial in case of large populations and when many processors are available.
In conclusion, GAs combine a set of characteristics that make them very attractive for molecular design.^{22,45 } Mimicking basic principles of evolution, they constitute a paradigm change in global optimization schemes.^{46 } As population based methods, they do not compute a single function value at a time, but a set of function values corresponding to a population of trial solutions which are initially distributed randomly. As new generations are formed by genetic operators, the newly formed individuals adapt and close in on minima. Human knowledge of the problem is required only for creating the set of rules for evolution, i.e., the genetic operators. However, seeding the initial population may greatly reduce the number of objective function evaluations needed for convergence. As metaheuristic optimization schemes, GAs are conceptually simple, easy to implement and adapt to a specific problem. Meanwhile our fundamental understanding of their convergence properties is still lacking.^{47 } A large variety of different versions of genetic algorithms exist. Different variants may be suited for different optimization problems. Once the specific GA model is chosen, parameters such as population size, the number of children and mutation rates may be adjusted. The success of this class of adaptive optimization algorithms is documented by the vast amount of applications of GAs to problems throughout science. While we introduce basic concepts and discuss several relevant aspects, issues, and advantages of GAs, we refer to recent reviews^{22,45 } for specific applications of GAs in the field of molecular and materials design.
2.2 Monte Carlo methods
Another popular strategy to circumvent exhaustive enumeration in molecular design problems is the use of Monte Carlo (MC) methods.
Zunger et al. first adapted the Monte Carlo method for the purpose of inverse molecular design in a property space computed directly from a quantum mechanical Hamiltonian. They discovered the optimal atomic configuration Al_{0.25}Ga_{0.75}As for tuning optical bandgaps.^{10} To outline the use of Monte Carlo methods in inverse molecular design, we review the basic principles and current developments of the algorithm in the following.
Monte Carlo optimization operates in much the same way as Monte Carlo simulation.^{48} As with the previously discussed GAs, we provide a short overview of MC methods, pointing out key aspects. The Monte Carlo method begins with a random initial point in molecular configuration space (λ_{1}, λ_{2}, …, λ_{N}), where the configuration space is understood to include both the three-dimensional positions of the atoms in the molecule and the identities of the atoms. From (λ_{1}, λ_{2}, …, λ_{N}) a new point in configuration space is selected by an elementary Monte Carlo move. This new point is called (λ_{1,trial}, λ_{2,trial}, …, λ_{N,trial}); the current point is set equal to it with probability

P_{acc} = min[1, exp(−ΔO/T)]

where

ΔO = O(λ_{1,trial}, …, λ_{N,trial}) − O(λ_{1}, …, λ_{N})

and O is the function that maps a point in multidimensional configuration space {λ_{1}, λ_{2}, …, λ_{N}} to a point in one-dimensional property space. Note that these equations correspond to a minimization of the desired property function O. T is a fictitious temperature parameter, kept constant when performing Monte Carlo optimizations. Alternatively, it can be initialized at a high value and slowly decreased to trap the system in the optimal {λ_{1}, λ_{2}, …, λ_{N}}; this is known as simulated annealing.^{49}

In essence, these equations state: if the property value of the trial point in configuration space is less than the current property value, then accept the trial solution. However, if the property value of the trial point is greater than the current property value, then accept the trial solution only if a random number between 0 and 1 is less than exp(−ΔO/T).
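The acceptance rule just described is the standard Metropolis criterion; a minimal sketch, with the property values O passed in as plain numbers, could read:

```python
import math
import random

def metropolis_accept(o_current, o_trial, T, rng):
    """Metropolis acceptance rule for a property minimization.

    Downhill moves (o_trial <= o_current) are always accepted; uphill moves
    are accepted with probability exp(-(o_trial - o_current) / T).
    """
    if o_trial <= o_current:
        return True
    return rng.random() < math.exp(-(o_trial - o_current) / T)

rng = random.Random(0)
accepted = metropolis_accept(1.0, 1.5, T=1.0, rng=rng)  # uphill: random outcome
```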
At first glance it may seem counterproductive to accept less optimal configurations with a nonzero probability. However, by accepting such configurations the algorithm is capable of moving out of local minima and ultimately finding the global minimum.^{49 } The idea is that through repeating MC and SA procedures we move closer to the true global optimum.
Key issues in implementation. The algorithm described above comprises the minimal ingredients required to perform Monte Carlo optimization. In practice, several other issues must be addressed to get meaningful results in an acceptable amount of time. MC optimizations permit great flexibility in how the MC trial configurations are generated; there are in fact many different ways to make an elementary MC move and thus generate λ_{1,trial}, λ_{2,trial}, …, λ_{N,trial}. An important result from MC theory is that the choice of how the Monte Carlo moves are made does not affect the ultimate global optimum reached. Rather, it affects the rate, in MC steps, at which the simulation converges to a global optimum.^{50} It is therefore very important to choose the Monte Carlo move method wisely. The specific details concerning the generation of λ_{1,trial}, λ_{2,trial}, …, λ_{N,trial} can be found in the existing literature.^{51}
The various methods of generating trial configurations are similar in that they try to produce moves that have a high probability of acceptance and produce trial points that are “far” from each other in configuration space. Meeting these two requirements translates into an ability to quickly explore the configuration space.
In addition, for SA we must also address the question of how quickly we allow T to change while optimizing. We know from SA theory that logarithmic-rate cooling guarantees that we end up trapped in the global minimum, but this is often prohibitively expensive in many applications. Another common cooling function is exponential cooling. While this cooling function is computationally advantageous, it often leads to final configurations that are trapped in local minima.^{50}
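A sketch of simulated annealing with an exponential cooling schedule, applied to a toy rugged property surface (all parameter values are illustrative assumptions, not recommendations):

```python
import math
import random

def simulated_annealing(prop, x0, T0=5.0, alpha=0.95, steps=2000, seed=3):
    """Minimize `prop` with exponential cooling T_k = T0 * alpha**k.

    Exponential cooling is cheap but, unlike logarithmic-rate cooling,
    does not guarantee convergence to the global minimum.
    """
    rng = random.Random(seed)
    x, best = list(x0), list(x0)
    T = T0
    for _ in range(steps):
        # Elementary MC move: Gaussian displacement of every variable.
        trial = [xi + rng.gauss(0.0, 0.5) for xi in x]
        dO = prop(trial) - prop(x)
        if dO <= 0 or rng.random() < math.exp(-dO / T):
            x = trial
        if prop(x) < prop(best):
            best = list(x)
        T *= alpha  # exponential cooling schedule
    return best

# Toy rugged property surface with its global minimum at x = 0.
prop = lambda x: sum(xi * xi + math.sin(5 * xi) ** 2 for xi in x)
best = simulated_annealing(prop, [4.0, -4.0])
```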
Recent algorithm advances. There are many aspects that need to be addressed in order for the Monte Carlo method to be successfully applied to the inverse design of molecules. Even with these methods, when the property surface is rugged, the Monte Carlo method may become trapped in local minima. In addition, continuous optimization schemes such as the continuous LCAP search outlined below have been shown to be less efficient at exploring the entire property surface when it is rugged.^{52}
To overcome these issues and apply the LCAP inverse design method to new problems, such as inhibitor design, Hu, Beratan, and Yang devised a way to incorporate local gradient information in the Monte Carlo method.^{52} This method is known as the gradient-directed Monte Carlo (GDMC) method. GDMC uses property gradients to jump between discrete molecules. When the search algorithm is trapped in a local optimum, random MC moves help to overcome local barriers.
GDMC is advantageous for two reasons. Firstly, it is computationally efficient, as it saves computational time by not searching "intermediate" states of the continuous property surface; such states correspond to non-physical structures. Secondly, as mentioned, the method is able to move out of local minima when the property surface is very rugged.
The GDMC method works by constructing a continuous virtual property surface using the LCAP procedure. The algorithm is then fed the initial molecule, and the property of interest is computed along with its gradients. Proposed structures or atom types are generated using the property gradients produced with the LCAP procedure. Finally, these new structures or atom types are accepted or rejected according to the Metropolis rule.
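The interplay of gradient-directed jumps, random MC moves, and Metropolis acceptance can be caricatured on an integer lattice with a toy property function; this sketch only conveys the idea and is not the published GDMC implementation:

```python
import math
import random

def gdmc_minimize(f, grad, x0, steps=200, T=0.5, p_random=0.2, seed=5):
    """Toy gradient-directed Monte Carlo on an integer lattice.

    Proposals usually step each coordinate one unit downhill along the
    continuous-surface gradient; with probability `p_random` a random MC
    move is made instead, to escape local optima. The Metropolis rule
    decides acceptance.
    """
    rng = random.Random(seed)
    x, best = list(x0), list(x0)
    for _ in range(steps):
        if rng.random() < p_random:
            # Random elementary MC move on one coordinate.
            trial = list(x)
            trial[rng.randrange(len(x))] += rng.choice([-1, 1])
        else:
            # Discrete jump directed by the gradient sign per coordinate.
            g = grad(x)
            trial = [xi - (1 if gi > 0 else -1 if gi < 0 else 0)
                     for xi, gi in zip(x, g)]
        dO = f(trial) - f(x)
        if dO <= 0 or rng.random() < math.exp(-dO / T):
            x = trial
        if f(x) < f(best):
            best = list(x)
    return best

# Toy property: minimum at (2, -3) over integer points (assumed example).
f = lambda x: (x[0] - 2) ** 2 + (x[1] + 3) ** 2
grad = lambda x: [2 * (x[0] - 2), 2 * (x[1] + 3)]
best = gdmc_minimize(f, grad, [10, 10])
```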
In conclusion, although MC methods are conceptually simple, some drawbacks exist which often limit their applicability; for example, it is often prohibitively expensive to perform true logarithmic cooling. Recently, very promising advances include the gradient-directed MC method, which combines the favorable properties of LCAP approaches with the advantages of MC methods.
2.3 Linear combination of atomic potentials
Instead of searching for molecules "one by one" in the discrete space of molecular compositions, Beratan and Yang introduced a continuous optimization strategy for inverse molecular design based on the scheme of linear combination of atomic potentials.^{11} The LCAP approach provides a continuous interpolation between discrete molecular structures. This characteristic distinguishes the approach from an earlier strategy.^{13} As a result, a continuous molecular property hypersurface with respect to variations of the chemical structure is constructed. Such a continuous property surface allows property gradients to be defined and evaluated, leading to efficient structure optimization.
LCAP scheme in DFT. In density functional theory, the electronic structure problem is cast as

(T + v_{ext} + v_{H} + v_{xc})φ = Eφ

where T is the kinetic energy operator, v_{ext} is the external potential operator, v_{H} is the Hartree energy operator, v_{xc} is the exchange–correlation energy operator, E is the energy, and φ is the electronic wavefunction. Given the option of a few atom types (or functional groups) at one or more atomic sites, the external potential can be written as a linear combination of atomic potentials^{11}

v_{ext}(r) = Σ_{R} Σ_{A} λ_{A}(R) v_{A}(R)

Here, v_{A}(R) is the external potential of atom or group type A at position R. Specifically, if A represents an atom type, v_{A}(R) is given by

v_{A}(R) = −Z_{A}/|r − R|

where Z_{A} is the atomic number of atom A. The optimization coefficients λ_{A}(R) define the weighting of a particular atom or group type, and are varied during the optimization. For real molecules, one λ_{A}(R) value must be equal to one and all others must be zero at a given R. That is, no more than one chemical unit may exist at any position R. The constraints on λ_{A}(R) during the optimization are Σ_{A} λ_{A}(R) = 1 and 0 ≤ λ_{A}(R) ≤ 1. λ_{A}(R) can vary continuously between 0 and 1, leading to continuous transformation of atom types.

The potential v_{ext}(r) and the number of electrons completely specify the input for the ground-state electronic structure and its properties. The external potential establishes a space on which optimization is conducted to design molecules. When the optimum is reached in a continuous optimization, the λ_{A}(R) values are rounded to the nearest chemically representable molecule, i.e. to zero or one. Earlier optimizations of electronic nonlinear polarizabilities support the viability of this rounding approach.^{11}
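The λ constraints and the final rounding step can be illustrated with a small sketch; the weighting values below are arbitrary:

```python
def normalize(lams):
    """Enforce sum_A lambda_A = 1 with 0 <= lambda_A <= 1 at one site."""
    clipped = [max(0.0, min(1.0, l)) for l in lams]
    total = sum(clipped)
    return [l / total for l in clipped]

def round_to_molecule(site_lams):
    """Round each site's lambdas to the nearest chemically representable
    choice: the largest lambda becomes 1, all others become 0."""
    rounded = []
    for lams in site_lams:
        k = max(range(len(lams)), key=lambda i: lams[i])
        rounded.append([1.0 if i == k else 0.0 for i in range(len(lams))])
    return rounded

# Two variable sites, three candidate atom types each (weights assumed).
sites = [normalize([0.2, 0.7, 0.1]), normalize([0.45, 0.05, 0.5])]
molecule = round_to_molecule(sites)
```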
Following the LCAP scheme in DFT, inverse molecular design methods have also been developed in the frameworks of tight-binding^{18,53} and semiempirical^{54} electronic structure theories.
Optimization on LCAP hypersurfaces (of molecular properties vs. structures) can then be carried out using continuous or discrete optimization algorithms. For example, one can use gradient information to search continuously along LCAP hypersurfaces, as reported initially.^{11} This strategy can be effective when the molecular property surface is not particularly rugged. Alternatively, one can make discrete jumps between real molecules on the property surface, following the LCAP gradient analysis, and introduce stochastic jumps to exit local minima, as in finite-temperature Monte Carlo sampling; this is known as the gradient-directed Monte Carlo (GDMC) approach.^{54,55} Discrete optimization algorithms have also been developed with the LCAP scheme for optimizing hyperpolarizability.^{56,57} The discrete best-first-search scheme was adapted to the LCAP scheme to optimize the photoacidity of 2-naphthol derivatives.^{58}
Variational particle number approach for inverse molecular design. Similar in spirit to LCAP, another inverse molecular design strategy was developed by von Lilienfeld et al. on the basis of variational particle numbers.^{12} This approach uses atomic numbers as variables, and gradients of molecular properties with respect to atomic numbers guide the search for molecules. The authors applied this approach to design ligands for protein binding. Later, the approach was extended to search for molecules in chemical compound space using atomic positions, atom types, and the number of electrons as variables.^{17}
LCAP Hamiltonian in the tight-binding framework. The LCAP principle has been implemented in a Hückel tight-binding framework by Xiao et al.,^{18} denoted TB-LCAP, and later expanded to the extended Hückel framework.^{53} In the following, we present the implementations of TB-LCAP approaches in the Hückel and extended Hückel frameworks, respectively.
In the Hückel tight-binding framework,^{59} the Hamiltonian is represented in matrix form for describing the electronic structure. Interactions are accounted for only between the p_{z} orbitals, whose basis functions are assumed orthonormal, and electronic interactions occur only between nearest-neighbor sites. Thus, the Hückel Hamiltonian matrix is

H_{ij} = α_{i} for i = j; H_{ij} = β_{ij} for nearest-neighbor sites i and j; H_{ij} = 0 otherwise

where i and j = 1, …, n are indices of atomic sites, and n equals the total number of atomic sites in a molecule.
When applying LCAP to the Hückel Hamiltonian matrix, the coefficients do not operate on external potentials but directly on the elements of the Hückel Hamiltonian matrix. Assuming there are n sites in a molecular framework for inverse design, one searches for the optimal atom (or group) type at some (or all) of these sites for the target property. For each variable site, there are N_{type} atom or group choices. The coefficients in Eq. (5) may be expanded to include the coefficients for each site. The site energy H_{ii} (i.e., the diagonal term) is determined by the atoms available at a site, so the LCAP value of this matrix element is

H_{ii} = Σ_{A=1}^{N_{i}^{type}} b_{i}^{A} α_{i}^{A},

where α_{i}^{A} is the site energy of atom type A at site i, b_{i}^{A} is the LCAP weighting coefficient for atom type A at site i, and N_{i}^{type} is the total number of atom types possible at site i. For chemically fixed sites, b_{i}^{A}=1 and the site energy equals the site energy of the atom type at that site, H_{ii}=α_{i}^{A}. The interaction between sites i and j (i.e., the off-diagonal term) is

H_{ij} = Σ_{A} Σ_{A′} b_{i}^{A} b_{j}^{A′} β_{AA′},

where β_{AA′} is the interaction strength between atom type A (at site i) and atom type A′ (at site j). This approach differs from an earlier approach by Kuhn and Beratan,^{13 } who restricted the Hamiltonian matrix elements to interpolate between values arising from several atomic possibilities. Since molecular electronic properties are computed directly from the LCAP Hamiltonian matrix, molecular properties can be optimized by evolving the coefficients b_{i}^{A}.
The choice of candidate atom types for this TB-LCAP approach depends on the availability of the Hückel parameters α and β. In the literature,^{60 } Hückel parameters are available for 13 atom types of the C, N, P, B, O, F, Si, S, and Cl elements. These parameters were obtained from Pariser–Parr–Pople (PPP) calculations.^{61 } Thus, the Hückel TB-LCAP method can be applied to molecular systems whose electronic structures are dominated by interactions among the p_{z} orbitals of the C, N, P, B, O, F, Si, S, or Cl elements. As an example, the Hückel parameters for C, N, and P are shown in Table 1.
Table 1  Hückel parameters for C, N, and P:^{60 } site energies α_{A} (first column) and interaction strengths β_{AA′} (remaining columns), given relative to the standard carbon parameters (α_{C}=0, β_{CC}=1).

        α_{A}     C       N       P
C       0.00     1.00     –       –
N      −0.51     1.02     1.09    –
P      −0.19     0.77     0.78    0.63
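As an illustration, the assembly of an LCAP-weighted Hückel Hamiltonian can be sketched in a few lines of Python. The parameter values below are those of Table 1; the three-site chain, its bond list, and the participation coefficients are hypothetical choices made only for this sketch:

```python
# Sketch: assemble an LCAP-weighted Hueckel Hamiltonian for a short
# conjugated chain. Parameters are from Table 1 (in units of the C-C
# values); the chain and coefficients are illustrative only.

ALPHA = {"C": 0.00, "N": -0.51, "P": -0.19}          # site energies alpha_A
BETA = {("C", "C"): 1.00, ("C", "N"): 1.02, ("N", "N"): 1.09,
        ("C", "P"): 0.77, ("N", "P"): 0.78, ("P", "P"): 0.63}

def beta(a, b):
    """Interaction strength beta_AA', symmetric in the two atom types."""
    return BETA.get((a, b), BETA.get((b, a)))

def lcap_hamiltonian(coeffs, bonds):
    """Build H_ii = sum_A b_i^A alpha_A on the diagonal and
    H_ij = sum_A sum_A' b_i^A b_j^A' beta_AA' for bonded pairs.

    coeffs[i] maps atom type -> participation coefficient b_i^A
    (non-negative, summing to 1 at each site); bonds lists the
    nearest-neighbour pairs."""
    n = len(coeffs)
    H = [[0.0] * n for _ in range(n)]
    for i, b_i in enumerate(coeffs):
        H[i][i] = sum(w * ALPHA[a] for a, w in b_i.items())
    for i, j in bonds:
        h = sum(wa * wb * beta(a, b)
                for a, wa in coeffs[i].items()
                for b, wb in coeffs[j].items())
        H[i][j] = H[j][i] = h
    return H

# Three-site chain: site 0 fixed to carbon, sites 1 and 2 "alchemical"
coeffs = [{"C": 1.0},
          {"C": 0.5, "N": 0.5},
          {"N": 0.25, "P": 0.75}]
H = lcap_hamiltonian(coeffs, bonds=[(0, 1), (1, 2)])
```

When every coefficient is rounded to 0 or 1, the matrix reduces to an ordinary Hückel Hamiltonian of a real molecule, which is exactly the limit exploited at the end of a TB-LCAP optimization.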
2.4 LCAP Hamiltonian in the extended Hückel tight-binding framework
The extended Hückel tight-binding (EHTB) approach has been extensively applied to calculations of the electronic structures of a wide range of molecular and solid-state systems,^{62,63 } as well as to studies of sensitized TiO_{2} surfaces, including the analysis of interfacial electron transfer.^{64–68 } In EHTB theory, the time-independent Schrödinger equation is represented in matrix form,

H Q = S Q E,

where H is the extended Hückel Hamiltonian in the basis of Slater-type atomic orbitals (STOs), Q is the matrix of eigenvectors, E is the diagonal matrix of eigenstate energies, and S is the overlap matrix of the STOs. Note that the Hamiltonian matrix elements include the interactions between any two sites (not just nearest-neighbor sites). In addition, the electronic interactions between all types of orbitals (not just p_{z} orbitals) are considered.
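The generalized eigenvalue problem above can be solved, for instance, by Löwdin symmetric orthogonalization, which turns S^{−1/2}HS^{−1/2} into an ordinary symmetric eigenproblem. The sketch below assumes numpy and uses placeholder 2×2 matrices rather than tabulated extended Hückel parameters:

```python
import numpy as np

# Sketch: solve H Q = S Q E by Loewdin symmetric orthogonalization.
# The 2x2 H (eV) and S values are placeholders, not EH parameters.

def solve_ehtb(h, s):
    """Return eigenvalues E and eigenvectors Q normalized so Q^T S Q = I."""
    w, v = np.linalg.eigh(s)                      # diagonalize the overlap
    s_inv_half = v @ np.diag(w**-0.5) @ v.T       # S^{-1/2}
    e, c = np.linalg.eigh(s_inv_half @ h @ s_inv_half)
    return e, s_inv_half @ c                      # back-transform eigenvectors

h = np.array([[-13.6, -9.0], [-9.0, -11.4]])
s = np.array([[1.0, 0.4], [0.4, 1.0]])
energies, q = solve_ehtb(h, s)
```

The back-transformed eigenvectors satisfy H Q = S Q E with the S-metric orthonormality Q^{T}S Q = I, which is the normalization assumed when transition dipoles are later expanded in the STO basis.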
When the LCAP scheme is applied to the EH Hamiltonian, the LCAP coefficients operate on the Hamiltonian matrix elements. Assuming N_{i}^{type} possible atom types at atomic site i, the participation weight of each atom type A at site i is given by the coefficient b_{i}^{A}, allowed to vary continuously between 0 and 1. At each atomic site, the coefficients sum to 1; i.e., the constraints are 0 ≤ b_{i}^{A} ≤ 1 and Σ_{A} b_{i}^{A} = 1. The diagonal matrix elements of the EHTB-LCAP Hamiltonian are defined as follows:

H_{ii} = Σ_{A} b_{i}^{A} H_{ii}^{A},

where φ_{i}^{A} represents a STO of atom type A and H_{ii}^{A} is the EHTB diagonal Hamiltonian matrix element for atom type A at site i. Note that H_{ii} = H_{ii}^{A} for the specific case of b_{i}^{A} = 1, since all other participation coefficients at site i are then equal to zero. More generally, H_{ii} is an arithmetic average of the EH matrix elements associated with all possible atom types at site i, weighted by their corresponding participation coefficients. The off-diagonal matrix elements are defined as follows:

H_{ij} = Σ_{A} Σ_{A′} b_{i}^{A} b_{j}^{A′} H_{ij}^{AA′},
with φ_{j}^{A′} representing the atomic orbital of atom type A′ at site j, and H_{ij}^{AA′} the original off-diagonal Hamiltonian element for the case of atom type A at site i and atom type A′ at site j:

H_{ij}^{AA′} = (K′/2)(H_{ii}^{A} + H_{jj}^{A′}) S_{ij}^{AA′},

where K′=K+Δ^{2}+Δ^{4}(1−K) is defined according to the original Wolfsberg–Helmholz formula,^{69 } with Δ=(H_{ii}^{A}−H_{jj}^{A′})/(H_{ii}^{A}+H_{jj}^{A′}) and K=1.75. The off-diagonal overlap matrix elements for atom type A at site i and atom type A′ at site j are defined analogously as

S_{ij} = Σ_{A} Σ_{A′} b_{i}^{A} b_{j}^{A′} S_{ij}^{AA′}, with S_{ij}^{AA′} = ⟨φ_{i}^{A}|φ_{j}^{A′}⟩.
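A minimal sketch of these off-diagonal definitions follows, assuming the modified Wolfsberg–Helmholz form H_{ij}^{AA′}=(K′/2)(H_{ii}^{A}+H_{jj}^{A′})S_{ij}^{AA′}. The orbital energies and overlap values passed in are placeholder numbers, not tabulated EH parameters:

```python
# Sketch: modified Wolfsberg-Helmholz off-diagonal element and its
# LCAP-weighted average. Energies (eV) and overlaps are placeholders.

K = 1.75

def wolfsberg_helmholz(h_ii, h_jj, s_ij, k=K):
    """H_ij^{AA'} = (K'/2)(H_ii^A + H_jj^A') S_ij^{AA'} with
    K' = K + Delta^2 + Delta^4 (1 - K)."""
    delta = (h_ii - h_jj) / (h_ii + h_jj)
    k_prime = k + delta**2 + delta**4 * (1.0 - k)
    return 0.5 * k_prime * (h_ii + h_jj) * s_ij

def lcap_offdiagonal(types_i, types_j, s):
    """LCAP mix H_ij = sum_A sum_A' b_i^A b_j^A' H_ij^{AA'}.

    types_i / types_j map atom type -> (coefficient, diagonal energy);
    s[(A, Ap)] is the overlap between the corresponding STOs."""
    return sum(bi * bj * wolfsberg_helmholz(hi, hj, s[(a, ap)])
               for a, (bi, hi) in types_i.items()
               for ap, (bj, hj) in types_j.items())

# Site i fixed; site j an even alchemical mixture of two atom types
h_ij = lcap_offdiagonal({"C": (1.0, -11.4)},
                        {"C": (0.5, -11.4), "N": (0.5, -13.4)},
                        {("C", "C"): 0.25, ("C", "N"): 0.20})
```

In the symmetric limit (H_{ii}^{A}=H_{jj}^{A′}), Δ vanishes and the expression reduces to the original K S_{ij} H_{ii} prescription, as expected.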
The participation coefficients are initialized randomly. These coefficients are subsequently optimized with respect to the property of interest (e.g., solar light absorption) by using the elements H_{ii}, H_{ij}, and S_{ij} as defined by the EHTB Hamiltonian, with STO parameters reported by Hoffmann et al.^{69 } without any further corrections. Extended Hückel parameters are available for most elements of the periodic table, enabling EHTB-LCAP to search molecular structures beyond π-conjugated systems. In addition, the basis sets for extended Hückel calculations include a variety of valence orbital types. For example, the STO basis set includes 3d, 4s, and 4p atomic orbitals for Ti; 2s and 2p atomic orbitals for the O, C, and N elements; and the 1s orbital for H.
3 Recent applications
3.1 Applying TB-LCAP approaches for materials discovery
Once the TB-LCAP Hamiltonian is formulated, molecular property functions are computed from it. By following the gradients of the molecular properties with respect to the TB-LCAP coefficients, one can reach optimum points of the molecular properties using deterministic search algorithms. In what follows, we show how the TB-LCAP approaches have been used to optimize nonlinear optical materials and dye sensitizers for dye-sensitized solar cells, respectively. For each application, the key elements of the TB-LCAP implementation are illustrated in the following order: (i) target molecular properties, (ii) gradients of molecular properties, (iii) molecular framework, and (iv) optimization results. The TB-LCAP hypersurface studies are summarized. The optimization efficiency of TB-LCAP is discussed for the design of nonlinear optical materials at the Hückel level, and experimental verification of the optimized lead chromophores is shown for the design of dye sensitizers on TiO_{2} at the extended Hückel level.
3.2 Inverse molecular design for nonlinear optical materials
When photons interact with nonlinear optical (NLO) materials, their frequencies, phases, and other properties can be altered. Because of this ability to manipulate photonic signals, NLO materials are in high demand for technologies such as optical communication, optical computing, and dynamic image processing.^{70–77 } For molecular design, materials with large NLO responses, such as the first hyperpolarizability,^{72,77 } are desired to improve the efficiency of manipulating photonic signals, e.g., of changing photon frequencies.
Target molecular properties. To find optimum nonlinear optical materials, Xiao et al.^{18 } maximized the electronic hyperpolarizability along the LCAP hypersurface using the Hückel TB-LCAP approach. For comparison, they also maximized the polarizability (a linear optical property) with the same approach. These two target molecular properties are computed from the perturbation of the electronic energy of the molecule. The response of the electronic energy E of a molecule to an external electric field F is

E(F) = E(0) − Σ_{i} μ_{i}F_{i} − (1/2) Σ_{ij} α_{ij}F_{i}F_{j} − (1/6) Σ_{ijk} β_{ijk}F_{i}F_{j}F_{k} − …,

where the indices i, j, and k represent the Cartesian directions and μ is the permanent dipole moment. The molecular polarizability α is (minus) the second derivative of E with respect to F, α_{ij} = −∂^{2}E/∂F_{i}∂F_{j}, and the first hyperpolarizability β is (minus) the third derivative of E with respect to F, β_{ijk} = −∂^{3}E/∂F_{i}∂F_{j}∂F_{k}. The target function for the polarizability is defined as the average of the diagonal elements of α:

⟨α⟩ = (α_{xx} + α_{yy} + α_{zz})/3.
Here, the polarizability components (α_{xx}, α_{yy}, and α_{zz}) are computed using the finite-field method:

α_{ii} = −[E(F_{i}) − 2E(0) + E(−F_{i})]/F_{i}^{2}.
The molecular property function for the hyperpolarizability is^{78 }

β_{tot} = (β_{x}^{2} + β_{y}^{2} + β_{z}^{2})^{1/2}, where β_{a} = (1/3) Σ_{b} (β_{abb} + β_{bab} + β_{bba})

and the subscript a represents x, y, or z. The hyperpolarizability components are computed using the finite-field method; for the diagonal components,

β_{iii} = −[E(2F_{i}) − 2E(F_{i}) + 2E(−F_{i}) − E(−2F_{i})]/(2F_{i}^{3}),

and the mixed components are obtained from analogous finite differences with fields applied along two Cartesian directions.
The electronic energies E(0) (field-free) and E(F_{i}) (with field) are calculated from the trace of the product of the Hamiltonian and density matrices. The Hamiltonian interacting with the field is

H_{kl}(F) = H_{kl}(0) + e Σ_{i} F_{i} ⟨φ_{k}|r_{i}|φ_{l}⟩.

Here, k and l are indices of the p-orbital basis functions. In the Hückel tight-binding approach, the dipole integral is approximated as ⟨φ_{k}|r_{i}|φ_{l}⟩ ≈ δ_{kl}R_{i}^{k}, where R_{i}^{k} is the i^{th} Cartesian coordinate of the k^{th} atom. In the finite-field method, the numerical value of the field F is taken to be 0.01 V/Å. Once optimized, the stability of the property values is verified by comparing the results to sum-over-states (SOS) calculations.^{79 }
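The finite-field recipe above can be checked on a model energy function whose polarizability and hyperpolarizability are known by construction. The values μ=1, α=30, and β=900 below are arbitrary test inputs, not molecular data:

```python
# Sketch: finite-field extraction of alpha_ii and beta_iii from total
# energies, following E(F) = E(0) - mu F - (1/2) alpha F^2 - (1/6) beta F^3.
# The model energy stands in for a tight-binding calculation.

def alpha_ff(energy, f):
    """alpha_ii = -[E(F) - 2 E(0) + E(-F)] / F^2 (central difference)."""
    return -(energy(f) - 2.0 * energy(0.0) + energy(-f)) / f**2

def beta_ff(energy, f):
    """beta_iii = -[E(2F) - 2 E(F) + 2 E(-F) - E(-2F)] / (2 F^3)."""
    return -(energy(2*f) - 2*energy(f) + 2*energy(-f) - energy(-2*f)) / (2 * f**3)

def model_energy(f, mu=1.0, alpha=30.0, beta=900.0):
    # Truncated field expansion with known response coefficients
    return -mu * f - 0.5 * alpha * f**2 - beta * f**3 / 6.0

field = 0.01  # the field strength quoted in the text, in V/Angstrom
a = alpha_ff(model_energy, field)
b = beta_ff(model_energy, field)
```

Because the model expansion is exactly cubic, the central differences recover α and β to machine precision; for a real tight-binding energy the same stencils carry a truncation error of order F^{2}.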
Gradients of target properties. The target properties ⟨α⟩ and β_{tot} were optimized using a continuous optimization algorithm, i.e., the quasi-Newton method.^{80 } In the quasi-Newton method, the gradients are calculated numerically using finite differences,

∂f/∂b_{i}^{A} ≈ [f(b_{i}^{A} + δb) − f(b_{i}^{A})]/δb,

where f represents any target property function and δb represents a small variation of b_{i}^{A}. Once the optimization is complete, each b_{i}^{A} is rounded to 0 or 1 to determine the corresponding chemically representable, “real” molecule.
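The finite-difference gradient and the final rounding step might be sketched as follows; the quadratic toy property stands in for ⟨α⟩ or β_{tot} and is purely illustrative:

```python
# Sketch: forward-difference gradient of a property with respect to the
# LCAP participation coefficients, plus the rounding step that maps an
# "alchemical" optimum back to a real molecule.

def numerical_gradient(f, b, db=1e-6):
    """grad_k f = [f(..., b_k + db, ...) - f(b)] / db for each coefficient."""
    f0 = f(b)
    grad = []
    for k in range(len(b)):
        shifted = list(b)
        shifted[k] += db
        grad.append((f(shifted) - f0) / db)
    return grad

def round_to_molecule(b):
    """Round each participation coefficient to 0 or 1."""
    return [1.0 if x >= 0.5 else 0.0 for x in b]

# Toy property with its maximum at b = (1, 0)
prop = lambda b: -(b[0] - 1.0)**2 - b[1]**2
g = numerical_gradient(prop, [0.4, 0.3])
molecule = round_to_molecule([0.97, 0.03])
```

In practice the update step must also preserve the simplex constraints (0 ≤ b_{i}^{A} ≤ 1, Σ_{A} b_{i}^{A} = 1), which the quasi-Newton implementation handles by construction.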
Molecular frameworks. To apply the Hückel TB-LCAP optimization method, Xiao et al.^{18 } designed ten sp^{2 } molecular frameworks, shown in Fig. 3. Framework 1 is a linear conjugated lattice. Frameworks 2–10 are lattices with ring structures, listed in order of the ascending number of chemical structures that can be built on each framework. The variable sites in these frameworks may take atom types (i.e., C, N, or P) or atom pairs (i.e., CC, NN, CN, NC, CP, PC, NP, or PN). The number of possible chemical structures for these frameworks ranges from 10^{2} to 10^{16}. With the chosen atom types and functional groups, all possible optimal structures for each molecular framework represent synthetically plausible molecules based on known heterocyclic chemistry.^{81 }
Optimum structures. Figs. 4 and 5 show the optimized structures with maximum ⟨α⟩ and β_{tot}, respectively. These structures are global optima, as verified by exhaustive enumeration, and were obtained after 100 random initializations. The comparison reveals interesting findings for optimum nonlinear optical materials. First, hyperpolarizability and polarizability increase with the size of the molecule. For example, the maximal ⟨α⟩ (or β_{tot}) value for framework 1 (the linear conjugated chain) is larger than those of frameworks 2–9 (smaller aromatic frameworks) and smaller than that of framework 10 (a large multi-ring aromatic framework); frameworks 1 and 10 have the largest linear dimensions. This finding is consistent with the literature.^{82–84 } Second, the β-optimized structures are of lower symmetry than the α-optimized structures, verifying that nonlinear optical materials favor asymmetric chromophores such as the donor–acceptor “push–pull” framework.^{70,72,74,83,85,86 } Based on these findings, we conclude that TB-LCAP inverse design can indeed lead to optimal structures or lead chromophores for designing nonlinear optical materials.
Smoothness of LCAP surfaces. Polarizability and first-hyperpolarizability surfaces are shown in Figs. 6 and 7 as functions of the mixing parameters for two sites in molecular framework 8. The property values change relatively smoothly with variation of the group chemistry at the two sites. In addition, at the end of the continuous optimizations, most of the coefficients converge to 0 or 1, and rounding the remaining coefficients to 0 or 1 does not dramatically change the property values. This suggests that the Hückel TB-LCAP hypersurfaces for ⟨α⟩ and β_{tot} are relatively smooth. Only a few distinct (locally) optimal structures arise from the 100 randomly seeded searches, indicating the presence of only a small number of local optima on the hypersurface.
Optimization efficiency. To evaluate the efficiency of the TB-LCAP search, Xiao et al.^{18 } compared the optimization time with the time required for exhaustive enumeration. For example, the exhaustive enumeration time for β_{tot} is 0.8 seconds for framework 2 (with 512 possible structures) and 9 hours for framework 9 (2.4×10^{6} possible structures). The efficiency metric η is defined as the ratio of the exhaustive enumeration time to the average time for one randomly seeded optimization. Setting the convergence precision of the continuous optimization to 10^{−2} and using the average time based on 100 random seed molecules, they constructed a log–log plot of η vs. the number of possible chemical structures in the library, N (see Fig. 8). They find that η increases as N increases. The overall search efficiency is η/M, where M is the number of random initial seeds.
The TB-LCAP optimization generally shows a high efficiency of about 10^{3} when the chemical space size reaches approximately 10^{6}. This suggests that the TB-LCAP search can be very efficient for chemical spaces containing millions of structures, even when using 100 random seeds. Optimizations of different properties show different efficiencies; Fig. 8 shows that optimizing E_{gap} (the energy gap) is more efficient than optimizing ⟨α⟩ or β_{tot}.
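The efficiency bookkeeping itself is simple arithmetic. In this sketch, the 9-hour enumeration time is the framework-9 value quoted above, while the 0.5 s average per seeded search is a hypothetical number chosen only for the example:

```python
# Sketch: search-efficiency metric. eta is the ratio of exhaustive-
# enumeration time to the average time of one randomly seeded LCAP
# optimization; eta/M is the overall efficiency with M random seeds.

def efficiency(t_enumeration, t_per_seed, n_seeds):
    eta = t_enumeration / t_per_seed
    return eta, eta / n_seeds

# 9 hours of enumeration vs. a hypothetical 0.5 s per seeded search,
# using the M = 100 random seeds employed in the study
eta, overall = efficiency(9 * 3600.0, 0.5, 100)
```

Even after dividing by the 100 seeds, the overall efficiency remains far above 1, which is the regime in which the continuous search beats enumeration.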
The tight-binding LCAP implementation described here provides an approximate framework for discovering optimal structures in extremely large molecular libraries. This continuous search of chemical space is effective and efficient: the number of molecular property computations required to find the optimal structure is a small fraction of the overall library size, suggesting the potential utility of this approach for designing structures in large molecular spaces. LCAP optimization may become more costly as geometry optimization and solvent effects are introduced. Results from optimizations using larger libraries suggest that large linear polarizabilities favor symmetric structures and that large first hyperpolarizabilities favor less symmetric structures, as expected.
3.3 Inverse molecular design for dye-sensitized solar cells
The design of molecular chromophores is critically important for the development of dye-sensitized solar cells (DSSCs)^{87,88 } for photoelectricity generation and photocatalysis.^{64,65,89,90 } Since the discovery of DSSCs by O'Regan and Grätzel,^{87 } hundreds of dyes have been developed and tested,^{91–94 } and cells with photoconversion efficiencies as high as 11% have been reported.^{87,88,95 } However, for almost 20 years the outstanding challenge has remained the design of DSSCs with even higher photoconversion efficiency. Xiao et al. introduced an inverse design methodology to guide the synthesis and optimization of lead molecular chromophores with suitable photoabsorption properties.
The target molecular property is the photoabsorption probability, given by

⟨f⟩ = Σ_{pq} f_{pq} P(λ_{pq}),

defined as the sum of products of the oscillator strengths f_{pq} of the electronic state transitions q←p, with transition wavelengths λ_{pq}, times the probability P(λ_{pq}) of having a solar photon at that wavelength,
as modeled by the spectral radiance of a black body at T=5523 K (see Fig. 9). The oscillator strength f_{pq} corresponds to the transition between electronic eigenstates |ψ_{p}⟩ and |ψ_{q}⟩:^{96 }

f_{pq} = [8π^{2}m_{e}c ν_{pq}/(3he^{2})] |μ_{pq}|^{2},

where ν_{pq} is the transition frequency in cm^{−1}, c is the speed of light in vacuum, m_{e} is the electron mass, h is the Planck constant, e is the electron charge, and μ_{pq} is the transition dipole moment. For pure states (i.e., states with participation coefficients equal to either 1 or 0), the transition dipoles are obtained in terms of the EHTB eigenstates |ψ_{p}⟩ and |ψ_{q}⟩ in the basis set of atomic STOs |φ_{i,α}⟩, where α is the type of atomic orbital, i labels the atomic site, and the expansion coefficients Q_{iα,p} are obtained by solving the eigenvalue problem introduced in Eq. (10), giving

μ_{pq} = e Σ_{iα} Σ_{jα′} Q_{iα,p} Q_{jα′,q} ⟨φ_{i,α}|r|φ_{j,α′}⟩.
For alchemical states (i.e., with 0 < b_{i}^{A} < 1), the generalized atomic orbitals are

|φ̄_{i,α}⟩ = Σ_{A} b_{i}^{A} |φ_{i,α}^{A}⟩,

where the matrix elements ⟨φ̄_{i,α}|r|φ̄_{j,α′}⟩ are calculated using the STO atomic basis set. Substituting Eq. (27) into Eq. (24) and then Eq. (24) into Eq. (22), the total ‘photoabsorption’ spectrum of the alchemical structure with participation coefficients b_{i}^{A} is obtained.
To obtain structures that maximize light harvesting, the sum over electronic state transitions, introduced in Eq. (22), must include all allowed transitions in the UV-visible-IR range where there is significant solar radiance. For photoconversion, however, it is important to maximize photoabsorption due to electronic transitions that yield interfacial electron transfer (IET) (e.g., excitations to the adsorbate LUMO, mixed with electronic states in the conduction band, and cross transitions directly into electronic states in the conduction band). For example, the phenyl-acac HOMO-LUMO transition has been shown to yield ultrafast IET into the TiO_{2} surface. Hence, the inverse design of chromophores based on phenyl-acac as the reference structure focuses on the optimization of photoabsorption by maximization of

f_{HL} P(λ_{HL}).

Here, f_{HL} and P(λ_{HL}) are computed by using Eqs. (24) and (23), respectively, for p=HOMO and q=LUMO.
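A sketch of the light-harvesting objective follows, assuming the Planck spectral radiance of a 5523 K black body as the solar model and normalizing P(λ) over the listed transitions (a simplification for this example). The oscillator strengths and wavelengths are illustrative numbers, not computed EHTB values:

```python
import math

# Sketch: weight oscillator strengths by the probability of a solar
# photon at each transition wavelength, modeled as a 5523 K black body.
# The transition list is made up for illustration.

H = 6.62607015e-34   # Planck constant, J s
C = 2.99792458e8     # speed of light, m/s
KB = 1.380649e-23    # Boltzmann constant, J/K

def planck_radiance(lam_m, t=5523.0):
    """Spectral radiance B(lambda, T) of a black body, W sr^-1 m^-3."""
    return (2.0 * H * C**2 / lam_m**5) / math.expm1(H * C / (lam_m * KB * t))

def photoabsorption_target(transitions, t=5523.0):
    """sum_pq f_pq P(lambda_pq), with P normalized over the transitions."""
    weights = [planck_radiance(lam * 1e-9, t) for _, lam in transitions]
    norm = sum(weights)
    return sum(f * w / norm for (f, _), w in zip(transitions, weights))

# (oscillator strength, wavelength in nm): a UV band and a visible band
transitions = [(0.08, 284.0), (0.25, 406.0)]
score = photoabsorption_target(transitions)
```

Because the 5523 K black-body radiance peaks near 525 nm, a transition at 406 nm is weighted more heavily than one at 284 nm, which is exactly why the optimization drives absorption out of the UV and into the visible range.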
Gradients of molecular properties. The total photoabsorption cross section is maximized with respect to the participation coefficients b_{i}^{A} by using standard optimization techniques (e.g., the quasi-Newton BFGS algorithm).^{80 } The gradients of the photoabsorption probability are computed using a finite-increment (finite-difference) expression.
Molecular framework. The phenyl-acac linker is particularly attractive for the sensitization of metal-oxide surfaces because it binds strongly to nanoporous TiO_{2} thin films, even in aqueous solutions and under oxidative conditions.^{65 } However, its photoabsorption spectrum peaks in the UV region. Therefore, it is important to search for structures with a common binding motif to TiO_{2} but with enhanced photoabsorption of solar light leading to interfacial electron injection.
The TB-LCAP inverse design method is illustrated as applied to the reference molecular framework defined by the phenyl-acac anion (see Fig. 10), a chromophore that absorbs only in the UV region (e.g., at 284 nm in methanol solution). To locate structures with maximal photoabsorption in the visible range (due to HOMO-LUMO transitions), the atom types in the six-membered ring are varied continuously as linear combinations of C, N, O, S, and H atoms. The phenyl-acac structure in Fig. 10 is used as the reference framework in the EHTB-LCAP optimization of f_{HL}P(λ_{HL}). The group sites marked by ovals are linear combinations of C–H, N–X, O–X, and S–X, where the symbol X denotes a “dummy atom” site. The boxed sites allow for mixing of the C–H, C=O, and C=S functional groups. These variational degrees of freedom result in a total of 144 possible molecules for this simple molecular framework.
Optimum structures. While the nuclear positions are defined by the reference framework and remain fixed during the optimization, TB-LCAP optimization of the objective function with respect to the participation coefficients identifies optima on the alchemical potential hypersurface. The participation coefficients corresponding to the alchemical optimum are then rounded off to yield a chemically representable candidate, 3-acac-pyran-2-one. The geometry of the candidate structure is then optimized in a DFT/B3LYP gas-phase calculation, yielding the optimum structure. Subsequently, the TDDFT/B3LYP photoabsorption spectrum of 3-acac-pyran-2-one is compared to the TDDFT/B3LYP photoabsorption spectrum of the phenyl-acac structure that served as the starting point for the optimization. This comparison validates the predictive power of the TB-LCAP inverse design approach.
Smoothness of hypersurfaces. In this case, optimizing f_{HL}P(λ_{HL}) means locating maxima on a 14-dimensional hypersurface. The smoothness of this surface is analyzed in Fig. 11, which presents a two-dimensional contour plot of an optimization pathway from the phenyl-acac starting point to 3-acac-pyran-2-one. Sites 1 and 2 vary between C=S and C=O (i.e., b^{(C=O)}+b^{(C=S)}=1), and sites 3 and 4 vary between O–X and N–X (i.e., b^{(O−X)}+b^{(N−X)}=1). The data presented in Fig. 11 suggest that f_{HL}P(λ_{HL}) varies smoothly with variation of b^{(N−X)} and b^{(C=O)}, and that the use of gradients for locating optima is justified. Fig. 11 also indicates small gradients of f_{HL}P(λ_{HL}) in the region with b^{(C=O)}>0.78, where gradient-based local optimization may struggle. To avoid areas with inadequate gradient information, randomized starting points (sets of participation coefficients) that fall into this region are simply rejected. For b^{(C=O)}<0.78, the surface shows a global trend to increase as b^{(N−X)} decreases from 1 to 0, leading to successful searches. Except for a narrow area defined by b^{(C=O)}=0.78−0.85 and b^{(N−X)}=0.62−1.0, where the LCAP search paths might become trapped, any randomly initialized TB-LCAP path leads deterministically to 3-acac-pyran-2-one (i.e., the point with b^{(C=O)}=1 and b^{(O−X)}=1). For example, starting a gradient search at the point (b^{(C=O)}=0.95, b^{(O−X)}=0.80) (see Fig. 11), the gradient-based search algorithm moves along the points (0.86, 0.35), (0.86, 0.34), and (0.90, 0.24) to approach (0.84, 0.01), which is very close to the local alchemical optimum (0.86, 0). Rounding the participation coefficients to 1 and 0 leads to the chemically representable 3-acac-pyran-2-one.
Predicted electron injection property. The time-dependent probability for the electron to remain localized on 3-acac-pyran-2-one at time t after photoexcitation is obtained via quantum electronic dynamics simulations. The results presented in Fig. 12 show that, after excitation to the LUMO of the 3-acac-pyran-2-one adsorbate, the electron is injected into the TiO_{2} surface in about 10 fs. This ultrafast interfacial electron transfer can be used to induce charge separation via photoexcitation of the covalently attached adsorbate.
3.4 Experimental verification
3-acac-pyran-2-one contains a pyran-2-one ring that is also present in commercially available coumarin-derived dyes.^{91,93,97 } The pyran-2-one ring is fused to a phenyl ring, with anchoring groups (–COOH) at positions 5 and 6. C343, NKX-2311, and similar coumarin–COOH derived dyes can be obtained by adding donor/acceptor groups or via extension of the π-conjugated system. This class of dyes is known to exhibit strong absorption bands in the visible range as well as fast interfacial electron injection (fs timescale).^{91,93 } The available literature^{97 } on coumarin-derived dyes suggests that the LCAP-identified 3-acac-pyran-2-one could be used as a lead chromophore in practice. Theory predicts that 3-acac-pyran-2-one will exhibit an absorption in the UV-Vis range and that it will engage in interfacial electron transfer to semiconductor surfaces. To validate these predictions experimentally, 4-acac-coumarin has been synthesized and its behavior analyzed, both in solution and adsorbed on TiO_{2}.
Figure 13 compares the experimental and calculated absorption of the 4-acac-coumarin anion solvated in methanol. One equivalent of NH_{4}OH is added in the experiment to deprotonate 4-acac-coumarin. In semiquantitative agreement with experiment (Fig. 13b), the theoretical spectrum of the 4-acac-coumarin anion in Fig. 13a shows an absorption band at 406 nm.
Figure 14 compares the calculated and experimental absorption of the 4-acac-coumarin anion model structure (Fig. 14a). In the experiment, the 4-acac-coumarin anion is adsorbed onto a thin film of TiO_{2} (Fig. 14b). Calculation and experiment agree on the band around 405–410 nm. Gaussian line broadening (FWHM=60 nm) is used to obtain the theoretical spectrum from the computed excitation energies and oscillator strengths.
The result of the electron dynamics simulation for the 4-acac-coumarin anion adsorbed onto the TiO_{2} anatase (101) surface is shown in Fig. 15. The electron density on the 4-acac-coumarin anion (initially in the LUMO) decays rapidly, in about 15 fs, as the electron is injected into the TiO_{2} conduction band (Fig. 15c).
4 Conclusions
Inverse molecular design has emerged as an attractive computational approach to take on the challenges in materials discovery. Inverse molecular design aims at searching for optimum points on the hypersurfaces defining property–structure relationships and at mapping out the molecular structures at those points. We have presented a conceptual formalism for inverse molecular design as implemented in optimization problems that search along molecular property hypersurfaces constructed directly from the Hamiltonian of the molecular system. We reviewed several state-of-the-art strategies in inverse molecular design and presented the basic principles and procedures of the two most commonly used optimization algorithms: genetic algorithms (GA) and Monte Carlo (MC) methods.
Historically, discrete optimization algorithms (such as GA and MC) have been used for inverse molecular design. Here, we have drawn attention to a newer inverse molecular design strategy based on the linear combination of atomic potentials (LCAP), developed by Beratan and Yang in 2006.^{11 } The LCAP approach provides a continuous interpolation between discrete molecular structures, enabling an efficient search for optimum molecules using deterministic, continuous optimization algorithms. In addition, discrete optimization algorithms have been combined with the LCAP scheme for inverse molecular design. We presented the basic principle of LCAP in the framework of density functional theory, as originally proposed.
In particular, we reviewed the development of LCAP in the framework of tight-binding theory (i.e., the TB-LCAP approaches) and outlined the progress in applying TB-LCAP approaches to the discovery of nonlinear optical materials and dye-sensitized solar cells. The TB-LCAP method was first developed in the Hückel tight-binding framework^{18 } and later expanded into the extended Hückel tight-binding framework.^{53 } The Hückel TB-LCAP method was used to successfully find globally optimal structures for nonlinear optical materials in several extremely large chemical libraries (with 10^{2} to 10^{16} possible molecules). The EHTB-LCAP method successfully predicted a novel lead chromophore linker, 3-acac-pyran-2-one, which could be used as a lead adsorbate anchor in practical applications to dye-sensitized solar cells. Experimental synthesis and spectroscopic characterization confirm that a 3-acac-pyran-2-one derivative yields improved sensitization to solar light and provides robust attachment to TiO_{2}, even under aqueous conditions.
Even though there is ample room for improvement in the TB-LCAP approaches to inverse molecular design, such as relaxing the frozen molecular framework during the optimization and improving the smoothness of the TB-LCAP structure–property hypersurfaces, the TB-LCAP methods are already established as promising tools for discovering materials that are realizable in experiments. Due to the low computational cost of tight-binding electronic structure calculations, we envision that the TB-LCAP approaches will be among the most promising inverse molecular design methods for taking on further challenges in materials discovery, such as catalyst design for solar fuel applications.