- 1.1 Introduction
- 1.2 Text-based Searches Using Enzyme Name
- 1.3 Sequence-driven Approaches
- 1.3.1 Probe Technology Based on PCR Primer Design
- 1.3.2 Pairwise Sequence Alignment-based Strategy
- 1.3.3 Signature-/Key Motif-based Strategy
- 1.4 3D Structure-guided Approach
- 1.4.1 Exploring 3D Structures of Proteins
- 1.4.2 Active Site Topology/Constellation-guided Strategy
- 1.5 Conclusion
Chapter 1: Genome Mining for Enzyme Discovery
Published: 31 May 2018
Special Collection: 2018 ebook collection
Series: Catalysis Series
A. Zaparucha, V. de Berardinis, and C. Vaxelaire-Vergne, in Modern Biocatalysis: Advances Towards Synthetic Biological Systems, ed. G. Williams and M. Hall, The Royal Society of Chemistry, 2018, ch. 1, pp. 1-27.
The emergence of high-throughput sequencing (or Next Generation Sequencing, NGS) in the mid-2000s has generated an incredible amount of protein and gene sequences deposited in the databases, and the number of sequences will be increasing in the future. Nucleic and protein databases are therefore gold mines for discovering novel enzymes. The first step to discover novel enzymes catalyzing a target reaction is to find a gene encoding an enzyme catalyzing that chemical transformation. There are many ways to achieve this goal. This chapter describes a panel of methods based on different approaches such as text mining, sequence identity and 3D structure of the active site or a combination thereof.
1.1 Introduction
Nature appears as the veteran protein engineer since it began its bioengineering ‘experiments’ billions of years ago.1 A number of strategies have been developed to exploit the extraordinarily large source of enzymes contained in genome and metagenome sequences and to discover novel biocatalysts. Historically, strategies have been based on in vivo selection on individual strains or collections of strains and, for the past decade, on metagenomes (consortia of uncultivated microorganisms). In brief, DNA is extracted from mixed microbial populations and size-selected inserts are cloned into suitable expression vectors.2 Screening for enzymatic activities is generally performed in situ and based on an indicator medium. Positive clones are then sequenced to identify the genes of interest. Such an approach is very effective but restricted to enzymes for which a generic assay, most often a colorimetric one, can monitor the activities (e.g. lipases, amylases, oxidases, etc.).3 Moreover, the screening is performed without overexpression of the protein, resulting in limited sensitivity. It is estimated that biotechnology has missed up to 99% of existing microbial resources by using traditional screening techniques. Even when the desired transformation has been identified, the isolation of enzymes from wild-type strains is usually time-consuming and, most of the time, the protein loses much of its catalytic activity.4
In contrast, genome sequence information enables direct cloning of the targeted genes using the polymerase chain reaction (PCR) and thus efficient expression in a suitable heterologous host strain, even if PCR errors and/or expression problems may be encountered. The emergence of high-throughput sequencing (or Next Generation Sequencing, NGS) in the mid-2000s has generated an enormous number of sequences deposited in databases (in October 2016, around 67 000 000 protein sequences from 509 000 species had been deposited in the TrEMBL database; source: http://www.uniprot.org/uniprot/) and more sequences are expected in the future. In addition, the deep sequencing of metagenomes from diverse environments offers a huge reservoir of unexploited enzymes which reflect specific metabolic requirements for a defined process [e.g. waste water treatment, bioremediation (see Chapter 18)], or a particular ecological niche.5
The exploration of the extraordinary amount of available genomic resources can be rationalized, and the experimental effort optimized, by computational methods that try to reveal the sequence/function relationships of proteins. In addition, the ability of an enzyme not only to transform its metabolic substrate, but also to catalyze the same chemical transformation for a range of different substrates, expands the field of conversion possibilities. This property, inherent to many enzymes and recognized for decades as a potential route to biocatalyst discovery, expands the chemical capability of enzymes and the chemistry performed by living cells.6–9 The search for new biocatalysts is then mostly based on the hypothesis of substrate promiscuity, since unnatural substrates are often targeted in organic synthesis.
From a genome sequence, a plethora of information is available, from the function reflected in the name of an enzyme to conserved patterns/signatures and even the (predicted) structure; features that can be parsed to search for new biocatalysts.10 Most of the time, to handle the huge amount of data, the information is retrieved through comprehensive organized databases or using computational approaches. For example, the database BRENDA (BRaunschweig ENzyme DAtabase; www.brenda-enzymes.org) contains extensive details on a full suite of known enzyme substrates, thus providing comprehensive indications about the biocatalytic potential of enzymes.11
All these data offer a great potential to identify new enzymes for biotechnological applications: discovery of novel enzymes with new properties, enhanced or inverted (chemo-, regio-, stereo-) selectivity, altered pH- or temperature profile, improved stability (temperature, solvent etc.), substrate or product inhibition, enhanced catalytic efficiency.
In this chapter, we will present the different genome mining approaches for enzyme discovery. Case studies relevant to synthetic applications will be described.
1.2 Text-based Searches Using Enzyme Name
Protein sequences obtained from the sequencing of single genes, entire genomes or microorganism consortia are available through public databases such as UniProtKB or the NCBI databases. Functional prediction of proteins (name annotation) is mainly performed automatically by sequence comparison with already annotated enzymes. In silico screening of public databases for a specific enzyme, using the name of the enzymatic function as the query, has long been one of the easiest ways to find new enzymes. However, this approach suffers from two main drawbacks: (1) experimentally established functions concern only a tiny fraction of the enzymes, since function is mainly extrapolated from a small number of characterized proteins (partially inventoried in the Swiss-Prot section of UniProt); (2) it is limited by the lack of novelty in the features of the newly identified enzymes.12
The explosion in the amount of data produced by NGS generates a growing number of sequences without reliable annotation. Today, nearly 40% of the sequences stored in the most comprehensive protein database, UniProtKB, are labeled as “uncharacterized protein.” In addition, at least 20% of the functions assigned in databases are estimated to be wrong.12 Consequently, the annotation query approach does not make the most of the data potential, and this will be even truer in the future. Nonetheless, it has been successfully applied in many projects regarding various enzyme families including, but not limited to, nitrilases, cytochromes P450, glycosyl hydrolases and halogenases, of fungal and bacterial origin.13–19 It may also be used as a first-line search that is then refined with additional criteria, e.g., sequence alignment, conserved motifs/key residues, phylogeny or 3D model analysis, to search for particular enzyme features, substrate specificity, thermophily, or new homologs of known enzymes with broad biotechnological applications.20–25
Thus, by combining an annotation query with analysis of the genomic context of the putative gene of interest, Zhu et al. discovered a nitrilase highly active towards the targeted substrate mandelonitrile (Figure 1.1).26 In prokaryotic genomes, the genes encoding enzymes involved in the same biosynthetic pathway are generally co-localized in gene clusters; analysis of the genomic neighborhood of a gene within the chromosome therefore provides clues to the natural function of the encoded enzyme, especially regarding its substrates. First, a query of the NCBI Gene database (http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=gene) using “nitrilase” as an identifier gave hits, which were filtered to keep only bacterial hits containing a carbon–nitrogen hydrolase domain and having a length of 300–385 amino acids. Among the remaining 16 putative nitrilase genes, the genomic context of the bll6402 gene from B. japonicum USDA110 suggested a mandelonitrile metabolic pathway. The bll6402 nitrilase was indeed found to be a mandelonitrile hydrolase, which effectively catalyzed the hydrolysis of mandelonitrile and derivatives to the corresponding carboxylic acids.
Discovery of a nitrilase active towards mandelonitrile by combining annotation query with analysis of the genomic context.
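The filtering step of this case study can be sketched in a few lines of Python. The sketch is illustrative only: the `GeneHit` record, its field names and the toy entries are assumptions made for the example, not the data or tooling used in the cited work. Only the three criteria (bacterial origin, carbon–nitrogen hydrolase domain, length of 300–385 amino acids) come from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GeneHit:
    """Hypothetical record for one database hit (all fields are assumptions)."""
    locus_tag: str
    annotation: str
    kingdom: str
    domains: tuple
    length_aa: int


def filter_hits(hits, keyword="nitrilase",
                domain="carbon-nitrogen hydrolase",
                min_len=300, max_len=385):
    """Keep bacterial hits whose annotation contains the keyword, that carry
    the required domain and whose length lies in [min_len, max_len]."""
    kept = []
    for hit in hits:
        if keyword.lower() not in hit.annotation.lower():
            continue
        if hit.kingdom != "Bacteria":
            continue
        if domain not in hit.domains:
            continue
        if not (min_len <= hit.length_aa <= max_len):
            continue
        kept.append(hit)
    return kept


# Toy records, invented for the example.
toy_hits = [
    GeneHit("toy_1", "putative nitrilase", "Bacteria",
            ("carbon-nitrogen hydrolase",), 334),
    GeneHit("toy_2", "nitrilase", "Fungi",
            ("carbon-nitrogen hydrolase",), 340),
    GeneHit("toy_3", "nitrilase", "Bacteria",
            ("carbon-nitrogen hydrolase",), 512),
    GeneHit("toy_4", "nitrilase-like protein", "Bacteria", (), 320),
]

selected = [hit.locus_tag for hit in filter_hits(toy_hits)]
```

The genomic-context analysis that follows this filter in the case study is not covered by the sketch; it requires inspecting the neighboring genes of each retained hit.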
Given that many open reading frames have no predicted function or are incorrectly annotated, the text-based approach appears less attractive in the context of broad research projects. To remain viable, this method should be accompanied by an effort to accurately annotate enzyme families. Text-based searches can nevertheless be of interest in some curated databases; examples include the Carbohydrate-Active enZYmes Database (CAZy: http://www.cazy.org/) and the Lipase Engineering Database (http://www.led.uni-stuttgart.de/).
1.3 Sequence-driven Approaches
This section presents the different ways to access new enzymes by using at least one described protein or nucleotide sequence. These approaches are certainly among the most common ones used to discover new enzymes. They are based on the analysis of the primary sequence of proteins, either as a whole or in specific portions.
With a characterized enzyme and related gene as a starting point, new enzymes performing the same or similar reactions can be identified. The gene sequence encoding the known protein can be used as template to target unsequenced genes by designing primers for their amplification or to target already sequenced genes by pairwise protein sequence alignment tools. These approaches explore the steadily expanding protein-sequence data space and open the way for the efficient discovery of novel biocatalysts.
1.3.1 Probe Technology Based on PCR Primer Design
Sequence homology-based screening, involving polymerase chain reaction (PCR)-based approaches targeting novel genes similar to already known ones, is very fruitful. It also enables exploration of complex DNA mixtures from metagenomes for which no genomic data are available. This approach is based on the design of PCR primers using as the template a parental gene encoding an enzyme with a catalytic activity similar to that of the targeted enzyme. These primers are then made degenerate, giving a set of primers that covers all combinations of the alternative nucleotide triplets able to encode each amino acid at every position. These primers are used for PCR amplification of closely homologous genes.
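The notion of primer degeneracy can be made concrete with a short Python sketch. It back-translates a conserved peptide into a degenerate primer written with IUPAC nucleotide codes and counts how many distinct oligonucleotides the primer represents. The peptide used and the function names are assumptions chosen for illustration. The position-wise encoding is a simplification: for amino acids with six codons (Leu, Ser, Arg) it also covers some codons for other residues, which real primer design tools handle more carefully.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

# IUPAC nucleotide ambiguity codes.
IUPAC_BASES = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}
SET_TO_CODE = {frozenset(bases): code for code, bases in IUPAC_BASES.items()}


def degenerate_primer(peptide):
    """Return (primer, degeneracy) for a peptide, position by position."""
    primer = []
    degeneracy = 1
    for aa in peptide:
        codons = [codon for codon, res in CODON_TABLE.items() if res == aa]
        if not codons:
            raise ValueError(f"unknown amino acid: {aa!r}")
        for pos in range(3):
            bases = frozenset(codon[pos] for codon in codons)
            primer.append(SET_TO_CODE[bases])
            degeneracy *= len(bases)
    return "".join(primer), degeneracy


def expand(primer):
    """List every concrete oligonucleotide encoded by a degenerate primer."""
    choices = (sorted(IUPAC_BASES[code]) for code in primer)
    return ["".join(oligo) for oligo in product(*choices)]


# Hypothetical conserved peptide, chosen only for the example.
primer, degeneracy = degenerate_primer("FWGH")
oligos = expand(primer)
```

A highly degenerate primer represents a large pool of oligonucleotides, so conserved stretches rich in low-degeneracy residues such as Trp or Met are usually the most convenient targets.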
Since these primers are most often designed in inner conserved regions, the new genes are often only partially amplified and the gene sequences are usually completed by inverse PCR to obtain the flanking sequences. The full gene is later cloned into an expression vector. For example, new type I BVMO genes were discovered by using highly degenerate oligonucleotides. Those primers were used for amplification of an internal conserved region of type I BVMO genes present in total DNA isolated from strains able to grow on alicyclic compounds.27 As mentioned above, this method is largely applicable to metagenome exploration, as illustrated by the discovery from marine metagenomes of new laccases with alkalescence-dependent activity, by using highly degenerate primers designed to target the conserved region in the copper-binding sites of laccases.28 Other enzymes from different families were discovered through this approach, such as alcohol dehydrogenases,29 lipases,30 cytochromes P450,31 2,5-diketo-D-gluconic acid reductases32 or alpha-amino acid pyruvate transaminases.33
By combining this probe technology with genomic context analysis, new enzymes have been identified in strains producing novel natural products (Figure 1.2). The paradigm used in this approach was that products with identical or similar structural elements are produced by biosynthetic pathways that contain highly homologous enzymes.34 For example, halogenases were used as suitable targets to predict the biosynthetic potential of different Actinomycetes strains. It was predicted that Actinomycetes that harbor putative halogenase sequences in a particular genetic context have the potential to synthesize compounds that belong to the respective natural product class.
Combined strategy to identify new natural products (inspired by the Pelzer study34).
Nevertheless, it should be noted that the PCR probe-based screening approach is inherently conservative since the primers reflect conserved amino acid sequence motifs to match the targeted genes with reasonable likelihood.
1.3.2 Pairwise Sequence Alignment-based Strategy
To identify new enzymes, a “sequence identity” strategy can be applied to the complete nucleotide and protein sequences indexed in databases. Unlike the probe technology described earlier in this chapter, in the “sequence identity” strategy the search is conducted through available genomic data. One practical advantage is that the genes can easily be cloned from the original source organism or synthesized. Enzymes with experimentally validated catalytic activity are usually used as queries for pairwise protein sequence alignment (for example with the BLAST algorithm, https://blast.ncbi.nlm.nih.gov/Blast.cgi), mainly against protein databases such as UniProtKB or GenBank. Adjusting the BLAST parameters, for example the percentage identity cut-off, allows retrieval of sequences more or less similar to the query enzymes.
When many sequences are then retrieved, one can reduce their number to proteins representative of the functional diversity of the enzyme family. This can be guided by a partition of the family into putative iso-functional subfamilies. The simplest method is a clustering based on protein sequence identity. Other methods can be used, such as phylogenetic analyses, genomic contexts or structural classification (Figure 1.3). Candidate enzymes are then selected from each hypothesized iso-functional group in order to clone their corresponding genes.
Pairwise sequence alignment-based strategy for new enzyme discovery.
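The simplest partition mentioned above, clustering on protein sequence identity, can be sketched as follows. This is a minimal greedy, representative-based clustering written for illustration, not the procedure of any cited study. It assumes the sequences have already been aligned to equal length (gaps written as `-`), and the toy sequences, names and the 50% threshold are assumptions made for the example.

```python
def percent_identity(seq_a, seq_b):
    """Percentage of identical residues over the columns of an alignment
    in which neither sequence has a gap."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    pairs = [(x, y) for x, y in zip(seq_a, seq_b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    return 100.0 * sum(x == y for x, y in pairs) / len(pairs)


def greedy_cluster(aligned, threshold=50.0):
    """Assign each sequence to the first cluster whose representative it
    matches at or above the identity threshold; otherwise start a new
    cluster with this sequence as its representative."""
    clusters = {}  # representative name -> list of member names
    for name, seq in aligned.items():
        for representative in clusters:
            if percent_identity(seq, aligned[representative]) >= threshold:
                clusters[representative].append(name)
                break
        else:
            clusters[name] = [name]
    return clusters


# Toy aligned sequences, invented for the example.
toy_alignment = {
    "query": "MKTAYIAKQR",
    "hit_1": "MKTAYIAKQK",
    "hit_2": "MSTPWLGHNE",
    "hit_3": "MSTPWLGHND",
}

clusters = greedy_cluster(toy_alignment, threshold=50.0)
candidates = list(clusters)  # one representative per putative subfamily
```

Picking one representative per cluster mirrors the selection of candidate enzymes from each hypothesized iso-functional group. The result of a greedy scheme depends on the input order, which is one reason why phylogenetic or genomic-context analyses are often used alongside it.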
In addition, some research groups have used refined criteria, such as the construction of a dendrogram or phylogenetic tree of the candidate proteins, to select enzymes with very little similarity to the biocatalysts used as parental sequences and thereby capture variability in substrate ranges.35 This selection can also be restricted to a subgroup of strains with particular features, such as thermophily, to find thermostable/thermoactive enzymes.10,24
Selected genes are then amplified from genomic DNA by PCR and cloned into expression vectors. The number of genes to be cloned depends on the availability of DNA from the original strain in research group or institution strain collections, or from commercial suppliers. If the DNA is not available, the gene can be artificially synthesized. Efficient production of heterologous proteins in the host organism can, however, be limited by the rarity of certain tRNAs that are abundant in the organisms from which the heterologous proteins are derived. Forced high-level expression of heterologous proteins can deplete the pool of rare tRNAs of the host organism and stall translation. Gene sequence optimization with synthetic genes, by re-assigning codon usage to that of the host organism, can increase the overproduction of heterologous enzymes. Nevertheless, this is currently not economically viable for providing a large number of genes, and the use of optimized expression host organisms for rare codons can be an issue. A number of technical improvements in molecular biology, including Ligation Independent Cloning (LIC), have permitted the development of efficient universal cloning well adapted to automated cloning processes in 96-well microplates.36 The development of cell-free protein synthesis systems, which avoid cloning, cell transformation, induction and culture steps, allows direct expression of enzymes in microtiter plates. This high-throughput method was used for the discovery of new omega-transaminases among a microbial community.37
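The principle of codon re-assignment can be illustrated with a small Python sketch that replaces each codon by a host-preferred synonymous codon while leaving the encoded protein unchanged. The `PREFERRED` table is a deliberately tiny, illustrative assumption covering only five amino acids; a real optimization would rely on a full codon usage table for the chosen host and would usually consider further constraints (GC content, mRNA structure, restriction sites). The toy gene is also invented for the example.

```python
# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

# Illustrative assumption: one preferred codon per amino acid for a
# hypothetical host. Replace with values from a real codon usage table.
PREFERRED = {"R": "CGT", "I": "ATT", "L": "CTG", "P": "CCG", "G": "GGT"}


def translate(dna):
    """Translate an in-frame coding sequence (stop codons shown as '*')."""
    dna = dna.upper()
    if len(dna) % 3:
        raise ValueError("sequence length must be a multiple of three")
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna), 3))


def recode(dna, preferred=PREFERRED):
    """Swap each codon for the host-preferred synonymous codon when one is
    listed; codons for unlisted amino acids are kept unchanged."""
    dna = dna.upper()
    if len(dna) % 3:
        raise ValueError("sequence length must be a multiple of three")
    recoded = []
    for i in range(0, len(dna), 3):
        codon = dna[i:i + 3]
        amino_acid = CODON_TABLE[codon]
        recoded.append(preferred.get(amino_acid, codon))
    return "".join(recoded)


# Toy gene: Met-Arg-Ile-Leu-Pro-Gly-stop.
toy_gene = "ATGAGAATACTACCCGGATAA"
optimized = recode(toy_gene)
```

Checking that `translate(optimized)` equals `translate(toy_gene)` is a cheap safeguard that the re-assignment only introduced synonymous changes.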
Among the many enzymes discovered thanks to this approach, we can highlight the work done on the nitrilase family.38 Database searches using the BLASTP program (protein query against protein database) and the multiple sequence alignment tool COBALT (Constraint-based Multiple Alignment Tool) enabled the discovery of new nitrilases in filamentous fungi.14 A similar approach gave access to a nitrilase toolbox for the organic chemist.39,40 Recently, Guérard-Hélaine et al. screened a large collection of aldolases. A liquid chromatography–mass spectrometry assay led to the discovery of new dihydroxyacetone (DHA) aldolases, wrongly annotated as transaldolases (Figure 1.4).41
By using the Vibrio fluvialis ω-transaminase (ω-TA) sequence, Kaulmann et al. recruited 15 new ω-TAs with low sequence identities (31–38%). Among them, the ω-TA from Chromobacterium violaceum was found useful for the stereoselective amination of a large substrate range, in particular for ketodiol conversion.42 Lavandera et al. found an exceptionally solvent-tolerant alcohol dehydrogenase from Paracoccus pantotrophus.43 First, they selected clones of Paracoccus pantotrophus DSM11072 for their oxido-reductive behavior. From the sequenced genomes, they identified a short-chain alcohol dehydrogenase gene. The corresponding enzyme stereoselectively reduced various ketones in organic co-solvents.
It should be noted that sequence identity searches can be done in different ways, depending on the requirements: queries with moderate to high sequence identity (>50%) will tend to provide enzymes with similar substrate specificity subtypes but different regio/stereoselectivities or catalytic turnovers. Recent reports for transaminases with 95% sequence identity and for nitrilases are good examples.44,45 As reported by Veselá et al., genes coding for putative nitrilases with moderate similarities (52–69%) to known nitrilases were selected by mining the GenBank database, synthesized artificially and expressed in Escherichia coli. Their substrate specificities were determined, which allowed classification of the enzymes according to their subtypes (aromatic nitrilase, arylacetonitrilase, aliphatic nitrilase, cyanide hydratase). Those substrate profiles were largely in accordance with those predicted from bioinformatic analysis (Table 1.1).45
Substrate profile of nitrilases and preparative hydrolysis of dinitriles.
Enzyme | Predicted substrate specificity subtype | Catalytic activity (U mg−1 of protein), substrate set 1 | Catalytic activity (U mg−1 of protein), substrate set 2 | Catalytic activity (U mg−1 of protein), substrate set 3 | Catalytic activity (U mg−1 of protein), substrate set 4 | Products (ratio, yield%)
---|---|---|---|---|---|---
NitAk1 | aromatic nitrilase | 0.3–39 | 0–2.6 | 0.4–0.7 | 0 |
NitAd | arylacetonitrilase | 0.03–2.9 | 2.6–64 | 16–43 | 0 |
NitAk2 | arylacetonitrilase | 0.02–3.9 | 1.6–40 | 7.1–34 | 0 |
NitMp | arylacetonitrilase | 0.08–9.9 | 9.4–241 | 59–159 | 3.5 |
On the other hand, biocatalysts with different substrate scopes are targeted in lower sequence identity queries, as shown by Furuya and Kino in their paper on the discovery of novel cytochromes P450.35
For a focus on a particular catalytic activity, some specific databases have been created. An internet resource dedicated to imine reductases (Imine Reductase Engineering Database, https://ired.biocatnet.de) was recently established by BLAST search with the amino acid sequence of the first characterized IRED reported from Streptomyces in 2011.46,47 In conjunction with selectivity and structural data, a more sophisticated sequence similarity network analysis helps to predict (R)- or (S)-selectivities from sequence alone.48 This database is now part of a bigger database – BioCatNet – (https://www.biocatnet.de) combining sequences, structures and experimental data on various protein families with the aim of facilitating protein engineering.49
In the preceding examples, queries were conducted with characterized enzymes and their related genes. When looking for enzymes catalyzing an uncommon reaction for which no gene has been identified, mining the metabolic databases with a search module based on reactant and product substructures can allow identification of the targeted enzymes. Thanks to this strategy, wild-type amine dehydrogenases were very recently discovered; it is worth noting that the preceding examples of such enzymatic reductive activity only dealt with engineered amino acid dehydrogenases.50
Additionally, industrially relevant biocatalysts were found. This is particularly the case for the arylesterases found by Wang et al.51 From 74 proteins screened for esterase activity, they identified three enzymes (RpEST-1, RpEST-2 and PpEST-3) active towards p-nitrophenyl esters and displaying the best combination of catalytic activity, thermal stability and solvent stability. Another example is the natural reductase from Candida glabrata discovered by Ma et al.52 From six enzymes screened for their reductive activity, CgKR1 from Candida glabrata exhibited very high activity towards methyl o-chlorobenzoylformate. This allowed the preparative synthesis of methyl (R)-o-chloromandelate (CMM), a precursor of the widely used platelet aggregation inhibitor clopidogrel, at 300 g L−1 scale (Scheme 1.1).
Preparative synthesis of (R)-o-chloromandelate. CgKR1: carbonyl reductase from Candida glabrata.
This sequence-comparison approach is a way of getting around the constraint of annotation, as proteins with false annotation or without predicted function are retrieved.41 This aspect was stressed by Zhu et al. in their report on the novel γ-lactamase from Bradyrhizobium japonicum USDA 6, previously unnamed, showing 49% identity to the query protein.53
The strength of this approach was illustrated by a bioinformatic strategy integrating the different clustering systems and set up to investigate a Pfam family. In the present case, it turned out that the discovered enzymes were not relevant for biotechnological purposes, but this integrated strategy could be applied to evaluate the biocatalytic potential of any family.54,55
1.3.3 Signature-/Key Motif-based Strategy
Rather than conducting the search for new enzymes using the primary sequence of a protein as a whole, one can focus on specific portions of this sequence. Proteins are classified into families based on the presence of important domains or conserved sequence features. These signatures are built using different computational approaches that usually use as a starting point a multiple sequence alignment of proteins sharing a set of characteristics. In InterPro, a database providing functional analysis of proteins, patterns, profiles, fingerprints and hidden Markov models (HMMs) from a number of different databases, are brought together into a single searchable resource, offering convenient access to their predictive capabilities (Figure 1.5).56
InterPro member databases grouped by signature construction method.
Moreover, specific sequence motifs/key residues can allow the unambiguous discrimination of the targeted family from the vast number of other sequences within a superfamily, as illustrated for halohydrin dehalogenases belonging to the short-chain dehydrogenase/reductase (SDR) superfamily.57 Therefore, protein signatures can be very relevant for the discovery of new enzymes. A few groups have employed a sequence pattern search, which proved quite fruitful. In a seminal paper of 2005, Fraaije et al. identified novel Baeyer–Villiger monooxygenases (BVMOs) thanks to the protein sequence motif [FXGXXXHXXXW(DP)] described earlier by the same group. PAMO, a thermostable monooxygenase from Thermobifida fusca, was found to be highly active towards phenylacetone (kcat/Km = 32 000 M−1 s−1). The authors pointed to the difficulty of predicting the substrate specificity of an enzyme on the basis of its sequence; indeed, despite its high sequence identity (53%) with steroid monooxygenase, PAMO has no catalytic activity towards progesterone.58 More recently, Wetzl et al. used an HMM of the C-terminal domain of six known IREDs to find new ones of bacterial origin. After clustering of the protein sequences matching this HMM, which is hypothesized to be responsible for the catalytic properties, enzymes representative of the IRED sequence space were selected and tested. Interestingly, relationships between stereochemistry, substrate structure and clustering were observed, as illustrated with IR-10/12 and 14 in Figure 1.6.59 Two very recently identified IRED-specific motifs, the cofactor binding motif GLGxMGx5[ATS]x4Gx4[VIL]WNR[TS]x2[KR] and the active site motif Gx[DE]x[GDA]x[APS]x3{K}x[ASL]x[LMVIAG], should help to discover many more IREDs.60
Examples of IRED discovered by consensus C-terminal domain search.
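Motifs written in the shorthand used above can be turned into regular expressions and scanned against protein sequences. The Python sketch below is a minimal illustration under stated assumptions about the notation: `x` or `X` stands for any residue, `xN` for N arbitrary residues, `[...]` for a set of allowed residues, `{...}` for a set of excluded residues and `(DP)` for "D or P". The function name and the toy sequence are invented for the example, and this is not the search procedure of the cited studies.

```python
import re

# One token of the motif shorthand: [allowed], {excluded}, (alternatives),
# x / xN wildcard, or a literal residue letter.
TOKEN = re.compile(
    r"\[([A-Z]+)\]|\{([A-Z]+)\}|\(([A-Z]+)\)|[xX](\d*)|([A-Z])"
)


def motif_to_regex(motif):
    """Convert a motif such as 'Gx[DE]x3{K}' into a regular expression."""
    parts = []
    pos = 0
    while pos < len(motif):
        match = TOKEN.match(motif, pos)
        if match is None:
            raise ValueError(f"cannot parse motif at position {pos}")
        allowed, excluded, alternatives, count, literal = match.groups()
        if allowed:
            parts.append(f"[{allowed}]")
        elif excluded:
            parts.append(f"[^{excluded}]")
        elif alternatives:
            parts.append(f"[{alternatives}]")
        elif literal:
            parts.append(literal)
        else:
            # Wildcard: a single residue, or N residues when a count is given.
            parts.append(f".{{{count}}}" if count else ".")
        pos = match.end()
    return "".join(parts)


bvmo_regex = motif_to_regex("FXGXXXHXXXW(DP)")
ired_regex = motif_to_regex("Gx[DE]x[GDA]x[APS]x3{K}x[ASL]x[LMVIAG]")

# Toy sequence, invented for the example, containing one BVMO-like match.
toy_sequence = "MAAFAGTWYHNRYWDLLK"
hit = re.search(bvmo_regex, toy_sequence)
```

In this toy case `hit.start()` gives the zero-based position of the motif in the sequence. Scanning a whole database amounts to applying the compiled pattern to every entry, after which the hits still need the clustering and experimental validation described in the text.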
Similarly, the first discovery of an Fe-type nitrile hydratase by this strategy was recently achieved using a conserved motif in the alpha-subunit as probe. This motif – KNVIVCSLCSCTAWPILGLPPTWYKSFEYRARVVREPR – containing the iron-binding motif CSLCSC, was selected after sequence alignment of all characterized Fe-type NHases. The nitrile hydratase from Pseudomonas putida F1 showed efficient catalytic properties on small aliphatic nitriles but also on some aromatic nitriles.61
Arylmalonate decarboxylases (AMDases, EC 4.1.1.76) are very rare, and their ability to decarboxylate α-disubstituted malonic acid derivatives to optically pure products without cofactors makes them attractive and promising candidates for use as biocatalysts in industrial processes. In a comprehensive survey, through the development of a search algorithm, Maimanakos et al. identified sequence patterns in AMDases, allowing them to discover 58 new AMDases from genomes and metagenomes (Scheme 1.2).62
Arylmalonate decarboxylase reaction. Conversion of 2-aryl-2-methyl malonate to (R)-arylpropionic acid, illustrated by naproxen. AMDase: arylmalonate decarboxylase.
An interesting example in the field of enzymes with industrial potential is the carboxylic acid reductase (CAR) from Mycobacterium marinum discovered by Akhtar et al. This protein was selected because it holds three consensus sequences characteristic of a previously characterized CAR enzyme: (i) an ATP domain, (ii) a phosphopantetheine attachment site (LGGXSXXA) and (iii) a Rossmann fold for NADPH binding. It was found to convert a wide range of aliphatic fatty acids (C6–C18) into the corresponding aldehydes, making it a useful catalyst for the synthesis of fatty acid-derived chemical commodities.63
In the next few years, this approach relying on small portions of the sequence directly associated with the enzymatic activity (co-factor binding sites, binding residues, catalytic sites, etc.) is expected to develop thanks to the growing number of experimentally validated sequences, which should lead to the discovery of new motifs/signatures inside enzyme families.
1.4 3D Structure-guided Approach
Although very substantial, the sole analysis of the primary sequence for predicting the functions of genes generally leads to the discovery of enzymes belonging to already known families. The Brookhaven Protein Data Bank (RCSB PDB, http://www.rcsb.org/pdb/home/home.do) indexes the 3D structural data of biological macromolecules, mainly proteins. With nearly 115 000 protein entries, it only represents a tiny portion of the sequences listed in the protein databases, but constitutes a great resource for the discovery of novel enzymes. Thanks to various structural genomics initiatives, the number of protein structures is steadily increasing; however, a significant proportion of them are annotated as proteins of unknown function.64 For two decades, much effort has been devoted to developing bioinformatic tools to predict the function of a protein from its 3D structure. In their pioneering work, the Thornton group proposed a methodology based on the recognition of 3D active site templates.65–67 The Catalytic Site Atlas (CSA), a database providing catalytic residue annotation for enzymes in the Protein Data Bank, was built.68,69 Other approaches, also based on the conservation in space of residues associated with the catalytic function of the protein, have been developed by different groups.70–72
1.4.1 Exploring 3D Structures of Proteins
Besides the common sequence-based approach, a structure-guided approach can be undertaken for enzymes belonging to families or sub-families for which at least one member has a solved structure. In this way, four structures of unknown function belonging to the cluster of “ornithine-aminotransferase (OAT)-like proteins” were identified as amine transaminases (ATA) and biochemically characterized by Steffen-Munsberg et al.73 Two of the four enzymes were found to be rather promiscuous, with moderate to excellent activities towards structurally diverse substrates associated with good to total stereoselectivity (Scheme 1.3).
Conversion and enantiomeric excess values of the amines obtained by asymmetric synthesis from the corresponding ketones. ATA: amine transaminase, LDH: lactate dehydrogenase.
Exploration of both structural and sequence databases allowed the same group to report the identification of so-far unknown (R)-stereoselective transaminases.74 Careful analysis of the 3D structures of enzymes belonging to fold IV PLP-enzymes led to the identification and prediction of key motifs for stereoselectivity. A sequence-based algorithm was developed and used to search protein databases for enzymes carrying these key motifs. Seventeen enzymes were found to have the desired features.
1.4.2 Active Site Topology/Constellation-guided Strategy
Focusing on highly conserved regions associated with catalysis has led to the development of protein function annotation algorithms that specifically focus on matching catalytic residue geometries.75,76 As in de novo computational enzyme design (see Chapter 4), the first step consists of identifying the catalytic residues, cofactor and substrate binding residues and their relative spatial positions forming the minimal catalytic active site constellation.77 This template can then be applied in searches in structural databases (Figure 1.7).
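The idea of searching structures with a minimal active site constellation can be sketched with a purely geometric toy model in Python. Each residue is reduced to a type and a single point in space, and a candidate is accepted when the pairwise distances between its residues reproduce those of the template within a tolerance. The three-residue template, the coordinates, the 1.0 Å tolerance and the function name are all assumptions made for illustration; published template-matching methods work on selected atoms, often with superposition, and are considerably more elaborate.

```python
from itertools import permutations
from math import dist


def match_constellation(template, residues, tolerance=1.0):
    """Find residue combinations that reproduce the template geometry.

    template, residues: lists of (residue_type, (x, y, z)) tuples.
    Returns a list of index tuples into `residues`, ordered like the
    template entries.
    """
    size = len(template)
    hits = []
    for combo in permutations(range(len(residues)), size):
        # Residue types must correspond one to one with the template.
        if any(residues[i][0] != template[k][0] for k, i in enumerate(combo)):
            continue
        # Every pairwise distance must agree within the tolerance.
        geometry_ok = all(
            abs(
                dist(residues[combo[a]][1], residues[combo[b]][1])
                - dist(template[a][1], template[b][1])
            ) <= tolerance
            for a in range(size)
            for b in range(a + 1, size)
        )
        if geometry_ok:
            hits.append(combo)
    return hits


# Hypothetical three-residue template (coordinates in angstroms, invented).
toy_template = [
    ("SER", (0.0, 0.0, 0.0)),
    ("HIS", (3.0, 0.0, 0.0)),
    ("ASP", (3.0, 4.0, 0.0)),
]

# Hypothetical candidate structure reduced to one point per residue.
toy_structure = [
    ("GLY", (10.0, 10.0, 10.0)),
    ("SER", (1.0, 1.0, 1.0)),
    ("HIS", (1.0, 4.0, 1.0)),
    ("ASP", (5.0, 4.0, 1.0)),
    ("ASP", (20.0, 0.0, 0.0)),
]

matches = match_constellation(toy_template, toy_structure)
```

Because distances are invariant to rotation and translation, no superposition is needed in this toy version, but distances alone cannot distinguish a constellation from its mirror image. The exhaustive enumeration is also only practical for small templates; real searches prune candidates by residue type and distance before comparing geometries.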
This method allowed Steinkellner et al. to identify two promiscuous ene-reductases (Old Yellow Enzyme, OYE). These two enzymes, PhENR from Pyrococcus horikoshii and TtENR from Thermus thermophilus, have completely different sequence and fold compared to typical OYEs. Remarkably, an inverted stereopreference was predicted and experimentally confirmed (Scheme 1.4). As outlined by the authors, one of the benefits of the approach lies in the fact that all the hits are proteins stable enough to have their structure determined.78
Conversion and enantiomeric excess values of the compounds obtained by asymmetric reduction from the corresponding enones. PhENR: ene-reductase from Pyrococcus horikoshii, TtENR: ene-reductase from Thermus thermophilus.
Also based on active site topology, another approach focuses on mechanistic and structural characteristics. Partial reactions, reaction intermediates or transition states can be used to identify enzymes sharing a common mechanistic attribute within a protein family.79–81 Thus, by modeling high-energy intermediates that mimic the transition state, Hermann et al. established the function of Tm0936, an enzyme of unknown function from Thermotoga maritima belonging to the amidohydrolase superfamily, from its structure.82 Homologs of Tm0936 were later identified by docking putative substrates to modeled enzyme structures.83 Recently, using such a mechanism-guided approach focusing on stabilization of the transition state, Kürten et al. showed that an esterase with the α/β-fold displayed reaction promiscuity and exhibited amidase activity.84
All these methodologies for the functional assignment of proteins of unknown function have been developed for a better understanding of metabolism and enzymology from a fundamental perspective. Until now, they have hardly been used in programs for the identification of enzymes with biotechnological applications (see Chapter 2), but they undoubtedly have high potential.
1.5 Conclusion
Owing to the tremendous microbial biodiversity, the various genome mining approaches described in this chapter emphasize the great potential of nature as a reservoir of biocatalysts. The large number of available sequences can be seen as a technical obstacle to the discovery of the desired catalytic activity, but rational genome mining (i.e. combining traditional genome sequence comparison or annotation work with other approaches, such as identification of key motifs, functional analysis of the genetic organization of the putative targeted gene, and 3D structure-guided strategies) can optimize the screening effort.
Besides the powerful protein engineering methods based on modification of already characterized enzymes (see Chapter 7), mining genomes remains an effective and complementary way to access new biocatalysts. In protein engineering studies, parental sequences generally govern the overall accessible sequence space, while novel sequences can offer high variability and exhibit characteristics that are difficult to access by laboratory evolution. In addition, identification of new enzymes with at least some marginal activities quickly provides the most efficient starting point for enzyme improvements by protein engineering, which generally remains necessary to obtain industrial biocatalysts.85
In 1976, Jensen proposed that ancient enzymes were characterized by broad substrate and reaction scope (“generalist enzymes”) and that natural enzyme evolution picked up and fine-tuned these different activities to generate contemporary enzymes with specific catalytic functions.86 In addition, it seems that, for an enzyme to acquire a new function, a specialized enzyme is first de-specialized by natural mutations into a generalist (promiscuous) enzyme before being re-specialized to the new function.8,87 Nowadays, inspired by natural enzyme evolution, the identification of all-rounder frequent hit enzymes as generalist enzymes is sought as a framework for protein engineering.88–90
It is highly probable that many valuable enzymatic activities have still to be found among wild-type enzymes resulting from billions of years of natural evolution. To explore this reservoir of uncharacterized enzymes, new ways to make use of nature's richness are required. Within this framework, modern bioinformatic tools, so far barely used in biocatalysis, will certainly be of great benefit in the near future for the discovery of new biocatalysts. Examples include antiSMASH, for the assignment of enzyme functions among biosynthetic gene clusters identified by automatic genomic programs, and methods exploiting the conservation of types of chemical transformations among metabolic networks, which enables capture of relevant metabolic contexts and thus assignment of potential enzymatic function.91,92 Moreover, the profiling of orphan enzymes will provide more activity-attributed sequences, thereby multiplying the reference sequences available for the discovery of novel biocatalysts by sequence comparison approaches.