Ivano Bertini*(a,b) and Gabriele Cavallaro(a)
(a) Magnetic Resonance Center (CERM) – University of Florence, Via L. Sacconi 6, 50019 Sesto Fiorentino, Italy. E-mail: bertini@cerm.unifi.it; Fax: +39 055 4574271; Tel: +39 055 4574272
(b) Department of Chemistry – University of Florence, Via della Lastruccia 3, 50019 Sesto Fiorentino, Italy
First published on 29th September 2009
Bioinformatics is a central discipline in modern life sciences aimed at describing the complex properties of living organisms starting from large-scale data sets of cellular constituents such as genes and proteins. In order for this wealth of information to provide useful biological knowledge, databases and software tools for data collection, analysis and interpretation need to be developed. In this paper, we review recent advances in the design and implementation of bioinformatics resources devoted to the study of metals in biological systems, a research field traditionally at the heart of bioinorganic chemistry. We show how metalloproteomes can be extracted from genome sequences, how structural properties can be related to function, how databases can be implemented, and how hints on interactions can be obtained from bioinformatics.
Ivano Bertini is Professor of General and Inorganic Chemistry at the University of Florence and is Director of the Magnetic Resonance Center (CERM). He has received several honors, among which are three Laurea Honoris Causa (from the universities of Stockholm, Ioannina and Siena). He is a member of the Academia Europaea and the Italian Accademia dei Lincei, and is, or has been, on the editorial staff or advisory board of over 20 of the most authoritative chemistry and biochemistry journals. Since 1975 he has studied the structure–function relationships of metalloproteins through biophysical methods. In 1990, he created an NMR lab for structural biology of metalloproteins, and eventually pioneered the exploitation of genome data banks. He has pursued advancements in technology for solution structure determination and developed specific software applications. He has also established a molecular biology department for high-throughput protein expression in structural genomics projects on metalloproteins. He has published over 600 research articles and has solved more than 100 protein structures. In 1999 he founded the CERM in an independent and prestigious building hosting an impressive battery of NMR spectrometers. The Center constitutes a major NMR infrastructure in the Life Sciences.
Gabriele Cavallaro was born in Florence in 1973. He graduated in chemistry from the University of Florence (Italy), where he received his PhD in structural biology in 2004. He is now a postdoctoral fellow at the Magnetic Resonance Center (CERM) in Florence. His research interests include several areas of computational biology ranging from molecular dynamics simulations to protein structure calculation. At present, he is mainly involved in the development of bioinformatics tools for sequence and structure-based analysis of metalloproteins, and is collaborating with the European Bioinformatics Institute (EBI) to annotate and classify metal-binding sites in proteins.
In this review, we describe some bioinformatics approaches tailored for application in the various stages of the study of metals in biological systems, as outlined above. We focus on metalloproteins rather than on metal species in general, because data sets of metalloproteins are most suitable to be integrated with other types of biological information, which are mostly centred on proteins. First, we illustrate how bioinformatics can be used to identify, on the basis of sequence alone, which proteins of a given organism are metalloproteins, thus providing a prediction of its metalloproteome. Second, we show how bioinformatics can exploit the available protein structure information for more reliable identification and classification, including functional inference, of metalloproteins. Third, we describe existing and potential bioinformatics resources designed to contain and organize chemical and biological information on metalloproteins. Finally, we discuss how metalloproteomics data can be combined with other genome-scale data sets into models of various complexity, with the ultimate perspective of obtaining an overall picture of metals in living organisms.
In the last few years, our group has developed a bioinformatics method, applicable on a whole-genome scale, to predict the metal-binding properties of a protein from its amino acid sequence.16–20 The method, which is schematically depicted in Fig. 1, is based on the recognition of two different signatures diagnostic for metal binding, which are used in combination. Specifically, the protein sequence under consideration is analysed for the presence of (i) metal-binding domains, and (ii) metal-binding patterns, as detailed below.
Fig. 1 Scheme depicting our method for the identification of metalloproteins from amino acid sequences (Me = any metal). Primary information sources for protein structures (PDB) and domains (Pfam) are examined by a mixture of manual and automated analysis to build libraries of metal-binding patterns (in the form of regular expressions) and metal-binding domains (in the form of profile HMMs). On average, about 80% of the latter have a pattern associated with them. The complete predicted proteome of an organism is then automatically scanned for protein sequences containing metal-binding domains (using HMMER) or metal-binding patterns (using flexible pattern matching). Protein sequences containing a metal-binding domain but lacking the associated pattern (if available) are filtered out. The relative proportion of metalloproteins detected on average by the identification of both domains and patterns (about 75%), of domains only (about 20%), and of patterns only (about 5%) is shown by the size of each circle.
The identification of conserved domains is a well-established approach for the classification of protein sequences, which are scanned for matches against libraries of known domains.21 A widely used resource for this purpose is the Pfam database,22 a comprehensive collection of protein domains represented as profile hidden Markov models (HMMs)23 which can be analysed for similarity to a query sequence using the HMMER software package.24 The assignment of the query sequence to a specific domain depends on whether the significance of the match reported by HMMER, usually expressed as an E-value, passes an appropriate threshold (e.g., E < 0.001), with lower E-values indicating more significant matches. In order for this approach to be used for predicting whether a protein binds a given metal, therefore, it is necessary to construct a library of all Pfam domains that bind that metal. In our method, such metal-specific libraries are constructed by querying Pfam for those domains whose annotation contains the name and/or the symbol of the metal, and then checking the literature to filter out incorrectly extracted domains. Additionally, domains that have been structurally characterized as metal-binding but are not annotated as such in Pfam (and thus cannot be detected as above) are also included. These domains are identified by retrieving from the Protein Data Bank (PDB) all protein structures that bind the metal of interest, and then using HMMER to compare their sequences against the Pfam database. Pfam domains recovered by this procedure typically represent a small, but not negligible fraction of the metal-specific library: in the case of zinc-binding domains, for example, they were found to be 25 out of a total of 314 (i.e., approximately 8%).18 It is important to note that the collection of metal-binding PDB structures also requires analysis of the literature, so as to discard the proteins bound to non-physiological metals (e.g., due to adventitious binding during crystallization or purification procedures).
Therefore, when the approach is applied for the first time to a given metal, the construction of the domain library turns out to be the most time-consuming stage of the entire protocol, requiring manual curation by experts in bioinorganic chemistry. In subsequent implementations, on the other hand, libraries just need to be updated to reflect changes in Pfam and in the PDB, and much less human intervention is needed. It is worth noting that in some cases literature analysis may not lead to definite conclusions about the metal-binding properties of a given domain, particularly in the absence of relevant structural information. For example, biochemical studies25 indicated that CHCH (standing for coiled coil-helix-coiled coil-helix) domains can bind copper by two conserved Cys residues which, however, appear to serve a purely structural role in a recently determined CHCH structure,26 suggesting that CHCH domains may not generally bind copper. Therefore, updates of domain libraries may involve not only the addition of previously uncharacterized metal-binding domains, but also the removal of domains whose metal-binding properties are proven incorrect by new evidence.
With a library of metal-binding Pfam domains available, HMMER can be used in a systematic manner on large protein sequence data sets, such as the complete predicted proteome of an organism, to select those containing at least one such domain. The selected proteins would thus represent the estimated metalloproteome of the organism. However, there is a major problem in predicting the metal-binding capability of a protein based solely on the fact that it contains a domain which is commonly associated with a certain metal. In fact, since this capability depends on the presence in the domain of a metal-binding site formed by a (small) number of residues in the proper spatial arrangement, evolutionary changes affecting these residues (e.g., non-conservative mutations) may cause the domain to lose its metal-binding properties while retaining its overall structure and sequence features. This is the case, for instance, of certain members of the ADAMs family of proteins27 which, despite containing a conserved protease domain implicated in zinc-dependent catalysis, lack the His residues responsible for zinc binding and are thus unlikely to be proteolytically active.28 Therefore, using the occurrence of metal-binding domains as the only criterion to identify metalloproteins may lead to a substantial rate of false positives, i.e., of proteins incorrectly predicted to be metalloproteins. For this reason, it is necessary to supplement this criterion with additional requirements, which can reduce the number of false positives. In our method, such additional restraints are provided by metal-binding patterns.
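The domain-based selection step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the hit records, the protein identifiers and the domain names (`DomA`, `DomB`, `DomX`) are hypothetical placeholders, and the hits are assumed to have already been parsed from the output of a profile-HMM search into simple tuples.

```python
from collections import defaultdict


def select_by_domain(hits, metal_library, max_evalue=1e-3):
    """Return {protein_id: set of metal-binding domains detected}.

    hits          -- iterable of (protein_id, domain_name, e_value) tuples
                     (hypothetical, pre-parsed search results)
    metal_library -- set of domain names curated as binding the metal
    max_evalue    -- E-value threshold; lower E-values are more significant
    """
    selected = defaultdict(set)
    for protein_id, domain, e_value in hits:
        # Keep a hit only if the domain belongs to the curated library
        # and the match is significant enough.
        if domain in metal_library and e_value < max_evalue:
            selected[protein_id].add(domain)
    return dict(selected)


# Hypothetical example data (not real Pfam identifiers or real proteins).
hits = [
    ("P1", "DomA", 1e-20),  # significant hit to a library domain
    ("P1", "DomX", 1e-5),   # significant, but DomX is not in the library
    ("P2", "DomA", 0.5),    # library domain, but not significant
    ("P3", "DomB", 1e-4),   # significant hit to a library domain
]
zinc_library = {"DomA", "DomB"}

candidates = select_by_domain(hits, zinc_library)
print(candidates)
```

Proteins selected in this way are only candidates: as discussed in the text, a domain hit alone does not guarantee that the metal-binding residues are conserved.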
The identification of conserved patterns is another common, and possibly the simplest, approach for the classification of protein sequences. It makes use of regular expressions to represent motifs of amino acid residues which correspond to functionally important, and thus highly conserved, regions of proteins, such as catalytic or binding sites. Regular expressions specify which residues may or may not occur at each position of the pattern: the motif characterizing the zinc-binding members of the abovementioned ADAMs family of proteins, for instance, is represented as HEXXHXXGXXH, where amino acids are given as one-letter codes, and X indicates any amino acid.29 Simple string matching algorithms can then be used to determine whether a protein sequence contains a given pattern or not. Similarly to what is described above for metal-binding domains, constructing a library of metal-binding patterns is the first step to take to exploit this approach in the prediction of metalloproteins. In our method, this library is built starting from the PDB, taking advantage of the curated selection of metal-binding structures performed to enrich the library of metal-binding domains (see above). Metal-binding patterns are automatically derived from PDB structures by mapping the metal ligands onto the protein sequence, and are expressed in the general form AXnBXmC…, where A, B, C… are the metal-binding amino acids and n, m… are the number of amino acids in between two subsequent ligands. Importantly, the extraction of metal-binding patterns from the same PDB structures that were assigned to Pfam domains allows us to associate each pattern with the domain in which it is found. As a result, all the metal-binding domains for which at least one representative with known structure is available have one metal-binding pattern (or more, if different patterns are found in different structures of the same domain) associated with them. 
With this information, it is then possible to filter the metalloproteins predicted on the basis of metal-binding domains by requiring that the domains identified by HMMER contain at least one of their associated metal-binding patterns. When performing pattern matching, we allow the distance in sequence between two subsequent metal ligands to vary within ±20% (or ±1 amino acid for distances less than five residues) to take into account the observation that there is little evolutionary pressure to maintain a specific sequence spacing between two metal ligands when the spacing is large (i.e., of the order of tens of residues).16 The addition of the pattern-based selection criterion dramatically improves the quality of the predicted metalloproteomes, by eliminating false positives such as the abovementioned ADAM proteins lacking the zinc-binding site. In the case of non-heme iron, for instance, the precision of the results was improved by 30% with respect to the sole domain-based prediction.19
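As a rough illustration of flexible pattern matching, the sketch below converts a pattern written in the AXnBXmC… form into a regular expression whose spacers are allowed to vary by ±20% (or ±1 residue for spacings shorter than five). The function names are hypothetical, and the way the ±20% window is rounded to whole residues is an assumption made here, since the text does not specify it. The pattern used in the example is the CX2CX11CX13H pattern mentioned in Fig. 2.

```python
import re


def spacing_range(n):
    """Allowed spacer lengths (min, max) for a nominal spacing of n residues."""
    if n < 5:
        return max(0, n - 1), n + 1
    # +/-20%, rounded inward to whole residues with integer arithmetic
    # (the rounding rule is an assumption of this sketch).
    low = (4 * n + 4) // 5   # ceil(0.8 * n)
    high = (6 * n) // 5      # floor(1.2 * n)
    return low, high


def pattern_to_regex(pattern):
    """Turn e.g. 'CX2CX11CX13H' into a compiled regex with flexible spacers."""
    # Each token is a ligand letter optionally followed by 'X<number>'.
    tokens = re.findall(r"([A-WYZ])(?:X(\d+))?", pattern)
    parts = []
    for index, (ligand, spacing) in enumerate(tokens):
        parts.append(ligand)
        if index < len(tokens) - 1:
            low, high = spacing_range(int(spacing or 0))
            parts.append(".{%d,%d}" % (low, high))
    return re.compile("".join(parts))


zinc_regex = pattern_to_regex("CX2CX11CX13H")

# Toy sequences: the first has spacings 2, 12, 13 (within tolerance),
# the second has spacings 2, 20, 13 (outside tolerance).
matching = "MA" + "C" + "GG" + "C" + "G" * 12 + "C" + "G" * 13 + "H" + "K"
non_matching = "MA" + "C" + "GG" + "C" + "G" * 20 + "C" + "G" * 13 + "H" + "K"

print(zinc_regex.pattern)
print(bool(zinc_regex.search(matching)), bool(zinc_regex.search(non_matching)))
```

In a full pipeline the search would be restricted to the sequence region aligned to the associated domain; that restriction is left out here for brevity.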
In our method, metal-binding patterns are used not only to refine the results obtained from the recognition of metal-binding domains, but also to identify possible further metalloproteins whose sequences match one of the patterns but which do not contain any known Pfam domain, and thus cannot be detected by domain searches. The rationale behind this supplementary search is that the same metal-binding site can occur as a common element in a broad variety of proteins (e.g., due to convergent evolution),30 including proteins whose structure (and thus sequence) is unrelated to any known domain. On the other hand, however, the occurrence of a metal-binding pattern in the sequence of a protein whose structure is different from that from which the pattern was extracted may not correspond to an actual metal-binding site in three-dimensional space (see Fig. 2 for example). Therefore, simply looking for metal-binding patterns in protein sequences with no similarity to known domains can generate a high proportion of false positives. To discern such false positives, we use a parameter which accounts for the degree of local sequence similarity around the metal-binding pattern between the known metalloprotein (from which the pattern was extracted) and that predicted (where the pattern was retrieved), which are aligned using the PHI-BLAST program.31 The presence of local sequence similarity is in fact indicative of local structural similarity, implying similar conformation for the metal ligands and thus formation of the metal-binding site.32 This parameter is called IdGlobal, and is defined as the ratio between the number of amino acids aligned by PHI-BLAST and the entire sequence length of the known metalloprotein. 
When IdGlobal is higher than 0.2, metalloproteins are predicted with a level of confidence of over 99%, dropping to about 50% when IdGlobal is between 0.2 and 0.1, and to about 25% when it is below 0.1.16 In genome-wide predictions of metalloproteins, we select sequences with IdGlobal > 0.2, typically contributing an additional 5% to the size of the predicted metalloproteomes.19
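The IdGlobal criterion can be written down as a short sketch. The function names are hypothetical, the alignment length is assumed to be already available from a PHI-BLAST run, and the handling of values falling exactly on 0.1 or 0.2 is an assumption, as the text does not state it.

```python
def id_global(aligned_residues, known_protein_length):
    """Ratio of aligned residues to the full length of the known metalloprotein."""
    if known_protein_length <= 0:
        raise ValueError("known_protein_length must be positive")
    return aligned_residues / known_protein_length


def confidence_tier(value):
    """Approximate confidence levels quoted in the text for IdGlobal ranges."""
    if value > 0.2:
        return "over 99%"
    if value >= 0.1:  # boundary handling is an assumption of this sketch
        return "about 50%"
    return "about 25%"


def accept(aligned_residues, known_protein_length, threshold=0.2):
    """Selection rule used for genome-wide predictions: IdGlobal > 0.2."""
    return id_global(aligned_residues, known_protein_length) > threshold


# Hypothetical alignments against a known metalloprotein of 200 residues.
for aligned in (60, 30, 10):
    value = id_global(aligned, 200)
    print(aligned, value, confidence_tier(value), accept(aligned, 200))
```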
Fig. 2 Example of a metal-binding pattern which does not correspond to an actual metal-binding site in a protein whose structure is different from that from which the pattern was extracted. The CX2CX11CX13H pattern is associated with a zinc-binding site in the RING finger protein Rbx1 (PDB code 1ldj,129 left), but is not so in the Herpes virus entry mediator HveA (PDB code 1jma,130 right).
To summarize, at the conclusion of the above procedure a prediction of the metalloproteome of an organism is obtained which consists of three subsets: one includes metalloproteins predicted by the identification of both a metal-binding domain and a metal-binding pattern, one includes metalloproteins predicted by the identification of only a metal-binding domain, and one includes metalloproteins predicted by the identification of only a metal-binding pattern. The first subset has the highest degree of confidence, fulfilling both criteria used to recognize metal-binding capabilities. The second subset follows from the fact that not all Pfam domains have a representative structure in the PDB, precluding the possibility of defining metal-binding patterns to be used as a selection criterion. From what is described above, this subset (whose size will decrease with time as more structures become available) is expected to overestimate the number of genuine metalloproteins. The size of this overestimation can be evaluated by assuming that the fraction of false positives in this subset is the same as that observed for the pattern-filtered subset. On this assumption, it can be approximated as the product of the percentage of proteins removed by the pattern filter and the percentage contribution of the non-filtered subset to the total metalloproteome. As shown in Table 1, such percentages are different for different metals, but the overestimation is expected to be about 5% or less in all cases. Table 1 also reports the estimated loss of genuine metalloproteins which would result from entirely discarding the non-filtered subset. Although this loss can be as high as 27%, such a conservative choice may be more appropriate for applications in which only high-confidence predictions are needed.
Finally, the third subset follows from the fact that the space of protein sequences is not yet fully covered by Pfam domains,33 but this limitation can be overcome, at least in part, by attempting to recognize sequence regions (i.e., those corresponding to the metal-binding site) which are smaller than whole domains. The adoption of the IdGlobal parameter ensures a confidence level close to 100% for this subset, although the choice of a discrimination threshold necessarily involves discarding a number of genuine metalloproteins. The impact of this underestimation on the total metalloproteome is expected to be less than 2%, given that the fraction of true metalloproteins when IdGlobal is lower than the threshold is about 37%,16 and this subset, as mentioned above, represents around 5% of the predicted metalloproteome.
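The estimates discussed above amount to simple products of fractions, which the sketch below reproduces using the percentages reported in Table 1. The overestimation formula is the one stated in the text; the underestimation formula (the domain-only fraction multiplied by the fraction that would pass the filter) is an inference made here because it reproduces the Table 1 values to within rounding, and the last line is one reading of how the "less than 2%" figure for the pattern-only subset can be obtained.

```python
# metal: (fraction of domain-containing proteins removed by the pattern filter,
#         fraction of the metalloproteome predicted from domains only)
table1 = {
    "Zinc": (0.40, 0.10),
    "Iron": (0.30, 0.15),
    "Copper": (0.10, 0.30),
}


def overestimation(filtered_out, domain_only):
    """Expected false positives contributed by the domain-only subset."""
    return filtered_out * domain_only


def underestimation(filtered_out, domain_only):
    """Expected genuine metalloproteins lost if the domain-only subset is dropped.

    Inferred formula (an assumption of this sketch, not stated in the text).
    """
    return (1.0 - filtered_out) * domain_only


for metal, (filtered_out, domain_only) in table1.items():
    over = overestimation(filtered_out, domain_only)
    under = underestimation(filtered_out, domain_only)
    print(f"{metal}: overestimation {over:.3f}, underestimation {under:.3f}")

# Pattern-only subset: about 37% of sub-threshold hits are genuine, and the
# subset is about 5% of the predicted metalloproteome.
pattern_only_loss = 0.37 * 0.05
print(f"pattern-only underestimation: {pattern_only_loss:.4f}")
```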
Metal | Percentage of proteins containing a metal-binding domain that were filtered out because they lack a metal-binding pattern | Percentage of metalloproteins predicted by the identification of metal-binding domains only | Metalloproteome overestimation due to the inclusion of metalloproteins predicted by the identification of metal-binding domains only | Metalloproteome underestimation due to the exclusion of metalloproteins predicted by the identification of metal-binding domains only |
---|---|---|---|---|
Zinc | 40% | 10% | 4% | 6% |
Iron | 30% | 15% | 5% | 11% |
Copper | 10% | 30% | 3% | 27% |
Resource | Type of resource | Web address |
---|---|---|
CATH | Database of hierarchically classified protein structures | http://www.cathdb.info/ |
DALI | Web server for protein structure comparison | http://ekhidna.biocenter.helsinki.fi/dali_server/ |
DIP | Database of experimentally determined protein–protein interactions | http://dip.doe-mbi.ucla.edu/dip/Main.cgi |
HMMER | Tool for identifying protein sequence similarity based on profile HMMs | http://hmmer.janelia.org/ |
IntAct | Database of known and predicted protein–protein interactions | http://www.ebi.ac.uk/intact/main.xhtml |
Integr8 | Database of integrated information about deciphered genomes and their corresponding proteomes | http://www.ebi.ac.uk/integr8/EBI-Integr8-HomePage.do |
InterPro | Integrated documentation resource for protein families, domains, regions and sites | http://www.ebi.ac.uk/interpro/ |
JESS | Tool for searching structural motifs in protein structures | Available from the authors64 |
MDB | Database of metal-binding sites in protein structures | http://metallo.scripps.edu/ |
Metal-MACiE | Database of information on the catalytic mechanisms of metal-dependent enzymes | http://www.ebi.ac.uk/thornton-srv/databases/Metal_MACiE/home.html |
MINT | Database of experimentally determined protein–protein interactions | http://mint.bio.uniroma2.it/mint/Welcome.do |
PDB | Database of experimentally determined structures of proteins, nucleic acids and complex assemblies | http://www.rcsb.org/pdb/home/home.do |
PDBeMotif | Integrated database and analysis tool of structural motifs in protein structures | http://www.ebi.ac.uk/pdbe-site/pdbemotif/ |
PDBSiteScan | Web server for searching structural motifs in protein structures | http://wwwmgs.bionet.nsc.ru/mgs/gnw/pdbsitescan/ |
Pfam | Database of protein domains, represented by multiple sequence alignments and HMMs | http://pfam.sanger.ac.uk/ |
PHI-BLAST | Web server for searching protein databases for proteins that contain a specific pattern and are similar to the query sequence in the vicinity of the pattern | http://blast.ncbi.nlm.nih.gov/Blast.cgi |
PINTS | Web server for searching structural motifs in protein structures | http://www.russell.embl.de/pints/ |
ProFunc | Web server for structure-based prediction of protein function | http://www.ebi.ac.uk/thornton-srv/databases/ProFunc/ |
PROMISE | Database of literature-based annotation of metalloproteins | http://metallo.scripps.edu/PROMISE/ |
SCOP | Database of hierarchically classified protein structures | http://scop.mrc-lmb.cam.ac.uk/scop/ |
SSM | Web server for protein structure comparison | http://www.ebi.ac.uk/msd-srv/ssm/ |
STRING | Database of known and predicted protein–protein interactions | http://string-db.org/ |
The most obvious limitation of our method is that it cannot detect metalloproteins which do not contain either a known metal-binding domain or a known metal-binding pattern. In a way, this can be seen as a general limitation of predictive computational approaches, all of which ultimately depend on the available knowledge. However, the number of structurally novel metalloproteins being discovered each year, which may extend the knowledge on which our method is based, is progressively reducing, suggesting that future predictions will bring about relatively small corrections to those made today.32 An alternative means to tackle the challenge of identifying metalloproteins, which in principle can detect also those with unknown domains and patterns, is provided by machine learning techniques such as neural networks and support vector machines (SVMs).34–36 These methods do not, in fact, escape the rule that what can be predicted depends on what is known, but are often referred to as ‘de novo’ methods because they are able to find non-trivial correlations between sequence features and protein properties, extrapolating regularities which cannot be encoded in profiles or regular expressions. A number of machine learning approaches to predict metalloproteins have been developed in the last few years,37–40 sometimes coupled to standard similarity-based predictions,41 and have been shown to be indeed able to identify unprecedented metal-binding sites.39,40 Other attempts have also been reported based on the identification of metal-binding domains alone,42 as well as of metal-binding patterns alone.43 It is likely, however, that such predictions based on the recognition of a single signature may suffer from the limitations described here, which need to be overcome by the integration of independent, complementary tools.
It is well known that the function of proteins is closely associated with their specific tertiary structures and, as a consequence, structural knowledge is of the utmost importance to understand protein function in detail. This statement is particularly relevant in the case of metalloproteins, where the structural and chemical features of the protein environment around the metal-binding site can finely modulate the functionality of the bound metal, which in turn is often responsible for protein function (e.g., in metalloenzyme catalysis).52 Furthermore, functional similarity can be inferred from structural similarity even when sequence similarity is undetectable, because proteins with a common evolutionary origin can maintain similar folds and functions over time even though their sequences diverge to a point where their relatedness cannot be recognized.53 For these reasons, the methods for protein function prediction based on structural information are generally more powerful than those based on sequence alone.14
Not unlike their sequence-based counterparts, structure-based approaches for predicting protein function typically rely on detecting structural similarity, global or local, between the protein under consideration and a protein of known function.54 Global similarity searches involve the use of structure alignment tools such as SSM55 and DALI56 to compare the query structure against the PDB, or against databases like CATH57 and SCOP58 where PDB structures are classified into groups depending on structural and functional similarity. The functional clues provided by these searches, however, may be of limited significance in some cases, because certain folds occur very commonly in nature and are shared by proteins carrying out a wide variety of different functions.59,60 Therefore, even proteins with very similar global structures can have very different functions as the result of local differences in functionally important regions such as catalytic or binding sites. On the other hand, convergent evolution can produce similar functional sites in proteins with unrelated folds, thus causing proteins with very different global structures to perform similar functions.61 For these reasons, it is convenient to combine global similarity searches with local ones, in which the query structure is compared against data sets of local structural motifs associated with known functional sites. In these comparisons, structural motifs are typically represented as three-dimensional templates containing the spatial coordinates of the residues that form the site (e.g., the Ser-His-Asp catalytic triad of serine proteases),62 which are matched against the query structure using tools such as PINTS63 or JESS.64 In the case of metalloproteins, in particular, local similarity searches can provide key functional hints by detecting known metal-binding sites with defined functions (e.g., electron transfer, oxygen binding). 
Therefore, in silico characterization of metalloproteins can especially benefit from the availability of metal-specific libraries of metal-binding structural motifs annotated with functional information.
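To give a flavour of what a local similarity search involves, the sketch below compares a three-dimensional template with a candidate set of residues using the distance RMSD (dRMSD), i.e., the root-mean-square difference between corresponding inter-point distances. This is a deliberately simplified stand-in chosen because it needs no superposition; it is not the algorithm used by PINTS or JESS. It assumes that each residue is reduced to a single representative point and that the residue correspondence is already known. Note that dRMSD cannot distinguish a site from its mirror image.

```python
import math
from itertools import combinations


def drmsd(template, candidate):
    """Distance RMSD between two equally ordered lists of (x, y, z) points."""
    if len(template) != len(candidate):
        raise ValueError("template and candidate must have the same size")
    if len(template) < 2:
        raise ValueError("at least two points are required")
    squared_diffs = []
    for i, j in combinations(range(len(template)), 2):
        d_template = math.dist(template[i], template[j])
        d_candidate = math.dist(candidate[i], candidate[j])
        squared_diffs.append((d_template - d_candidate) ** 2)
    return math.sqrt(sum(squared_diffs) / len(squared_diffs))


# Hypothetical three-residue template (coordinates in angstroms).
template = [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0), (0.0, 4.0, 0.0)]
# Same geometry, translated: internal distances are unchanged.
shifted = [(x + 10.0, y - 5.0, z + 2.0) for x, y, z in template]
# One residue displaced by 4 angstroms: geometry no longer matches.
distorted = [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0), (0.0, 8.0, 0.0)]

print(drmsd(template, shifted))
print(drmsd(template, distorted))
```

A match would then be called when the dRMSD falls below a chosen cutoff; the value of such a cutoff is not given in the text and would need to be calibrated.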
We have recently developed a method, which is schematically depicted in Fig. 3, to build libraries of metal-binding motifs in a largely automated fashion.65 The key feature of these libraries is that metal-binding motifs are represented as three-dimensional templates that contain not only the metal ligands (i.e., the first coordination sphere of the metal) but also the residues interacting with the metal ligands (i.e., the second coordination sphere of the metal), as shown in Fig. 4. This choice reflects the notion that the second coordination sphere can contribute to modulate metal function,66,67 therefore it must be included in templates to properly describe the structural and chemical features of metal-binding sites.
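The idea of a template that covers both coordination spheres can be sketched with a simple distance-based selection. This is a minimal illustration, not the procedure of ref. 65: the two cutoff distances, the function name and the toy coordinates are assumptions introduced here, and "interacting with the metal ligands" is approximated as having any atom within a cutoff of any atom of a ligand residue.

```python
import math


def coordination_spheres(metal_xyz, atoms, ligand_cutoff=2.8, contact_cutoff=3.5):
    """Return (first_sphere, second_sphere) as sets of residue identifiers.

    metal_xyz      -- (x, y, z) of the metal ion
    atoms          -- list of (residue_id, (x, y, z)) for protein atoms
    ligand_cutoff  -- max metal-atom distance for a ligand (assumed value)
    contact_cutoff -- max atom-atom distance for a contact (assumed value)
    """
    # First sphere: residues with at least one atom close to the metal.
    first = {res for res, xyz in atoms if math.dist(xyz, metal_xyz) <= ligand_cutoff}
    ligand_atoms = [xyz for res, xyz in atoms if res in first]
    # Second sphere: other residues in contact with any ligand atom.
    second = {
        res
        for res, xyz in atoms
        if res not in first
        and any(math.dist(xyz, lig) <= contact_cutoff for lig in ligand_atoms)
    }
    return first, second


# Hypothetical toy site (coordinates in angstroms, metal at the origin).
metal = (0.0, 0.0, 0.0)
atoms = [
    ("HIS10", (2.1, 0.0, 0.0)),    # donor atom bound to the metal
    ("HIS10", (4.0, 0.0, 0.0)),    # another atom of the same ligand
    ("GLU50", (0.0, 2.0, 0.0)),    # second ligand
    ("ASP70", (6.5, 0.0, 0.0)),    # contacts HIS10, not the metal
    ("LEU90", (20.0, 20.0, 20.0)), # far from the site
]

first_sphere, second_sphere = coordination_spheres(metal, atoms)
print(sorted(first_sphere), sorted(second_sphere))
```

The coordinates of all atoms belonging to residues in the two sets would then make up the template.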
Fig. 3 Scheme depicting our method for the construction of libraries of metal-binding structural motifs (Me = any metal). Primary information sources for protein structures (PDB), protein structure classification (CATH, SCOP) and function (literature) are examined by a mixture of manual and automated analysis to build libraries of metal-binding structural motifs in the form of three-dimensional templates containing the coordinates of the first and the second coordination sphere of the metal. The motifs are grouped according to CATH and SCOP classification, and annotated with respect to function. Protein structures can then be scanned for similarity to known metal-binding motifs using structural alignment tools (e.g., FAST)131 to obtain functional information. Also, the data set can be automatically analysed to obtain information on various characteristics (e.g., amino acid composition) of metalloproteins binding a given metal.
Fig. 4 Example of a three-dimensional template that includes the first and the second coordination sphere of the metal to represent a metal-binding site. The iron site in phenylalanine hydroxylase (PDB code 1ltz)132 is described by the spatial coordinates of the iron ligands (shown as red sticks) plus those of the residues interacting with such ligands (magenta sticks). Iron is shown as a yellow sphere.
In our method, protein structures that physiologically bind the metal of interest are selected from the PDB in the same way described in the previous section (in fact, the exact same collection can be used), and the structural domains present in these proteins are identified and classified according to the CATH and SCOP databases. Three-dimensional templates representing the metal-binding sites of these proteins are automatically generated from the PDB structures by extracting the spatial coordinates of the residues forming the first and the second coordination sphere of the metal, as defined above. The metal-binding sites (and thus the corresponding templates) are then associated with the structural domain in which they are found, by mapping the metal ligands onto the domains defined by CATH and SCOP. For example, the iron-binding protein desulfoferrodoxin68 contains an iron-binding site in the N-terminal domain, which is classified in SCOP as a rubredoxin-like domain, and an iron-binding site in the C-terminal domain, which is classified in SCOP as an immunoglobulin-like beta-sandwich domain. The N-terminal site is then associated with the rubredoxin-like domain, and the C-terminal site with the immunoglobulin-like beta-sandwich domain. Metal-binding sites found in the same structural domain are grouped together, allowing one to evaluate the redundancy and, if needed, adjust the size of the data set by selecting a number of representative sites from each group. The PDB is in fact highly redundant in that it may contain many structures of the same protein (e.g., determined under various conditions), as well as of highly similar proteins (e.g., mutants, homologues from closely related species). In the context of function prediction, the representative sites of a group must be selected so as to cover the range of different functions performed by the members of the group, i.e., by the metal-binding sites found in a given structural domain.
Importantly, this grouping procedure also allows a substantial reduction in the time required to annotate the functions of metal-binding sites, because the annotation process can be performed on a per-group basis rather than for individual proteins. Still, literature analysis represents the most costly step in terms of manual effort, although (as discussed in the previous section for sequence-based libraries), it is much more time-consuming in the initial construction of the data set than in subsequent updates.
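The grouping and representative-selection steps can be sketched as follows. The site records, their identifiers and the function labels are hypothetical placeholders (only the two domain names are taken from the desulfoferrodoxin example above), and keeping the first site encountered for each distinct function is just one simple way of covering the functions of a group.

```python
from collections import defaultdict


def group_by_domain(sites):
    """Group site records by the structural domain in which they are found."""
    groups = defaultdict(list)
    for site in sites:
        groups[site["domain"]].append(site)
    return dict(groups)


def select_representatives(group):
    """Keep one site per distinct annotated function within a group."""
    chosen = {}
    for site in group:
        # setdefault keeps the first site seen for each function.
        chosen.setdefault(site["function"], site)
    return list(chosen.values())


# Hypothetical site records; functions are placeholders, not annotations.
sites = [
    {"id": "site1", "domain": "rubredoxin-like", "function": "function A"},
    {"id": "site2", "domain": "rubredoxin-like", "function": "function A"},
    {"id": "site3", "domain": "rubredoxin-like", "function": "function B"},
    {"id": "site4", "domain": "immunoglobulin-like beta-sandwich", "function": "function C"},
]

groups = group_by_domain(sites)
library = {domain: select_representatives(group) for domain, group in groups.items()}
for domain, reps in library.items():
    print(domain, [site["id"] for site in reps])
```

Because functions are attached per group, a curator only needs to annotate each distinct function of a domain once, which is the saving in manual effort mentioned in the text.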
The result of the above procedure is a library, organized on a structural and functional basis, of metal-binding motifs represented by three-dimensional templates and annotated with functional information, against which protein structures can be compared to gain detailed insight into their metal-dependent functional properties. On a large scale, such comparisons can involve experimental protein structures produced by structural genomics projects,69 which are typically devoid of functional information, as well as protein structures predicted from amino acid sequences using homology modelling or fold recognition techniques.70 The extent to which complete proteomes can be analysed by this approach, therefore, is determined by the fraction of protein sequences in the proteome for which a reliable structural model can be obtained. Homology modelling can currently achieve about 65% structural coverage of whole proteomes, although this fraction can vary widely from organism to organism.71,72
In addition to serving as a reference for structural similarity searches, a library of metal-binding motifs designed as above lends itself to a number of analyses which can themselves provide useful information. For example, the templates included in the library can be compared against each other, so as to reveal common metal-binding motifs in structurally unrelated proteins, and highlight fine differences in the metal-binding sites of proteins with similar structures but different functions. This analysis can thus give interesting hints on the mechanism of action and the evolution of metalloproteins. Also, the library can be examined for the amino acid composition of metal-binding sites, yielding indications on which residues are important to modulate the properties and determine the specific functions of a metal. In a case study conducted on non-heme iron sites,65 we illustrated all the above applications of a library of metal-binding motifs constructed by our method and, in particular, we obtained functional hints for 14 of 15 iron proteins with unknown function, showing that local similarity searches based on these libraries can be advantageously used for function prediction of unannotated metalloproteins.
The fundamental concepts underlying our approach are common to the large majority of tools that have been developed to recognize metal-binding motifs in protein structures. Generally speaking, all such methods extract information on known metal-binding motifs from the PDB, encode this information in some form, and then scan protein structures for matches to the motifs. What differs among them is the type of information used to describe the metal-binding motifs and, consequently, the way protein structures are analysed for matches. The three-dimensional templates used in our approach, which are generated directly from PDB coordinates and can be compared to query structures by structural alignment, represent a straightforward, intuitive way to describe structural motifs. Analogous templates, which however contain only the residues that coordinate the metal, are employed to detect metal-binding sites in structure-based function prediction servers such as PDBSiteScan73 and ProFunc,74 and have also been used to investigate the evolution of zinc-binding and calcium-binding sites involved in protein structure stabilization.75
In other approaches, the structures of metal-binding sites are not specified by spatial coordinates, but are described abstractly in terms of various selected parameters, which collectively make up the “fingerprint” of a site. Relatively simple descriptions consider only geometrical parameters such as combinations of distances between atoms of metal-coordinating residues,76 whereas more complex descriptions also take into account features like secondary structure and solvent accessibility, as well as information not obtainable from a single structure such as the degree of conservation of metal-binding residues across homologues.77,78 A singular description, which has been used to identify magnesium-binding sites, involves the conversion of structures into sequences of letters drawn from a so-called structural alphabet based on the conformation of five-residue protein fragments.79 In a study aimed at detecting zinc-binding sites, no less than 43 parameters describing the physicochemical properties of the protein in six concentric shells around the metal were considered.80 These approaches typically exploit machine learning techniques because, as mentioned in the previous section, such techniques are able to infer complex rules from the analysis of many different parameters: in a nutshell, the combinations of parameters that characterize metal-binding sites are derived from the analysis of known metalloproteins, and query structures are then examined for the occurrence of such combinations. 
A conspicuous exception in this scenario is represented by the Fold-X method, in which the information derived from known metalloproteins is translated into empirical force field parameters, and metal-binding sites are detected by calculating energetically favourable positions for metals in protein structures.81 Some of the above approaches have also been shown to be effective when applied to structures obtained from homology modelling, suggesting that they may be used on a genome-wide scale.80,82
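As an illustration of the simplest kind of "fingerprint" mentioned above, the following Python sketch describes a site by the sorted list of distances between its ligand atoms and compares two such lists within a tolerance. It is not the method of any of the cited tools: the function names, the coordinates and the 0.5 Å tolerance are assumptions chosen for the example.

```python
import math
from itertools import combinations

def distance_fingerprint(coords):
    """Sorted pairwise distances between the given atom coordinates."""
    return sorted(math.dist(a, b) for a, b in combinations(coords, 2))

def fingerprints_match(fp_ref, fp_query, tolerance=0.5):
    """True if the fingerprints have equal length and every pair of
    corresponding distances differs by at most `tolerance` (angstrom)."""
    if len(fp_ref) != len(fp_query):
        return False
    return all(abs(r - q) <= tolerance for r, q in zip(fp_ref, fp_query))

# Hypothetical coordinates of four ligand atoms in a known site.
reference = [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0), (0.0, 3.0, 0.0), (0.0, 0.0, 3.0)]
# The same arrangement, translated and slightly perturbed.
similar = [(10.0, -2.0, 5.0), (13.1, -2.0, 5.0), (10.0, 1.0, 5.0), (10.0, -2.0, 8.0)]
# An arrangement stretched along one axis.
different = [(0.0, 0.0, 0.0), (6.0, 0.0, 0.0), (0.0, 3.0, 0.0), (0.0, 0.0, 3.0)]

ref_fp = distance_fingerprint(reference)
match_similar = fingerprints_match(ref_fp, distance_fingerprint(similar))
match_different = fingerprints_match(ref_fp, distance_fingerprint(different))
```

Because only interatomic distances are used, the comparison needs no structural superposition, but it cannot distinguish a site from its mirror image and ignores residue identity; the richer descriptions discussed above add such features.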
Databases lie at the heart of modern genome-based biology, forming the infrastructure necessary for the collection, maintenance and provision of biological information. They are indispensable resources in an increasingly information-rich science, in which formidable amounts of data coming from different research fields need to be catalogued to be analysed and interpreted. Also, databases represent a long-term protection of the research efforts and investments made to generate the data, which due to their size can no longer be published in a conventional sense. Despite the large number of web-accessible data resources created over the last years, public databases devoted to metalloproteins have been surprisingly scarce.88 We have recently developed one such database, called Metal-MACiE, which aims to organize the available knowledge on the catalytic mechanisms of metalloenzymes.89 It presents a detailed description of the properties and the roles of metals involved in enzyme reactions, and has provided a basis for analysing the specific functions of different metals in relation to their individual chemical properties and availability in the environment.90 In the past, the MDB91 and the PROMISE92 databases were created with the aim of providing a comprehensive resource for metalloproteins. They were intended to serve as complementary databases, in that MDB provided quantitative information on the structural and chemical features of metal-binding sites retrieved from PDB structures, and PROMISE provided qualitative information on metalloproteins in the form of descriptive annotations derived from the literature. Unfortunately, both of them were discontinued some years ago. At present, a functionality similar to that of MDB is available through the PDBeMotif (formerly called MSDmotif)93 resource at the PDBe database (formerly called MSD),94 which can be used to interactively analyse several types of protein sites and motifs with known structure.
However, PDBeMotif, as well as its predecessor MSDsite,95 is not a tool designed to investigate metalloproteins, and its use for this purpose is limited by its complexity and the lack of specific information provided on metal-binding sites.
In light of the above, it appears that a comprehensive, up-to-date database which collects, organizes and makes easily available the current knowledge on metalloproteins does not presently exist. In our opinion, the key to achieving the ambitious goal of gathering, on a single platform, the exceptional variety of metalloproteins is to design a database architecture based on a rigorous classification of metal-binding sites. A similar concept was put forward by the developers of the COMe ontology (i.e., a formal definition of concepts of a given area of knowledge such as bioinorganic chemistry, described in a standardized form),96 who proposed to catalogue metalloproteins by describing their metal-binding motifs according to a standard formalism.97 We have suggested that the representation of metal-binding sites by way of the three-dimensional templates discussed in the previous section can provide a useful basis for their classification, because it allows them to be compared in a systematic and largely automated fashion.65 The structure-based categorization of metal-binding sites could provide the fundamental framework for the database, in which other kinds of information such as functional and sequence information could be integrated as sketched in Fig. 5. This information should also be endowed with a defined ontology (see above), so as to design an efficient and flexible query system, thus optimizing database access, and to automate as much as possible the generation and processing of data, thus facilitating database maintenance and updates. Nevertheless, this task can be far from trivial for certain kinds of data: in the case of functional information, for example, the definition of controlled vocabularies such as the Gene Ontology (GO) system is still the subject of active research.98
Fig. 5 Scheme depicting a possible architecture for a comprehensive database of metalloproteins. At the core of the database are three-dimensional templates representing all known metal-binding sites, which are classified based on their structural features (see also Fig. 3). Metal-binding sites and the metalloproteins containing them are annotated with functional descriptions derived from literature analysis and following the GO terminology as much as possible. Sequence information may include the amino acid sequences of the metalloproteins (taken from UniProt), the Pfam domains identified within these sequences, and the sequences of related metalloproteins predicted using the methods described in the first section (e.g., containing the same Pfam domain, or the same metal-binding pattern). In principle, any other kind of information may also be incorporated, such as information regarding the sub-cellular localization of metalloproteins (e.g. in mitochondria, derived from the MitoP2 database).133
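One way to picture the architecture outlined in Fig. 5 is as a small data model in which the structurally classified site is the central record and functional, sequence and localization data hang off it. The Python sketch below is purely hypothetical: all class names, field names and values are placeholders introduced for illustration and do not describe an existing database.

```python
from dataclasses import dataclass, field

@dataclass
class MetalSite:
    site_id: str
    metal: str
    structural_class: str              # label from the template-based classification
    template_residues: list            # first and second coordination sphere residues
    go_terms: list = field(default_factory=list)   # functional annotation

@dataclass
class Metalloprotein:
    accession: str                     # e.g. a UniProt accession
    pfam_domains: list
    sites: list
    localization: str = ""             # optional sub-cellular localization

def proteins_with_site_class(proteins, structural_class):
    """Accessions of proteins having at least one site of the given class."""
    return [
        protein.accession
        for protein in proteins
        if any(site.structural_class == structural_class for site in protein.sites)
    ]

# Placeholder entries; none of these identifiers refers to a real record.
site_a = MetalSite("site_a", "Fe", "class_1", ["res_1", "res_2", "res_3"],
                   go_terms=["GO_term_placeholder"])
site_b = MetalSite("site_b", "Zn", "class_2", ["res_4", "res_5"])
proteins = [
    Metalloprotein("ACC_0001", ["pfam_placeholder_1"], [site_a], "mitochondrion"),
    Metalloprotein("ACC_0002", ["pfam_placeholder_2"], [site_b]),
    Metalloprotein("ACC_0003", ["pfam_placeholder_1"], [site_a, site_b]),
]

class_1_proteins = proteins_with_site_class(proteins, "class_1")
```

Keeping the structural class on the site rather than on the protein means that a protein with several sites, such as the third entry, can be retrieved through any of them.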
A comprehensive resource on metalloproteins like that envisaged above would provide an unprecedented, unified vision of these systems, and would thus constitute an important reference for many research areas also outside bioinorganic chemistry. In the context of today’s systems biology approaches, however, the knowledge contained in such a database would be fully exploited only by connecting it to that contained in other, diverse resources. At a minimum, connections may be made by suitable cross-references to other databases such as the PDB and Pfam, which in turn could be linked to the metalloprotein database. Ideally, the information on metalloproteins would be incorporated into integrated resources such as InterPro and Integr8, where several heterogeneous data are combined together and can be accessed through a single interface.13
Interactome data are typically represented in the form of network diagrams, where any two interacting proteins are represented as two nodes connected by an edge. This highly abstract representation allows one to examine networks by using concepts from graph theory, thus obtaining important information on their topological properties.102,114 For example, graph analysis can be used to identify densely connected sub-networks which are likely to represent proteins involved in the same biological process.115–117 However, such a simplified formal description of interactions has only a limited relationship with physical reality, and appears to be inadequate for obtaining meaningful insight into cellular functioning. Furthermore, currently available interactome data are still largely incomplete and affected by high error rates, thus causing a need for effective means to discriminate between true positive and false positive interactions.118 These issues can be addressed by integrating various types of biological information within the framework provided by the interaction network. For example, protein structure is often regarded as crucial information to validate interactions, determining which interactions are compatible with each other and which ones are mutually exclusive, as well as to estimate quantitative interaction parameters such as affinity constants.119–122 Also, information on protein sub-cellular localization and specificity of expression can be used to single out interactions between proteins which, despite having the potential to interact, are never found in the same cellular compartment, or even (in multi-cellular organisms) in the same cell, or tissue. Incorporating the information associated with the metal-binding properties of proteins, therefore, can be very useful to characterize individual protein–protein interactions in networks (see Fig. 6 for example). 
In particular, this knowledge may be essential to assess metal-mediated interactions, which are transient protein–protein interactions occurring only in the presence of the metal (e.g., between the copper-binding Atx1 and Ccc2 proteins)123 and which may be overlooked in large-scale approaches for interactome mapping. On the other hand, positioning metal-binding proteins in an interaction network can help to assess the metal-binding properties that were predicted for those proteins. Computational predictions in fact indicate the potential capability of a protein to bind a given metal, whereas the binding of specific metals by proteins in vivo may involve elaborate mechanisms relying on controlled metal delivery mediated by metal-specific carriers, and protein compartmentalization.124,125 Therefore, a valuable indication as to the metal actually bound in vivo by a predicted metalloprotein can be obtained from the knowledge of the cellular context in which it acts.
Fig. 6 Example of protein–protein interaction network analysis supported by integrating the information on the metal-binding properties of proteins. In this hypothetical case, proteins binding a given metal (shown as grey boxes) form a highly connected sub-network (sub-network A) which is linked by protein P9 to another sub-network (sub-network B) comprising proteins that do not bind that metal (shown as white boxes). This suggests that P9 is a multifunctional protein playing a role in both the cellular management of the metal (the function associated with sub-network A) and a different cellular process (associated with sub-network B). Also, sub-network A contains a protein (protein P3) that was not predicted to bind the metal. Protein P3 is thus an interesting target for characterization, which may reveal a novel metalloprotein.
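The reasoning illustrated in Fig. 6 can be sketched as a simple neighbourhood test on an interaction graph: a protein that is not predicted to bind the metal, but whose interaction partners mostly do, is flagged as a candidate. The toy network below is loosely modelled on the figure and is not a reproduction of it; the edges, the set of predicted metal binders and the 0.75 threshold are all assumptions made for the example.

```python
from collections import defaultdict

def build_graph(edges):
    """Undirected adjacency sets from a list of interacting protein pairs."""
    graph = defaultdict(set)
    for a, b in edges:
        graph[a].add(b)
        graph[b].add(a)
    return dict(graph)

def metal_neighbour_fraction(graph, metal_binders, node):
    """Fraction of a node's interaction partners predicted to bind the metal."""
    neighbours = graph[node]
    if not neighbours:
        return 0.0
    return sum(n in metal_binders for n in neighbours) / len(neighbours)

def candidate_metalloproteins(graph, metal_binders, threshold=0.75):
    """Proteins not predicted to bind the metal whose partners mostly do."""
    return sorted(
        node for node in graph
        if node not in metal_binders
        and metal_neighbour_fraction(graph, metal_binders, node) >= threshold
    )

# Hypothetical interactions: P1-P4 form a dense sub-network, P9 bridges
# it to a second sub-network made of P10-P12.
edges = [
    ("P1", "P2"), ("P1", "P3"), ("P1", "P4"), ("P2", "P3"), ("P2", "P4"),
    ("P3", "P4"), ("P4", "P9"), ("P9", "P10"), ("P9", "P11"),
    ("P10", "P11"), ("P10", "P12"), ("P11", "P12"),
]
metal_binders = {"P1", "P2", "P4", "P9"}   # assumed prediction results

graph = build_graph(edges)
candidates = candidate_metalloproteins(graph, metal_binders)
```

Here only P3 is flagged, since all of its partners are predicted metal binders, whereas the proteins of the second sub-network have at most one such partner. Real interactome data would call for a more robust criterion, given the high error rates mentioned above.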
A major goal of systems biology approaches is to translate biological networks into mathematical models, linking the behaviour of a system to the whole of the interactions between its molecular components.10,126 Mathematical modelling provides an essential framework to simulate the complex, space- and time-dependent processes taking place in biological systems, and requires a comprehensive, quantitative description of the networks underlying the modelled processes. It is becoming increasingly clear118 that such a comprehensive description, which also involves the experimental determination of several parameters such as kinetic constants and diffusion coefficients, can currently be achieved only for small-scale biological systems such as certain signalling pathways127 and cell motility machineries.128 In this framework, the integration of metalloprotein information into interactome networks can be used to isolate sub-networks relevant to specific metals, highlighting the sets of interacting proteins responsible for the management and the utilization of those metals. Modelling and experimental efforts could then be focused on these sets, setting up an iterative cycle in which predictions formulated by the model are tested by experiments, whose results in turn allow the refinement of the model. Bioinorganic chemistry and bioinformatics would thus support each other in a synergistic fashion, towards achieving the ultimate goal of describing the mechanisms by which metals are framed in living organisms.
Footnote
† Electronic supplementary information (ESI) available: Example protocol for the identification of zinc proteins based on our method. See DOI: 10.1039/b912156k
This journal is © The Royal Society of Chemistry 2010