The year 2020 in natural product bioinformatics: an overview of the latest tools and databases

Bioinformatic approaches to document and analyse chemical structures, biosynthetic gene clusters and analytical data play an important role in the study of natural products. So many new algorithms, tools and databases are released every year that it is difficult to keep track of all the latest developments. The aim of this short article is to provide a concise overview of, and reference to, the major tools, methods and databases that have been released in the past year.


Introduction
The study of natural products involves various types of data, including structural, genomic, metabolomic and spectroscopic data. All of these require computational algorithms and resources to effectively process, analyze and contextualize them. The past decade has seen an acceleration in the development of new tools and databases that are relevant to natural product researchers. Here, we provide a concise overview of the latest tools and databases for the analysis of natural product chemical structures, the identification and annotation of biosynthetic gene clusters, and the analysis of natural product diversity in metabolomic datasets. The ESI † includes a table listing all tools and databases discussed here.

Chemical structure databases
Following the release of the Natural Product Atlas in late 2019, 1 several specialized databases for natural products from specific organisms or compound classes were released. These included version 3.0 of StreptomeDB, 2 which added 2500 new natural product structures from streptomycetes, as well as CyanoMetDB, 3 a database covering 2000 natural product structures from cyanobacteria. From a compound class-guided perspective, MacrolactoneDB 4 appeared, which includes 14 000 macrolactone structures and their bioactivity information. The NORINE database for nonribosomal peptides also saw a new release, 5 which integrated the recently published retrobiosynthetic algorithm rBAN 6 to automatically identify the constituent monomers and other building blocks of these important natural products. While natural product structures remain distributed across many different databases, the COlleCtion of Open NatUral producTs (COCONUT) 7 combines structures from a wide range of open-access databases into a single resource.

Cheminformatic tools
To utilize and leverage such structural data, a number of relevant new cheminformatic tools have appeared. NPClassifier 8 is a deep-learning-based algorithm that can help automatically classify sets of structures (e.g., taken from a database or obtained from a set of library matches in a mass-spectrometric dataset) into classes and superclasses; thus, it can automatically identify whether molecules are, e.g., terpenoids, polyketides or peptides. To map chemical space in more detail, and to identify structural similarities between molecules, molecular fingerprints are often used. In this area, two new fingerprint technologies, NC-MFP 9 and MAP4, 10 were presented that showed promising performance in explaining biological activities or differentiating closely related metabolites, respectively. Finally, to help seed compound structure databases, a new method, DECIMER, 11 was developed to recognise chemical structures from images in journal papers.
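As an illustration of how fingerprints are typically compared, the following minimal Python sketch computes the Tanimoto (Jaccard) coefficient between fingerprints represented as sets of "on" bit indices and ranks a small library against a query. It is a generic example and does not reproduce how NC-MFP or MAP4 are computed or compared; the function names, compound names and bit indices are hypothetical.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto (Jaccard) similarity of two fingerprints given as sets of 'on' bit indices."""
    if not fp_a and not fp_b:
        return 1.0  # convention chosen here: two empty fingerprints count as identical
    shared = len(fp_a & fp_b)
    return shared / (len(fp_a) + len(fp_b) - shared)


def rank_by_similarity(query, library):
    """Return (name, similarity) pairs from the library, most similar first."""
    scores = [(name, tanimoto(query, fp)) for name, fp in library.items()]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Hypothetical toy fingerprints; real ones would come from a cheminformatics toolkit.
library = {
    "compound_A": {1, 4, 7, 9, 15},
    "compound_B": {1, 4, 7, 9, 22},
    "compound_C": {2, 3, 30},
}

if __name__ == "__main__":
    for name, score in rank_by_similarity(library["compound_A"], library):
        print(f"{name}: {score:.2f}")
```

In practice the bit sets would be generated from structures by a dedicated toolkit, and the choice of fingerprint largely determines which structural differences such a score can resolve.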

Identifying biosynthetic gene clusters
Genome mining is playing an increasingly important role in natural product discovery. A range of well-known methods is available to identify biosynthetic gene clusters (BGCs) in genomes. Several of these were updated this year, such as PRISM4 (ref. 12) (see discussion under 'Predicting chemical structures'), as well as SeMPI version 2.0, 13 which includes matching of predicted BGC products to natural product databases. Several new approaches were added to this set of tools: EvoMining 14 is able to look for bacterial biosynthetic pathways that show no or only limited sequence similarity to known biosynthetic systems, by identifying paralogues of primary metabolic enzymes that have undergone accelerated evolution towards a secondary metabolic function. Aimed at fungi, CO-OCCUR 15 provides a new way of identifying BGCs based on shared syntenic relationships between biosynthetic genes. Another fungal BGC identification tool, TOUCAN, 16 was also released. BGCs that are particularly challenging to identify computationally are those encoding the biosynthesis of Ribosomally synthesized and Posttranslationally modified Peptides (RiPPs), because of the apparently large diversity of unknown RiPP classes for which rule-based detection is not possible (as the knowledge required to design such rules is not yet available). 17 Several new algorithms were released this year that apply machine learning and pattern-recognition approaches to this end, including DeepRiPP, 18 RRE-Finder 19 and decRiPPter, 20 on top of other approaches like RiPPER 21 that had been published last year; some of these aim at the identification of novel RiPP classes. Another tool to identify RiPP biosynthetic pathways, RODEO, was extended with capabilities to explicitly identify linaridins. 22

Charting biosynthetic gene cluster diversity
To be able to cope with datasets covering thousands or even hundreds of thousands of genomes, new algorithms were released to chart the diversity of BGCs in genomic data. BiG-SCAPE and CORASON 23 enable automated sequence similarity networking and reconstruction of BGC phylogenies to facilitate the exploration of thousands of BGCs from diverse organisms. More recently, BiG-SLICE 24 was released, which scales up this principle by allowing the grouping of millions of BGCs into gene cluster families; the BiG-FAM database 25 makes these gene cluster families easily searchable for the scientific community, and allows assignment of BGCs to such families directly from antiSMASH results. The new cblaster tool 26 provides a quick way to perform similarity searches of BGCs by remotely querying the NCBI web services, and the related clinker 27 tool provides a highly user-friendly method for visual gene cluster comparisons between selected BGCs.
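The idea of grouping BGCs into gene cluster families through similarity networking can be sketched in a few lines of Python. The example below is a deliberately simplified assumption, not the algorithm used by BiG-SCAPE or BiG-SLICE: each BGC is reduced to a set of protein domain names, pairwise similarity is the Jaccard index of those sets, and families are the connected components of the network formed by pairs scoring at or above a cutoff. All BGC names, domain labels and the cutoff value are hypothetical.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard index of two sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def group_into_families(bgcs, cutoff=0.5):
    """Single-linkage grouping: BGCs connected by any chain of pairs with
    similarity >= cutoff end up in the same family."""
    parent = {name: name for name in bgcs}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in combinations(bgcs, 2):
        if jaccard(bgcs[a], bgcs[b]) >= cutoff:
            parent[find(a)] = find(b)

    families = {}
    for name in bgcs:
        families.setdefault(find(name), set()).add(name)
    return sorted(sorted(members) for members in families.values())


# Hypothetical toy input: BGC name -> set of domain labels.
bgcs = {
    "bgc1": {"PKS_KS", "PKS_AT", "ACP", "KR"},
    "bgc2": {"PKS_KS", "PKS_AT", "ACP", "DH"},
    "bgc3": {"Condensation", "AMP-binding", "PCP"},
    "bgc4": {"Condensation", "AMP-binding", "PCP", "Thioesterase"},
}

if __name__ == "__main__":
    print(group_into_families(bgcs, cutoff=0.5))
```

Raising the cutoff splits families into smaller, more homogeneous groups, which is why the choice of threshold matters when comparing family counts between studies.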

Biosynthetic gene cluster databases
Several databases of biosynthetic gene clusters were also updated or released this year. The MIBiG repository for experimentally characterized biosynthetic gene clusters saw a second release, 28 in which 851 new BGCs were added and the database was made searchable online. Two databases for computationally predicted BGCs, antiSMASH-DB 29 and IMG-ABC, 30 were also updated with new features, including extension with fungal data and fully refreshed contents, respectively. A new atlas of fungal BGCs from 1000 fungal genomes, called Prospect, was also released, which includes gene cluster family assignments for these gene clusters. 31 Finally, databases with curated sets of high-quality genomes, such as the ActDES database for actinomycetes 32 released this year, will make it easier to work with high-quality data when exploring the biosynthetic potential of various taxa.

Target-based genome mining
Finding the needle in the haystack within these giant datasets is not trivial. Target-based genome mining approaches make it possible to identify BGCs encoding the production of natural products with a biological activity of interest, such as antibiotics. An updated version of the ARTS pipeline 33 now enables identification of potential self-resistance genes in BGCs from across the tree of life, including metagenomic data. A similar approach, specifically dedicated to polyketide BGCs, was also released by others. 34 Additionally, a new study shows that transporter-encoding genes can also be used as functional markers for target-based genome mining. 35

Predicting chemical structures
The ability to (partially) predict chemical structures of the products of BGCs is key for identifying potential chemical novelty during the genome mining process, as well as for matching BGCs to metabolites from analytical data. Several new tools have been developed that can aid in such efforts. The new version 4 of PRISM 12 has improved chemical structure prediction capabilities, which made it possible to train machine-learning models to predict the biological activity of BGC products based on these structure predictions. Two new algorithms, DDAP 36 and PKSpop, 37 provide improved prediction of docking domain interactions between polyketide synthases, which determine the order of these enzymes in the assembly lines, and thus also the order of the incorporated monomers in their final products. To go from monomers towards final products, another group published a machine-learning method that predicts macrocyclization patterns for both polyketides and nonribosomal peptides. 38 Beyond megasynthases, the AdenylPred 39 algorithm presents a new method to predict catalytic functions and substrate specificities for the whole superfamily of adenylate-forming enzymes, which include not only nonribosomal peptide synthetase adenylation domains, but also e.g. fatty-acyl CoA-ligases and beta-lactone synthetases.

Analysing natural product NMR data
Elucidating chemical structures is arguably worth more than predicting them. NMR data play a crucial role in this, but algorithms to automate the analysis of such data have been lagging. In the past year, some exciting breakthroughs were published in this area. The SMART 2.0 algorithm 40 is a convolutional neural network-based approach that automatically generates structure hypotheses from 1H-13C HSQC spectra. Other methods that aid in interpreting NMR spectra also appeared, including a classifier that assigns molecules to a natural product class based on 13C spectra 41 and the DP4-AI machine learning algorithm that aims to automate structure assignment from NMR spectra. 42 For the analysis of natural product mixtures (extracts or fractions), MixONat 43 provides a new tool for automated dereplication.

Developments in mass spectrometry data analysis
Within the realm of analytical techniques, the analysis of tandem mass-spectrometric (MS/MS) data has been revolutionized in recent years, and a range of groundbreaking new methods have been added in 2020. The ZODIAC algorithm 44 uses Gibbs sampling and Bayesian statistics to accurately predict molecular formulas for a compound by considering joint fragments and losses in fragmentation trees; CANOPUS 45 uses fragmentation spectra to automatically classify molecules into 2500 classes with deep learning; and Retip 46 provides a new way of predicting metabolite retention times from chemical structures. MetFID 47 provides a new neural network-based algorithm to predict compound fingerprints from MS/MS spectra, which aids the structural annotation of the underlying metabolites. With MASST, 48 a 'BLAST for molecules' was introduced that facilitates rapid similarity searches for MS/MS spectra and allows users to assess in which publicly available samples a metabolite of interest is present. Several new methods for and improvements to molecular networking technologies were also put forward, including Spec2Vec, 49 which uses natural language processing to identify similarities in a way that takes into account patterns observed across large datasets. Feature-based molecular networking (FBMN) 50 was also introduced in the Global Natural Products Social Molecular Networking (GNPS) infrastructure, which incorporates information from ion mobility separation. The GNPS framework was further improved to facilitate the analysis of gas chromatography-mass spectrometry data, 51 and the ReDU system makes it possible to straightforwardly re-analyze public MS/MS datasets by identifying them through a controlled vocabulary. 52 As an alternative to molecular networking, Qemistree facilitates analysing chemical diversity from MS data using hierarchical clustering. 53 Finally, several developments in databases of mass spectra are notable: BMDMS-NP 54 provides a comprehensive library of almost 3000 ESI-MS/MS spectra for plant natural products, while METLIN provides molecular standards for 850 000 metabolites, including many natural products. 55

Linking MS data to structures and gene clusters
Improvements have also been made for the analysis of specific types of molecules, such as peptides: with CycloNovo, 56 new software was released that enables high-throughput de novo sequencing of peptides from MS/MS data. In addition, NRPro 57 automatically annotates and dereplicates peptidic natural products based on their tandem mass spectra. Such peptides can be linked to BGCs with increasing effectiveness, with methods such as MetaMiner, 58 which matches genomically predicted peptides with their possible modifications to the monomers inferred from MS data. Connecting MS data to BGCs can also be done based on absence/presence correlations of molecules and gene clusters across strains, and the NPLinker framework provides the first full-fledged software that automates this, and also introduces a new scoring function. 59
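To make the idea of absence/presence correlation across strains concrete, the following Python sketch scores how well the strain distribution of a metabolite matches that of a gene cluster family. It is an illustrative assumption only and is not the scoring function introduced by NPLinker: the weights, the normalization by strain count, and all strain and metabolite names are hypothetical.

```python
def cooccurrence_score(gcf_strains, metabolite_strains, all_strains,
                       both=1.0, neither=1.0, mismatch=-1.0):
    """Average per-strain agreement between a gene cluster family (GCF)
    and a metabolite. The default weights are illustrative placeholders."""
    if not all_strains:
        raise ValueError("all_strains must not be empty")
    total = 0.0
    for strain in all_strains:
        has_gcf = strain in gcf_strains
        has_metabolite = strain in metabolite_strains
        if has_gcf and has_metabolite:
            total += both
        elif not has_gcf and not has_metabolite:
            total += neither
        else:
            total += mismatch
    return total / len(all_strains)


# Hypothetical toy data: which strains carry the GCF / produce each metabolite.
all_strains = {"s1", "s2", "s3", "s4", "s5"}
gcf_strains = {"s1", "s2", "s3"}
metabolites = {
    "metabolite_x": {"s1", "s2", "s3"},  # perfect co-occurrence
    "metabolite_y": {"s4", "s5"},        # perfectly anti-correlated
    "metabolite_z": {"s1", "s2"},        # one mismatch (s3)
}

if __name__ == "__main__":
    for name, strains in metabolites.items():
        print(name, cooccurrence_score(gcf_strains, strains, all_strains))
```

With few strains, high scores arise easily by chance, so a score of this kind is best treated as a way to prioritise candidate links rather than as evidence for them.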

Conclusions
Computational methods are becoming more and more ingrained in the day-to-day science of natural product researchers, and the speed with which new methods are introduced reflects this. Even outside the familiar realms of natural product bioinformatics outlined above, exciting new approaches are being introduced, including a new deep learning approach to predict antibiotic activities from chemical structures 60 and a computational approach to automatically plan efficient routes toward the total synthesis of natural products. 61 The year 2021 is likely to again provide a similar range of new approaches, and navigating the diversity of available algorithms will become an increasingly important skill for those who are trained in natural product science.

Conflicts of interest
M. H. M. is a co-founder of Design Pharmaceuticals and a member of the scientific advisory board of Hexagon Bio.