Joseph Redshaw,a Darren S. J. Ting,bef Alex Brown,c Jonathan D. Hirst*a and Thomas Gärtnerd
aSchool of Chemistry, University of Nottingham, University Park, Nottingham, NG7 2RD, UK. E-mail: jonathan.hirst@nottingham.ac.uk
bAcademic Ophthalmology, School of Medicine, University of Nottingham, Nottingham, NG7 2UH, UK
cArtificial Intelligence and Machine Learning, GSK Medicines Research Centre, Gunnels Wood Road, Stevenage, SG1 2NY, UK
dMachine Learning Group, TU Wien Informatics, Vienna, Austria
eAcademic Unit of Ophthalmology, Institute of Inflammation and Ageing, University of Birmingham, Birmingham, UK
fBirmingham and Midland Eye Centre, Birmingham, UK
First published on 27th February 2023
Antimicrobial peptides (AMPs) represent a potential solution to the growing problem of antimicrobial resistance, yet their identification through wet-lab experiments is a costly and time-consuming process. Accurate computational predictions would allow rapid in silico screening of candidate AMPs, thereby accelerating the discovery process. Kernel methods are a class of machine learning algorithms that utilise a kernel function to transform input data into a new representation. When appropriately normalised, the kernel function can be regarded as a notion of similarity between instances. However, many expressive notions of similarity are not valid kernel functions, meaning they cannot be used with standard kernel methods such as the support-vector machine (SVM). The Kreĭn-SVM is a generalisation of the standard SVM that admits a much larger class of similarity functions. In this study, we propose and develop Kreĭn-SVM models for AMP classification and prediction, employing the Levenshtein distance and local alignment score as sequence similarity functions. Utilising two datasets from the literature, each containing more than 3000 peptides, we train models to predict general antimicrobial activity. Our best models achieve an AUC of 0.967 and 0.863 on the test sets of the respective datasets, outperforming the in-house and literature baselines in both cases. We also curate a dataset of experimentally validated peptides, measured against Staphylococcus aureus and Pseudomonas aeruginosa, in order to evaluate the applicability of our methodology in predicting microbe-specific activity. In this case, our best models achieve an AUC of 0.982 and 0.891, respectively. Models to predict both general and microbe-specific activities are made available as web applications.
AMPs, also known as host defence peptides, are a class of evolutionarily conserved molecules that form an important component of the innate immune system.4–6 They usually comprise 12 to 50 amino acid residues and typically possess certain properties, including cationicity, 30–50% hydrophobicity, and amphiphilicity. They exhibit good antimicrobial activity against a broad range of bacteria, viruses, fungi, and parasites. In addition, they carry an inherently low risk of inducing antimicrobial resistance (AMR), largely attributed to their rapid membrane-permeabilising activity.4,7,8 Such broad-spectrum and rapid antimicrobial activity has prompted researchers to consider AMPs as a potential remedy to the growing problem of AMR, which is a major global health threat.9,10 Nonetheless, there has so far been little success in translating AMP-based therapy to clinical use, owing to challenges such as complex structure–activity relationships (SARs), drug toxicity, instability in the host and infective environment, and low financial incentives.11,12 Owing to the complex SAR and the costly and time-consuming wet-lab experiments associated with AMP investigations, many researchers have proposed computational approaches, including molecular dynamics (MD) simulations and machine learning (ML) algorithms, to accelerate the discovery and development of potential AMPs for clinical use.13–19
Several studies have highlighted the promise of ML algorithms in predicting antimicrobial activity, dissecting the complex SAR, and informing the drug design of AMPs.13–15 A wide range of ML algorithms have been utilised, including random forests,20 support vector machines (SVMs)20–24 and artificial neural networks.20–22,25,26 Many of these algorithms are used in combination with a carefully selected set of peptide features, which can be divided into two categories: compositional and physicochemical. The amino acid composition is the simplest compositional feature: a vector containing the counts of each amino acid in a given peptide. Various extensions exist, such as the reduced amino acid composition27 and the pseudo amino acid composition.28 When computing the reduced amino acid composition, a peptide is represented in a reduced alphabet in which similar amino acids are grouped together. The pseudo amino acid composition accounts for sequence-order information as well as composition, as the former is not considered in the standard amino acid composition. The set of physicochemical features includes peptide properties such as the charge, hydrophobicity and isoelectric point.20,24,29 These features are typically average values of the respective properties calculated over the length of the peptide.
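To make the compositional representation concrete, the following is a minimal sketch (our illustration, not code from the cited studies) of computing the amino acid composition of a peptide in Python; normalisation by peptide length is a common convention.

```python
from collections import Counter

# The 20 standard amino acids, one-letter codes
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def amino_acid_composition(peptide: str) -> list[float]:
    """Return the fraction of each standard amino acid in the peptide."""
    counts = Counter(peptide.upper())
    n = len(peptide)
    return [counts[aa] / n for aa in AMINO_ACIDS]

# Example: magainin 2, a well-characterised AMP
print(amino_acid_composition("GIGKFLHSAKKFGKAFVGEIMNS"))
```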
Classical sequence alignment algorithms, such as the Smith–Waterman30 and Needleman–Wunsch31 algorithms, are computationally intensive and do not scale well to large problems. Many papers have therefore advocated alignment-free methods for determining sequence similarity.19,32–35 The success of these endeavours notwithstanding, sequence alignment functions are effective notions of biological-sequence similarity that can reflect ancestral, structural or functional similarity, and therefore should not be overlooked. Several studies have utilised sequence alignment functions for AMP prediction. For example, Wang et al.36 and Ng et al.37 utilised the BLAST algorithm38 in a classification model by comparing the BLAST similarity scores of a query peptide to those of all peptides in the training set. Whilst these approaches led to accurate models, the BLAST algorithm is a heuristic that finds only approximately optimal alignments. This approximation makes it generally faster than the Smith–Waterman algorithm, which is one of the main reasons practitioners choose it. However, on the relatively small datasets of the aforementioned studies, it is interesting to consider whether the same approaches using the optimal alignment score would improve the models.
The SVM is a well-known ML algorithm for classification and can incorporate a kernel function in order to learn non-linear classification boundaries. The kernel function greatly influences the performance of the resulting classification model. When appropriately normalised, a kernel function can be regarded as a similarity function, and a useful kernel function should produce similarities that are relevant to the problem. However, many expressive notions of similarity are not valid kernel functions,39–42 in that they are indefinite, meaning they cannot be used with a standard SVM. Recent developments have alleviated this problem, allowing a much larger class of similarity functions to be used in conjunction with an SVM. Loosli et al.1 present an algorithm for learning an SVM with indefinite kernels; however, their approach relies on a method of stabilisation, meaning there is no guarantee of global optimality. In contrast, the Kreĭn-SVM3 is an algorithm for learning an SVM with indefinite kernels that is guaranteed to find a globally optimal solution.
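Whether a given similarity function yields a valid kernel can be checked empirically on a finite sample: the Gram matrix must be symmetric positive semi-definite (PSD). A minimal sketch of such a check, assuming the pairwise similarities have already been collected into a NumPy array:

```python
import numpy as np

def is_valid_kernel_matrix(K: np.ndarray, tol: float = 1e-8) -> bool:
    """Check symmetry and positive semi-definiteness of a Gram matrix."""
    if not np.allclose(K, K.T):
        return False
    # A symmetric matrix is PSD iff all its eigenvalues are non-negative
    return bool(np.linalg.eigvalsh(K).min() >= -tol)

# Gram matrices built from, e.g., negated Levenshtein distances typically
# have negative eigenvalues, i.e. they are indefinite.
```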
In this work, we utilised the Kreĭn-SVM algorithm to assess the effectiveness of sequence alignment functions for AMP classification. We performed an empirical comparison of both the local alignment score and Levenshtein distance43,44 on two AMP datasets from the literature. Furthermore, we tested the ability of our approach to detect activity against specific species on a dataset of experimentally validated peptides measured against both Staphylococcus aureus ATCC 29213 and Pseudomonas aeruginosa ATCC 27853, henceforth referred to as S. aureus ATCC 29213 and P. aeruginosa ATCC 27853. Our trained models are made available as web applications at http://comp.chem.nottingham.ac.uk/KreinAMP, for the prediction of both general and species-specific activities.
The decision surface associated with a hyperplane is inherently linear, which can be restrictive when the two classes are not linearly separable. This issue is mitigated through the use of a kernel function, which implicitly maps the instances into a new space. The space into which the instances are mapped is known as a reproducing kernel Hilbert space (RKHS), and every kernel function is uniquely associated with an RKHS. When incorporating a kernel function, the SVM finds the maximum-margin hyperplane in the associated RKHS, which can correspond to a non-linear decision surface in the original space. This surface allows greater flexibility when separating the two classes but also increases the possibility of overfitting; perfectly separating a noisy dataset is often indicative of overfitting. The soft-margin SVM helps alleviate this problem by allowing some instances to reside on the incorrect side of the hyperplane. Eqn (1) presents the optimisation problem that is solved when training a soft-margin SVM with L2 loss, commonly known as an L2-SVM. We opted for an L2-SVM as it places greater penalisation on instances residing on the incorrect side of the hyperplane.
$$\min_{f \in \mathcal{H},\, \xi} \; \sum_{i=1}^{n} \xi_i^2 + \lambda \lVert f \rVert_{\mathcal{H}}^2 \quad \text{subject to} \quad y_i f(x_i) \geq 1 - \xi_i, \quad i = 1, \ldots, n \qquad (1)$$

Here ℋ denotes the RKHS from which a solution is found. The solution is a function f ∈ ℋ and a vector of slack variables ξ which minimise the objective function, subject to the constraints. The constraint yᵢf(xᵢ) ≥ 1 − ξᵢ imposes that the i-th instance lies on the correct side of the hyperplane and that its distance to the hyperplane is greater than or equal to the margin. An instance xⱼ which does not satisfy this constraint contributes a value of ξⱼ² to the objective function, where ξⱼ is the distance between xⱼ and the margin. The Σᵢξᵢ² term in the objective function is therefore the total contribution of all instances which do not satisfy the constraints. The ‖f‖ℋ² term is the norm of the considered function and acts as a measure of complexity. The hyperparameter λ can be tuned to balance the number of instances that violate the constraints against the complexity of the considered function.
The Kreĭn-SVM solves an analogous optimisation problem in a reproducing kernel Kreĭn space (RKKS), presented in eqn (2).

$$\min_{f \in \mathcal{K},\, \xi} \; \sum_{i=1}^{n} \xi_i^2 + \lambda_+ \lVert f_+ \rVert_{\mathcal{H}_+}^2 + \lambda_- \lVert f_- \rVert_{\mathcal{H}_-}^2 \quad \text{subject to} \quad y_i f(x_i) \geq 1 - \xi_i, \quad i = 1, \ldots, n \qquad (2)$$

Here 𝒦 denotes the associated RKKS. Similarly to the SVM, the solution is a function f ∈ 𝒦 and a vector of slack variables ξ which minimise the objective function. The first term in the objective function, as well as the constraints, have the same interpretation as in eqn (1). The only notable difference from the SVM is the method of regularisation. Any RKKS 𝒦 can be expressed as a direct sum of the form 𝒦 = ℋ₊ ⊕ ℋ₋, where ℋ₊ and ℋ₋ are RKHSs. This means that any function f ∈ 𝒦 can be decomposed as f = f₊ ⊕ f₋, where f₊ ∈ ℋ₊ and f₋ ∈ ℋ₋. Hence, regularisation in eqn (2) is performed by separately regularising each component of the decomposition. The hyperparameters λ₊ and λ₋ can be tuned to balance the number of instances that violate the constraints against the regularisation of each decomposition component.
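The decomposition underlying this regularisation can be illustrated at the level of the Gram matrix. The sketch below (an illustration of the idea, not the Kreĭn-SVM solver itself) splits a symmetric indefinite similarity matrix into positive and negative spectral parts, K = K₊ − K₋, mirroring the decomposition of a function in 𝒦 into components in ℋ₊ and ℋ₋:

```python
import numpy as np

def krein_decomposition(K: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    """Split a symmetric indefinite matrix as K = K_plus - K_minus (both PSD)."""
    eigvals, eigvecs = np.linalg.eigh(K)
    K_plus = eigvecs @ np.diag(np.maximum(eigvals, 0.0)) @ eigvecs.T
    K_minus = eigvecs @ np.diag(np.maximum(-eigvals, 0.0)) @ eigvecs.T
    return K_plus, K_minus

# Sanity check on a small indefinite matrix (eigenvalues 3 and -1)
K = np.array([[1.0, 2.0], [2.0, 1.0]])
K_plus, K_minus = krein_decomposition(K)
assert np.allclose(K, K_plus - K_minus)
```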
Definition 2.1. Let s ∈ Σn and t ∈ Σm be strings over an alphabet Σ, and let Σg = Σ ∪ {“−”} denote the gapped alphabet, where “−” is the gap character. A global alignment α(s, t) = {s′, t′} of s and t is a pair of strings satisfying:

(1) s′ and t′ are strings over Σg,

(2) |s′| = |t′| = l, such that max(n, m) ≤ l ≤ m + n,
(3) The subsequence of s′ obtained by removing all gap characters is equal to s,
(4) The subsequence of t′ obtained by removing all gap characters is equal to t,
Definition 2.1 provides a formal definition of global alignment. Whilst many possible alignments exist for two strings, the goal of sequence alignment is to find an alignment that optimises some criterion. A scoring function, presented in Definition 2.2, can be used to quantify the “appropriateness” of an alignment. An optimal global alignment is then one which is optimal with respect to a given scoring function, as shown in Definition 2.3.
Definition 2.2. Let p: Σg × Σg → ℝ be a function defined over the elements of Σg. We say p is a similarity scoring function over Σg if, for all x, y ∈ Σ, we have
(1) p(x, x) > 0,
(2) p(x, x) > p(x, y) for x ≠ y,
(3) p(x, y) = p(y, x),
(4) p(x, “−”) ≤ 0,
(5) p(“−”, “−”) = −∞.
Similarly, we say p is a distance scoring function over Σg if, for all x, y, z ∈ Σ, we have
(1) p(x, x) = 0,
(2) p(x, y) > 0 for x ≠ y,
(3) p(x, y) = p(y, x),
(4) p(x, “−”) > 0,
(5) p(x, z) ≤ p(x, y) + p(y, z),
(6) p(“−”, “−”) = ∞.
Definition 2.3. Let p be a scoring function over Σg and let 𝒜(s, t) be the space of all valid alignments of s and t. The score $\mathrm{Score}_p(\alpha(s, t))$ of an alignment α(s, t) = {s′, t′} ∈ 𝒜(s, t) with respect to the scoring function p is defined as

$$\mathrm{Score}_p(\alpha(s, t)) = \sum_{i=1}^{l} p(s'_i, t'_i).$$

An optimal global alignment α*(s, t) is one that maximises this score over 𝒜(s, t) when p is a similarity scoring function, or minimises it when p is a distance scoring function.

Since an alignment is optimal only with respect to a given scoring function, it is natural to consider which scoring function to use in order to obtain the most meaningful alignments. In the context of biological sequences, researchers have been considering this problem for many years, and a number of families of scoring matrices have been designed to encode useful notions of similarity. In this work, we only considered the BLOSUM62 scoring matrix,46 as it is a standard choice when performing sequence alignment.
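As an illustration of alignment scoring with BLOSUM62, the following sketch computes an optimal global alignment score with Biopython's pairwise aligner; the gap penalties shown are illustrative choices on our part, not values prescribed by the definitions above.

```python
from Bio import Align
from Bio.Align import substitution_matrices

aligner = Align.PairwiseAligner()
aligner.mode = "global"  # Needleman-Wunsch-style alignment
aligner.substitution_matrix = substitution_matrices.load("BLOSUM62")
aligner.open_gap_score = -10.0   # illustrative gap-opening penalty
aligner.extend_gap_score = -0.5  # illustrative gap-extension penalty

# Optimal global alignment score of two short peptide sequences
print(aligner.score("GIGKFLHSAKKF", "GIGKFLKSAKKF"))
```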
In order to define more complex transformations, one can consider the consecutive application of a sequence of elementary edit operations. Of special interest for the Levenshtein distance are those sequences of operations that transform one string into another. Such a sequence is known as an edit path, and its length is defined as the number of operations in the sequence. The Levenshtein distance between two strings is the length of a shortest edit path between them, as formalised in Definition 2.5.
Definition 2.5. Let 𝒫(s, t) be the space of all edit paths from s to t. The Levenshtein distance dL(s, t) between s and t is defined as

$$d_{\mathrm{L}}(s, t) = \min_{\pi \in \mathcal{P}(s, t)} |\pi|.$$

Definition 2.5 shares some interesting similarities with Definition 2.3: both are combinatorial optimisation problems and, indeed, the Levenshtein distance can be realised as a special case of global alignment. More specifically, consider the distance scoring function p: Σg × Σg → {0, 1} defined as

$$p(x, y) = \begin{cases} 0, & x = y, \\ 1, & x \neq y. \end{cases}$$
For two strings s ∈ Σn and t ∈ Σm, let their optimal global alignment with respect to p be α*(s, t) = {s′, t′}, where |s′| = |t′| = l. Define the set U as

$$U = \{\, i \in \{1, \ldots, l\} : s'_i \neq t'_i \,\}.$$

Then the score $\mathrm{Score}_p(\alpha^{*}(s, t))$ of α*(s, t) with respect to p is equal to the cardinality of U, which is exactly the Levenshtein distance between s and t.
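For completeness, a minimal dynamic-programming sketch of the Levenshtein distance, using unit costs for insertion, deletion and substitution, in line with the scoring function above:

```python
def levenshtein(s: str, t: str) -> int:
    """Levenshtein distance via the standard O(nm) dynamic programme."""
    n, m = len(s), len(t)
    prev = list(range(m + 1))  # distances from the empty prefix of s
    for i in range(1, n + 1):
        curr = [i] + [0] * m
        for j in range(1, m + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            curr[j] = min(prev[j] + 1,         # deletion
                          curr[j - 1] + 1,     # insertion
                          prev[j - 1] + cost)  # substitution or match
        prev = curr
    return prev[m]

assert levenshtein("kitten", "sitting") == 3
```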
Definition 2.6. Let p be a similarity scoring function. For a string s ∈ Σn, let 𝕀s be the space of all index sets such that for any Is ∈ 𝕀s, s[Is] is a valid substring of s. Similarly, for a string t ∈ Σm, let 𝕀t be the space of all index sets such that for any It ∈ 𝕀t, t[It] is a valid substring of t. For any Is ∈ 𝕀s and It ∈ 𝕀t, denote by α*(s[Is], t[It]) the optimal global alignment of s[Is] and t[It] with respect to p. The optimal local alignment of s and t with respect to p is defined as

$$\alpha^{*}_{\mathrm{loc}}(s, t) = \operatorname*{arg\,max}_{I_s \in \mathbb{I}_s,\ I_t \in \mathbb{I}_t} \mathrm{Score}_p\!\big(\alpha^{*}(s[I_s], t[I_t])\big) \qquad (3)$$
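In practice, optimal local alignments are computed efficiently with the Smith–Waterman algorithm.30 The sketch below (our illustration; the normalisation shown is one common choice, not necessarily the exact kernel used in this work) assembles a normalised local-alignment similarity matrix of the kind that can be passed to a Kreĭn-SVM as a precomputed, possibly indefinite, kernel:

```python
import numpy as np
from Bio import Align
from Bio.Align import substitution_matrices

aligner = Align.PairwiseAligner()
aligner.mode = "local"  # Smith-Waterman-style local alignment
aligner.substitution_matrix = substitution_matrices.load("BLOSUM62")
aligner.open_gap_score = -10.0   # illustrative gap penalties
aligner.extend_gap_score = -0.5

def local_alignment_gram(peptides: list[str]) -> np.ndarray:
    """Pairwise local alignment scores, normalised by self-similarity."""
    n = len(peptides)
    K = np.zeros((n, n))
    for i in range(n):
        for j in range(i, n):
            K[i, j] = K[j, i] = aligner.score(peptides[i], peptides[j])
    d = np.sqrt(np.diag(K))
    return K / np.outer(d, d)  # k(s, t) / sqrt(k(s, s) * k(t, t))
```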
We evaluated our models using the accuracy, the Matthews correlation coefficient (MCC) and the area under the receiver operating characteristic curve (AUC). The MCC is defined as

$$\mathrm{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}, \qquad (4)$$

where TP, TN, FP and FN denote the numbers of true positives, true negatives, false positives and false negatives, respectively.
The AUC is defined as the probability that a classifier will score a randomly chosen positive instance higher than a randomly chosen negative instance.52 In order to allow a fair comparison, all models used the same training and test splits. The optimal hyperparameters were selected by performing an exhaustive grid search over the training set, using 10-fold cross validation. The λ hyperparameter of the SVM algorithm, as well as the λ+ and λ− hyperparameters of the Kreĭn-SVM algorithm, were selected from {0.01, 0.1, 1, 10, 100}. The Levenshtein distance and amino acid composition kernel have no hyperparameters; we used the default values for the hyperparameters of the local alignment score. The gapped k-mer kernel has two hyperparameters, g and m, and is quite sensitive to their values. The optimal value of g was selected from {1, 2, 3, 4, 5} and the optimal value of m from {1, 2, 3, …, 10}. Since g > m is required, only valid combinations of the two were considered. In our nested cross-validation experiments, we used 10 inner and 10 outer folds, and the reported results are averaged over the outer-fold test sets.
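As an illustration of this protocol with a precomputed sequence kernel, the sketch below grid-searches the regularisation strength with 10-fold cross-validation. It uses scikit-learn's standard SVM as a stand-in, since the Kreĭn-SVM is not part of scikit-learn; note that C plays the role of an inverse regularisation strength relative to λ in eqn (1).

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

def tune_svm(K_train: np.ndarray, y: np.ndarray) -> GridSearchCV:
    """10-fold grid search over the regularisation hyperparameter."""
    grid = {"C": [0.01, 0.1, 1, 10, 100]}
    search = GridSearchCV(SVC(kernel="precomputed"), grid,
                          cv=10, scoring="roc_auc")
    search.fit(K_train, y)  # K_train: (n, n) precomputed kernel matrix
    return search
```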
Each dataset has a 50 : 50 ratio of AMPs to non-AMPs, allowing us to avoid issues that result from class imbalance. Associated with each dataset is a specific test set, and reporting results on this set allows comparison with the authors' proposed models. Despite being of similar size, one major differentiating factor between the two datasets is the length of the peptides. Fig. 1 displays the empirical distribution of peptide lengths for both datasets, partitioned by peptide classification. In both cases, the distributions corresponding to AMPs and non-AMPs are very similar. However, the distributions across datasets are clearly very different. The Kolmogorov–Smirnov two-sample test53 provides evidence to reject the null hypothesis that the peptide length distributions of the AMPScan and DeepAMP datasets are identical. DeepAMP contains generally shorter peptides than AMPScan. Indeed, the DeepAMP dataset was curated because short AMPs have been shown to exhibit enhanced activity, lower toxicity and higher stability compared with their longer counterparts.54,55 More importantly, synthesis is cheaper for short AMPs than for full-length AMPs, which increases the potential for clinical translation and commercialisation.12 Fig. 2 displays the empirical amino acid distributions for both datasets, indicating their similarity.
Vishnepolsky et al.56 have shown that, given an appropriate training dataset, predictive models of peptide activity against specific species can be constructed. This involves training a separate model for each species of interest, in their case using the DBSCAN algorithm.57 Their model predicting activity against E. coli ATCC 25922 achieved a balanced accuracy of 0.79, greater than that of a number of common AMP prediction tools.56 Furthermore, models to predict activity against S. aureus ATCC 25923 and P. aeruginosa ATCC 27853 were made publicly available as web-tools.
We followed the methodology of Vishnepolsky et al. to construct suitable training datasets for our problem, using the Database of Antimicrobial Activity and Structure of Peptides (DBAASP) as a source of data. DBAASP contains peptide activity measurements against a wide range of species,58 including those of interest to us. We extracted from DBAASP all peptides with activity measured against S. aureus ATCC 29213, S. aureus ATCC 25923 or P. aeruginosa ATCC 27853, subject to the following conditions: (i) peptide length in the range [6, 18], (ii) no intrachain bonds, (iii) no non-standard amino acids and (iv) MIC measured in μg mL−1 or μM. Condition (i) was imposed as this is the range of peptide lengths in our external test set. Conditions (ii) and (iii) were imposed following the recommendations of Vishnepolsky et al., and condition (iv) was imposed as conversion from μM to μg mL−1 is possible by estimating the molecular weight of a given sequence. Since no web-tool to predict activity against S. aureus ATCC 29213 was available, we could not directly compare our results. Instead, we also collected data for peptides active against S. aureus ATCC 25923, which allowed us to compare our models with the state of the art provided by Vishnepolsky et al.
Three separate datasets of peptides with activity measured against S. aureus ATCC 29213, S. aureus ATCC 25923 and P. aeruginosa ATCC 27853 were created using the data collected from DBAASP. We refer to these datasets as SA29213, SA25923 and PA27853, respectively. The peptides in each dataset were labelled according to their activity against the specific strain: each dataset comprises highly active peptides (MIC ≤ 25 μg mL−1) and inactive peptides (MIC ≥ 100 μg mL−1). A peptide with 25 μg mL−1 < MIC < 100 μg mL−1 was not included in the training datasets. This large interval allows us to account for experimental errors, which in turn increases the confidence in our class labels. In the case that a peptide was associated with multiple activity measurements, the median value was taken to represent its activity. As shown in Table 1, the three training datasets are all relatively small and contain slightly more active peptides than inactive peptides.
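A minimal sketch of this labelling scheme, assuming the DBAASP records have been flattened into a table with hypothetical `sequence` and `mic_ug_ml` columns (MIC values reported in μM can first be converted via MIC[μg mL−1] = MIC[μM] × MW[g mol−1]/1000):

```python
import pandas as pd

def label_peptides(df: pd.DataFrame) -> pd.DataFrame:
    """Median-aggregate MIC per peptide, then apply the activity thresholds."""
    mic = df.groupby("sequence")["mic_ug_ml"].median()
    active = mic[mic <= 25].index     # highly active: MIC <= 25 ug/mL
    inactive = mic[mic >= 100].index  # inactive: MIC >= 100 ug/mL
    # Peptides with 25 < MIC < 100 ug/mL are excluded from training
    return pd.concat([
        pd.DataFrame({"sequence": active, "label": 1}),
        pd.DataFrame({"sequence": inactive, "label": 0}),
    ], ignore_index=True)
```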
Table 1 Size and class ratio (proportion of active peptides) of the species-specific training datasets

| Dataset | Size | Class ratio |
|---|---|---|
| SA29213 | 463 | 0.644 |
| SA25923 | 808 | 0.646 |
| PA27853 | 686 | 0.547 |
Table 2 Nested cross-validation results on the AMPScan and DeepAMP datasets; values are mean (standard deviation) over the outer-fold test sets

| Model | AMPScan Accuracy | AMPScan AUC | AMPScan MCC | DeepAMP Accuracy | DeepAMP AUC | DeepAMP MCC |
|---|---|---|---|---|---|---|
| LA-KSVM | 0.920 (0.017) | 0.969 (0.006) | 0.842 (0.033) | 0.760 (0.025) | 0.821 (0.028) | 0.522 (0.051) |
| LEV-KSVM | 0.910 (0.021) | 0.966 (0.010) | 0.821 (0.042) | 0.756 (0.032) | 0.819 (0.029) | 0.513 (0.063) |
| GKM-SVM | 0.899 (0.015) | 0.957 (0.007) | 0.799 (0.030) | 0.751 (0.032) | 0.817 (0.029) | 0.506 (0.066) |
| AAC-SVM | 0.865 (0.023) | 0.930 (0.009) | 0.732 (0.044) | 0.723 (0.031) | 0.784 (0.035) | 0.447 (0.061) |
Table 3 Results on the held-out test sets of the AMPScan and DeepAMP datasets

| Model | AMPScan Accuracy | AMPScan AUC | AMPScan MCC | DeepAMP Accuracy | DeepAMP AUC | DeepAMP MCC |
|---|---|---|---|---|---|---|
| LA-KSVM | 0.911 | 0.967 | 0.823 | 0.761 | 0.863 | 0.523 |
| LEV-KSVM | 0.904 | 0.960 | 0.809 | 0.798 | 0.860 | 0.596 |
| GKM-SVM | 0.900 | 0.954 | 0.801 | 0.782 | 0.838 | 0.564 |
| AAC-SVM | 0.870 | 0.929 | 0.742 | 0.771 | 0.853 | 0.543 |
| Literature | 0.910 | 0.965 | 0.820 | 0.771 | 0.853 | 0.543 |
Table 4 Accuracy on the external test set for models trained on the general AMP datasets

| Model | Training dataset | S. aureus | P. aeruginosa |
|---|---|---|---|
| LA-KSVM | AMPScan | 0.312 | 0.312 |
| LA-KSVM | DeepAMP | 0.250 | 0.250 |
| LEV-KSVM | AMPScan | 0.312 | 0.312 |
| LEV-KSVM | DeepAMP | 0.312 | 0.312 |
| GKM-SVM | AMPScan | 0.312 | 0.312 |
| GKM-SVM | DeepAMP | 0.375 | 0.375 |
| AAC-SVM | AMPScan | 0.312 | 0.312 |
| AAC-SVM | DeepAMP | 0.312 | 0.312 |
| Literature | AMPScan | 0.250 | 0.250 |
| Literature | DeepAMP | 0.438 | 0.438 |
Table 5 displays the performance on the external test set for models trained on the species-specific datasets. We also present the performance of the web-tools provided by Vishnepolsky et al.56 As mentioned in Section 2.3.3, at the time of publication there was no web-tool to predict activity against S. aureus ATCC 29213. Hence, we also provide results for models trained on SA25923 and tasked with predicting activity against S. aureus ATCC 29213. There is clearly a general improvement over the models trained on the AMPScan and DeepAMP datasets, indicating that the species-specific models have much greater discriminative power. On the SA29213 dataset, the baseline SVM with amino acid composition kernel is the most predictive with respect to all the metrics; all remaining models achieve the same accuracy and MCC. Considering the SA25923 dataset, the Levenshtein distance produces a model with the same accuracy and MCC as the web-tool but with a larger AUC. It also achieves the same accuracy and AUC as the baseline SVM with amino acid composition kernel, but with a larger MCC. It is interesting to note that the models trained on SA25923 can still make accurate predictions for S. aureus ATCC 29213. Whilst these are two different strains, the findings suggest that the antimicrobial susceptibility to the AMPs is similar for both strains, implying that similar mechanisms operate within the same species. Considering the PA27853 dataset, the baseline SVM with amino acid composition kernel performs the best against all metrics. We find that the local alignment score, Levenshtein distance and baseline SVM with gapped k-mer kernel produce equally accurate models, all of which are more accurate than the web-tool. However, the AUCs of the local alignment score and Levenshtein distance models are considerably higher than that of the baseline SVM with gapped k-mer kernel. Whilst it is difficult to draw strong conclusions from such a small dataset, it is encouraging that our models achieve accuracy similar to both the baseline models and the web-tools.
Table 5 Performance on the external test set for models trained on the species-specific datasets

| Model | Metric | SA29213 | SA25923 | PA27853 |
|---|---|---|---|---|
| LA-KSVM | Accuracy | 0.688 | 0.688 | 0.750 |
| LA-KSVM | AUC | 0.873 | 0.909 | 0.891 |
| LA-KSVM | MCC | 0.522 | 0.405 | 0.595 |
| LEV-KSVM | Accuracy | 0.688 | 0.875 | 0.750 |
| LEV-KSVM | AUC | 0.927 | 0.982 | 0.855 |
| LEV-KSVM | MCC | 0.522 | 0.764 | 0.595 |
| GKM-SVM | Accuracy | 0.688 | 0.688 | 0.750 |
| GKM-SVM | AUC | 0.945 | 0.891 | 0.655 |
| GKM-SVM | MCC | 0.522 | 0.405 | 0.389 |
| AAC-SVM | Accuracy | 0.875 | 0.875 | 0.875 |
| AAC-SVM | AUC | 0.964 | 0.982 | 0.964 |
| AAC-SVM | MCC | 0.709 | 0.709 | 0.764 |
| DBAASP Web-tool | Accuracy | — | 0.875 | 0.688 |
| DBAASP Web-tool | AUC | — | 0.945 | 0.718 |
| DBAASP Web-tool | MCC | — | 0.764 | 0.522 |
As the chemical space of natural peptides is extremely large, the development of highly accurate classifiers will help accelerate the discovery and development of novel AMPs designed de novo. In addition, the promising results generated from this study open a number of possible avenues for further work. Our evaluation of species-specific activity prediction could be strengthened by a larger external test set, allowing us to draw firmer conclusions. Furthermore, the methodology we have presented is not specific to the classification of AMPs; we suspect that the Kreĭn-SVM coupled with sequence alignment functions could be applied to other biological-sequence classification tasks. Our computational findings demonstrate not only the feasibility of the proposed approach but, more generally, the utility of the Kreĭn-SVM as a classification algorithm. Its use of indefinite kernel functions allows practitioners to learn from domain-specific similarity functions without the concern of verifying the positive-definite assumption, which is often well beyond the practitioner's expertise. Whilst we have explored its use in combination with sequence alignment functions, there exist many more indefinite kernel functions with which the Kreĭn-SVM could be combined. Furthermore, the theoretical insight of separately regularising the decomposition components of a function in a Kreĭn space could be applied to develop other indefinite kernel-based learning algorithms. A notable example that relates to the current study would be to extend the One-Class SVM60 to incorporate indefinite kernel functions. The One-Class SVM is a kernel-based learning algorithm that performs anomaly detection61 and has previously been applied to identify the domain of applicability of virtual screening models.62 An indefinite kernel extension of the One-Class SVM could be directly applied to estimate the domain of applicability of our models.
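To indicate the shape of such an applicability-domain check with existing tools, the sketch below uses scikit-learn's standard One-Class SVM with a precomputed (positive-definite) kernel; the indefinite-kernel extension proposed above would take the place of this estimator.

```python
import numpy as np
from sklearn.svm import OneClassSVM

def applicability_domain(K_train: np.ndarray, K_query: np.ndarray) -> np.ndarray:
    """Flag query peptides as inside (+1) or outside (-1) the training domain.

    K_train: (n, n) kernel matrix among training peptides.
    K_query: (m, n) kernel values between query and training peptides.
    """
    ocsvm = OneClassSVM(kernel="precomputed", nu=0.05)  # nu bounds outlier fraction
    ocsvm.fit(K_train)
    return ocsvm.predict(K_query)
```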
This journal is © The Royal Society of Chemistry 2023