Krein support vector machine classification of antimicrobial peptides

Antimicrobial peptides (AMPs) represent a potential solution to the growing problem of antimicrobial resistance, yet their identification through wet-lab experiments is a costly and time-consuming process. Accurate computational predictions would allow rapid in silico screening of candidate AMPs, thereby accelerating the discovery process. Kernel methods are a class of machine learning algorithms that utilise a kernel function to transform input data into a new representation. When appropriately normalised, the kernel function can be regarded as a notion of similarity between instances. However, many expressive notions of similarity are not valid kernel functions, meaning they cannot be used with standard kernel methods such as the support-vector machine (SVM). The Kreĭn-SVM represents a generalisation of the standard SVM that admits a much larger class of similarity functions. In this study, we propose and develop Kreĭn-SVM models for AMP classification and prediction by employing the Levenshtein distance and local alignment score as sequence similarity functions. Utilising two datasets from the literature, each containing more than 3000 peptides, we train models to predict general antimicrobial activity. Our best models achieve an AUC of 0.967 and 0.863 on the test sets of each respective dataset, outperforming the in-house and literature baselines in both cases. We also curate a dataset of experimentally validated peptides, measured against Staphylococcus aureus and Pseudomonas aeruginosa, in order to evaluate the applicability of our methodology in predicting microbe-specific activity. In this case, our best models achieve an AUC of 0.982 and 0.891, respectively. Models to predict both general and microbe-specific activities are made available as web applications.


Introduction
Kernel methods are a class of machine learning algorithms that incorporate a kernel function in order to model non-linear relationships. Standard kernel methods assume that a given kernel function is positive-definite. Those kernel functions which do not satisfy this assumption are known as indefinite kernels. The assumption of positive-definiteness is restrictive, as it limits the number of applicable functions. [3] Leveraging these developments, we study the effectiveness of learning with established sequence-similarity functions for the classification of antimicrobial peptides (AMPs) based on their amino acid sequences. We evaluate the ability of the proposed methodology to predict general antimicrobial activity, as well as antimicrobial activity against specific species.
[5][6] These molecules are usually made of 12 to 50 amino acid residues, and they typically possess certain properties, including cationicity, 30-50% hydrophobicity, and amphiphilicity. They exhibit good antimicrobial activity against a broad range of bacteria, viruses, fungi, and parasites. In addition, they have an inherently low risk of developing antimicrobial resistance (AMR), largely attributed to their underlying rapid membrane-permeabilising activity. 4,7,8 Such broad-spectrum and rapid antimicrobial activity has prompted researchers to consider AMPs as a potential remedy to the growing problem of AMR, which is a major global health threat. 9,10 Nonetheless, there has so far been a lack of success in translating AMP-based therapy to clinical use, due to challenges such as complex structure-activity relationships (SAR), drug toxicity, instability in the host and infective environment, and low financial incentives. 11,12

25,26 Many of these algorithms are used in combination with a carefully selected set of peptide features, which can be divided into two categories: compositional and physicochemical. The amino acid composition is the simplest example of a compositional feature: it is a vector containing the count of each amino acid in a given peptide. There are various extensions, such as the reduced amino acid composition 27 and the pseudo amino acid composition. 28 When computing the reduced amino acid composition, a peptide is represented in a reduced alphabet in which similar amino acids are grouped together. The pseudo amino acid composition accounts for composition as well as sequence-order information, which is not considered in the standard amino acid composition. The set of physicochemical features includes peptide properties such as the charge, hydrophobicity and isoelectric point. 20,24,29 These features are typically average values of the respective properties calculated over the length of the peptide.
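As a minimal sketch of the amino acid composition described above, the snippet below counts the 20 standard residues in a peptide and takes the inner product of two such count vectors. The function names and the fixed residue ordering are illustrative assumptions, not part of any published tool.

```python
from collections import Counter

# Assumed ordering of the 20 standard amino acids (one-letter codes).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def amino_acid_composition(peptide: str) -> list[int]:
    """Return the count of each standard amino acid in the peptide."""
    counts = Counter(peptide.upper())
    return [counts.get(aa, 0) for aa in AMINO_ACIDS]


def composition_kernel(p: str, q: str) -> int:
    """Inner product of two amino acid composition vectors."""
    vp = amino_acid_composition(p)
    vq = amino_acid_composition(q)
    return sum(a * b for a, b in zip(vp, vq))
```

Because the representation only counts residues, any two peptides that are permutations of one another receive identical feature vectors, which is the sequence-order limitation that the pseudo amino acid composition is intended to address.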
Classical sequence alignment algorithms, such as the Smith-Waterman 30 and Needleman-Wunsch 31 algorithms, are computationally intensive and do not scale well to large problems. [33][34][35] The success of these endeavours notwithstanding, sequence alignment functions are effective notions of biological-sequence similarity that can reflect ancestral, structural or functional similarity and therefore should not be overlooked. Several studies have utilised sequence alignment functions for AMP prediction. For example, Wang et al. 36 and Ng et al. 37 utilised the BLAST algorithm 38 in a classification model by comparing the BLAST similarity scores of a query peptide to all those in the training set. Whilst these approaches led to accurate models, the BLAST algorithm is a heuristic method that finds only approximate optimal alignments. This approximation leads to generally faster results than what could be obtained by the Smith-Waterman algorithm, and it is one of the main reasons practitioners choose to use it. However, on the relatively small datasets in the aforementioned studies, it is interesting to consider whether the same approaches using the optimal alignment score would improve the models.
The SVM is a well-known ML algorithm for classification and can incorporate a kernel function in order to learn non-linear classification boundaries. The kernel function greatly influences the performance of the resulting classification model. When appropriately normalised, a kernel function can be regarded as a similarity function. A useful kernel function should produce similarities that are relevant to the problem. Many expressive notions of similarity are not valid kernel functions, [39][40][41][42] in that they are indefinite, meaning they cannot be used with an SVM. Recent developments have now alleviated this problem, facilitating a much larger class of similarity functions to be used in conjunction with an SVM. Loosli et al. 1 present an algorithm for learning an SVM with indefinite kernels. Their approach relies on a method of stabilisation, meaning there is no guarantee of global optimality. On the other hand, the Kreĭn-SVM 3 is an algorithm for learning an SVM with indefinite kernels that is guaranteed to find a globally optimal solution.
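To illustrate why an intuitive similarity function can fail to be a valid kernel, the toy example below builds a hypothetical 3 × 3 similarity matrix in which A is similar to B, B is similar to C, but A and C are entirely dissimilar. The matrix values and the test vector are made up for illustration; the point is only that a symmetric similarity matrix with unit diagonal can have a negative quadratic form, and is then not positive semi-definite.

```python
import math


def quadratic_form(K, c):
    """Return c^T K c for a square matrix K (list of rows) and a vector c."""
    n = len(c)
    return sum(c[i] * K[i][j] * c[j] for i in range(n) for j in range(n))


# Hypothetical similarities between three items A, B and C.
K = [
    [1.0, 0.9, 0.0],
    [0.9, 1.0, 0.9],
    [0.0, 0.9, 1.0],
]

# A positive semi-definite matrix satisfies c^T K c >= 0 for every c.
c = [1.0, -math.sqrt(2.0), 1.0]
value = quadratic_form(K, c)  # equals 4 - 3.6 * sqrt(2), which is negative
```

A single vector with a negative quadratic form is enough to show that the matrix is indefinite, so a standard SVM solver that relies on positive semi-definiteness would no longer be solving a convex problem.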
In this work, we utilised the Kreĭn-SVM algorithm to assess the effectiveness of sequence alignment functions for AMP classification. We performed an empirical comparison of both the local alignment score and Levenshtein distance 43,44 on two AMP datasets from the literature. Furthermore, we tested the ability of our approach to detect activity against specific species on a dataset of experimentally validated peptides measured against both Staphylococcus aureus ATCC 29213 and Pseudomonas aeruginosa ATCC 27853, henceforth referred to as S. aureus ATCC 29213 and P. aeruginosa ATCC 27853. Our trained models are made available as web applications at http://comp.chem.nottingham.ac.uk/KreinAMP, for the prediction of both general and species-specific activities.

Methods
AMP prediction models were developed using the Kreĭn-SVM algorithm in conjunction with sequence alignment functions, which we formally define in this section. We initially describe the more familiar SVM, before moving on to the Kreĭn-SVM. We then define the Levenshtein distance 44 and the local alignment score. Finally, we describe our computational and microbiological methodology.

The decision surface associated with a hyperplane is inherently linear, which can be restrictive when the two classes are not linearly separable. This issue is mitigated through the use of a kernel function, which implicitly maps the instances into a new space. The space into which the instances are mapped is known as a reproducing kernel Hilbert space (RKHS), and every kernel function is uniquely associated with an RKHS. When incorporating a kernel function, the SVM finds the maximum-margin hyperplane in the associated RKHS and this can correspond to a non-linear decision surface in the original space. This surface allows greater flexibility when separating the two classes but also increases the possibility of overfitting. Perfectly separating a noisy dataset is often indicative of overfitting. The Soft Margin SVM helps alleviate this problem by allowing some instances to reside on the incorrect side of the hyperplane. Eqn (1) presents the optimisation problem that is solved when training a Soft Margin SVM with L2 loss, commonly known as an L2-SVM. We opted for an L2-SVM as greater penalisation is placed on instances residing on the incorrect side of the hyperplane.
In eqn (1), we denote by $x_i$ the i-th training instance and by $y_i$ its corresponding label. Furthermore, $\mathcal{H}$ denotes the RKHS from which a solution is found. The solution is a function $f \in \mathcal{H}$ and a vector of slack variables $\xi$ which minimise the objective function, subject to the constraints. The constraint $y_i f(x_i) \ge 1 - \xi_i$ imposes that the i-th instance lies on the correct side of the hyperplane and that its distance to the hyperplane is greater than or equal to the margin. An instance $x_j$ which does not satisfy this constraint contributes a value of $\xi_j^2$ to the objective function, where $\xi_j$ is the distance between $x_j$ and the margin. The $\sum_{i=1}^{n} \xi_i^2$ term in the objective function is the total contribution of all instances which do not satisfy the constraints. The $\lVert f \rVert_{\mathcal{H}}^2$ term is the norm of the considered function and acts as a measure of complexity. The hyperparameter $\lambda$ can be tuned to provide a balance between the number of instances that violate the constraints and the complexity of the considered function.
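Collecting the terms described above, an L2-SVM problem of this kind can be sketched as follows. This is a reconstruction from the term-by-term description only; the absence of additional scaling constants (such as a factor of 1/n or 1/2) is an assumption.

```latex
\begin{aligned}
\min_{f \in \mathcal{H},\ \xi \in \mathbb{R}^n} \quad
  & \sum_{i=1}^{n} \xi_i^2 + \lambda \lVert f \rVert_{\mathcal{H}}^2 \\
\text{subject to} \quad
  & y_i f(x_i) \ge 1 - \xi_i, \qquad i = 1, \dots, n .
\end{aligned}
```

In this form, a larger $\lambda$ favours simpler functions at the cost of more margin violations, whereas a smaller $\lambda$ allows the decision surface to follow the training data more closely.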
2.1.2 Kreĭn-SVM. The Kreĭn-SVM is a classification algorithm that is defined to incorporate a much broader class of kernel functions, known as indefinite kernel functions. Similarly to a standard kernel function, an indefinite kernel function implicitly maps instances into a new space. However, the space associated with an indefinite kernel function is known as a reproducing kernel Kreĭn space (RKKS). Whilst operating in different spaces, the SVM and Kreĭn-SVM are conceptually similar. Both algorithms incorporate a kernel function, find the maximum-margin hyperplane in the associated space and are capable of learning non-linear decision surfaces.
Eqn (2) presents the optimisation problem that is solved when training the Kreĭn-SVM with L2 loss. We denote by $\mathcal{K}$ the associated RKKS. Similarly to the SVM, the solution is a function $f \in \mathcal{K}$ and a vector of slack variables $\xi$ which minimise the objective function. The first term in the objective function, as well as the constraints, have the same interpretation as in eqn (1). The only notable difference to the SVM is the method of regularisation. Any RKKS $\mathcal{K}$ can be expressed as a direct sum of the form $\mathcal{K} = \mathcal{H}_+ \oplus \mathcal{H}_-$, where $\mathcal{H}_\pm$ are RKHSs. This means that any function $f \in \mathcal{K}$ can be decomposed as $f = f_+ \oplus f_-$, where $f_\pm \in \mathcal{H}_\pm$. Hence, regularisation in eqn (2) is performed by separately regularising each component of the decomposition. The hyperparameters $\lambda_\pm$ can be tuned to provide a balance between the number of instances that violate the constraints and the regularisation of each decomposition component.
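Following the description above, a Kreĭn-SVM problem with L2 loss can be sketched as below. As with the L2-SVM, this is a reconstruction from the prose: the loss term and constraints mirror eqn (1), each component of the decomposition receives its own regulariser, and the exact scaling of the terms is an assumption.

```latex
\begin{aligned}
\min_{f \in \mathcal{K},\ \xi \in \mathbb{R}^n} \quad
  & \sum_{i=1}^{n} \xi_i^2
    + \lambda_+ \lVert f_+ \rVert_{\mathcal{H}_+}^2
    + \lambda_- \lVert f_- \rVert_{\mathcal{H}_-}^2 \\
\text{subject to} \quad
  & y_i f(x_i) \ge 1 - \xi_i, \qquad i = 1, \dots, n, \\
  & f = f_+ \oplus f_-, \qquad f_\pm \in \mathcal{H}_\pm .
\end{aligned}
```

Setting $\lambda_+ = \lambda_-$ would regularise both components equally, while tuning them separately lets the model weight the two components of an indefinite kernel differently.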

Sequence similarities and distances
We now proceed to dene the sequence similarities used throughout this work.This section closely follows the works of Setubal and Meidanis 45 and Yujian and Bo. 43First, we clarify our terminology.

Notation and terminology.
Let S be a nite alphabet, S n 4 S be the set of all strings of length n from S and S* the set of all strings from that alphabet.A string s ˛S* of length n is a sequence of characters that can be indexed as s = s 1 .sn .Given a string s ˛Sn , we say that u ˛Sm is a subsequence of s if there exists a set of indices I = {i 1 , ., i m } with 1 # i 1 # .# i m # n, such that u j = S i j for j = 1, ., m.We write u = s[I] for short.We say that v ˛Sl is a substring of s if v is a subsequence of s with index set J = {j 1 , ., j l } such that j r+1 = j r + 1 for r = 1, ., l − 1.That is, v is a subsequence consisting of consecutive characters of s.
2.2.2 Global alignments. The goal of a sequence alignment is to establish a correspondence between the characters in two sequences. In the context of bioinformatics, a pairwise alignment can indicate ancestral, structural or functional similarities between the pair of sequences. In this section, we provide a formal review of global sequence alignment.
Denition 2.1 (global alignment).Let S be an alphabet and let s ˛Sn and t ˛Sm be two strings over S. Dene S g = S W {"−"} as the extension of S with the gap character "−".The tuple a(s, t) = (s ′ , t ′ ) is a global alignment of sequences s and t if and only if (1) The subsequence of s ′ obtained by removing all gap characters is equal to s, (4) The subsequence of t ′ obtained by removing all gap characters is equal to t, (5) fi s 1 provides a formal denition of global alignment.Whilst many possible alignments exist for two strings, the goal of sequence alignment is to nd an alignment that optimises some criterion.A scoring function, presented in Denition 2.2, can be used to quantify the "appropriateness" of an alignment.An optimal global alignment is then one which is optimal with respect to a given scoring function, as shown in Denition 2.3.
Denition 2.2 (scoring functions).Let S be an alphabet, S g = S W {"−"} be the extension of S with the gap character "−" and p : S g Â S g /ℝ be a function dened over the elements of S g .We say p is a similarity scoring function over S g if, for all x, y S, we have (1) p(x, x) > 0, (2) p(x, x) > p(x, y), (3) p(x, y) = p(y, x), (4) p(x, "−") # 0, (5) p("−", "−") = −N.Similarly, we say p is a distance scoring function over S g if, for all x, y, z ˛S, we have (1) p(x, x) = 0, (2) p(x, y) > 0, (3) p(x, y) = p(y, x), (4) p(x, "−") > 0, (5) p(x, z) # p(x, y) + p(y, z), (6) p("−", "−") = N. Denition 2.3 (optimal global alignment).Let S be an alphabet, S g = S W {"−"} be the extension of S with the gap character "−" and consider two strings s ˛Sn , t ˛Sm .Let a(s, t) = (s ′ , t ′ ) be a valid global alignment of s and t (valid in the sense that it satises the conditions of Denition 2.1), p : S g Â S g /ℝ be a scoring function over S g and Aðs; tÞ be the space of all valid alignments of s and t.The score S p ðaðs; tÞÞ of a(s, t) with respect to the scoring function p is dened as Since an alignment is optimal with respect to a given scoring function, it is natural to consider which scoring function to use in order to obtain the most meaningful alignments.In the context of biological sequences, researchers have been considering this problem for many years.A number of families of scoring matrices have been designed to encode useful notions of similarity.In this work, we only considered the BLOSUM62 scoring matrix, 46 as it is a standard choice when performing sequence alignment.In order to dene more complex transformations, one can consider the consecutive application of a sequence of elementary edit operations.Of special interest to the Levenshtein distance are those sequences of operations that transform one string into another.Such a sequence is known as an edit path and its length is dened as the number of operations in the sequence.The Levenshtein distance between 
two strings is dened as the length of the minimum length edit path, as seen in Denition 2.5.
Denition 2.5 (Levenshtein distance).Let S be an alphabet and consider two strings s ˛Sn and t ˛Sm from S. An edit path from s to t is denoted by P s,t and represents a sequence of elementary edit operations that transforms s into t.Denote by jP s,t j the number of operations contained in P s,t and by P s;t the space of all edit paths from s to t.Then the score S p ða * ðs; tÞÞ of a*(s, t) with respect to p is equal to the cardinality of U.This is exactly equal to the Levenshtein distance between s and t.

Local alignments.
A global alignment produces an alignment which spans the whole length of a pair of strings. It is based on the assumption that the strings are related in their entirety. This assumption can be restrictive, since it is often the case that certain substrings exhibit high similarity whilst others do not. A local alignment produces an alignment that finds those high similarity substrings. That is, it finds the highest scoring global alignment from all possible substrings of the pair of strings. We formalise this notion in Definition 2.6.
Denition 2.6 (optimal local alignment).Let S be an alphabet, S g = S W {"−"} be the extension of S with the gap character "−" and p : S g Â S g /ℝ be a similarity scoring function.For a string s ˛Sn , let I s be the space of all index sets such that for any I s ˛I s , s[I s ] is a valid substring of s.Similarly, for a string t ˛Sm , let I t be the space of all index sets such that for any I t ˛I t , t[I t ] is a valid substring of t.

Computational methodology
This section discusses the setup of our computational evaluation, as well as the datasets used.

2.3.1 Computational setup. To assess the usage of learning with sequence alignment functions, we performed a series of computational experiments on a number of AMP datasets. In each of our evaluations, we tested both the local alignment score (LA) and the Levenshtein distance (LEV) in conjunction with the Kreĭn-SVM algorithm. We compare against two baselines: an SVM with amino acid composition kernel and an SVM using the gapped k-mer kernel. The former is a positive-definite kernel function; peptides are represented via their amino acid composition and the kernel is defined as the inner product under this representation. The latter is also a positive-definite kernel function. It has produced accurate models in a number of biological-sequence classification tasks [47][48][49] and hence makes for a useful baseline. When applicable, we also compared our models with AMP identification tools from the literature. The parasail package 50 was used to compute local alignment scores. We only considered normalised variants of the local alignment score and Levenshtein distance, with the normalisation performed according to Schölkopf et al. 51 and Yujian and Bo, 43 respectively. We report the accuracy, the area under the receiver operating characteristic curve (AUC) and Matthews correlation coefficient (MCC) to compare models. The accuracy and MCC are defined in eqn (3), where TP, TN, FP and FN are the number of true-positives, true-negatives, false-positives and false-negatives, respectively.

\[
\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \qquad
\mathrm{MCC} = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}} \tag{3}
\]
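The accuracy and MCC can be computed directly from the four confusion-matrix counts. This is a minimal sketch using the standard definitions of both metrics; returning 0 when the MCC denominator vanishes is a common convention and is assumed here.

```python
import math


def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Fraction of instances that are classified correctly."""
    return (tp + tn) / (tp + tn + fp + fn)


def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient, in the range [-1, 1]."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0  # assumed convention when a row or column is empty
    return (tp * tn - fp * fn) / denom
```

Unlike accuracy, the MCC uses all four counts symmetrically, so a classifier that labels every peptide as active scores 0 rather than the proportion of active peptides.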
The AUC is dened as the probability that a classier will score a randomly chosen positive instance higher than a randomly chosen negative instance. 52In order to allow for a fair comparison, all models used the same training and test splits.The optimal hyperparameters were selected by performing an exhaustive grid search over the training set, using 10-fold cross validation.The l hyperparameter of the SVM algorithm, as well as the l + and l − hyperparameters of the Kreĭn-SVM algorithm, were selected from {0.01, 0.1, 1, 10, 100}.The Levenshtein distance and amino acid composition kernel have no hyperparameters to control; we used the default values for the hyperparameters of the local alignment score.The gapped kmer kernel has two hyperparameters g and m and is quite susceptible to their values.The optimal value of g was selected from {1, 2, 3, 4, 5} and the optimal value of m was selected from {1, 2, 3, ., 10}.It is required that g > m, so only valid combinations of the two were considered.In our nested crossvalidation experiments, we used 10 inner and 10 outer folds and the reported results are averaged over the outer fold test sets.
2.3.2 General antimicrobial datasets. We selected two AMP classification datasets from the literature, which we refer to as AMPScan 26 and DeepAMP, 25 in order to test the ability of our approach to predict general antimicrobial activity. Detailed discussions on the creation of these datasets can be found in the original studies. The AMPScan and DeepAMP datasets contain 3556 and 3246 instances, respectively. Each dataset also contains a 50 : 50 ratio of AMPs to non-AMPs, allowing us to avoid issues that result from class imbalance. Associated with each dataset is a specific test set, and reporting results on this set allows comparison with the authors' proposed models. Despite being of similar size, one major differentiating factor between the two datasets is the length of peptides. Fig. 1 displays the empirical distribution of peptide lengths for both datasets, partitioned by peptide classification. In both cases, the distributions corresponding to AMPs and non-AMPs are very similar. However, the distributions across datasets are clearly very different. The Kolmogorov-Smirnov two-sample test 53 provides evidence to reject the null hypothesis that the peptide length distributions of the AMPScan and DeepAMP datasets are identical. DeepAMP contains generally shorter peptides than AMPScan. Indeed, the DeepAMP dataset was curated since short-length AMPs have been shown to exhibit enhanced activity, lower toxicity and higher stability as opposed to their longer counterparts. 54,55 More importantly, synthesis is cheaper for the short AMPs than the full-length AMPs, which increases the potential for clinical translation and commercialisation. 12 Fig. 2 displays the empirical amino acid distributions for both datasets, indicating their similarity.
2.3.3Species-specic datasets.To test the ability of our methodology to identify activity against specic species, we have utilised an external dataset of 16 peptides with minimum inhibitory concentration (MIC) measured against S. aureus ATCC 29213 and P. aeruginosa ATCC 27853.To make this dataset suitable for classication, we label a peptide as active if its MIC <100 mg mL −1 and inactive otherwise.Further details of the microbiological experimentation used to construct this dataset can be found in Section 2.4.
Vishnepolsky et al. 56 have shown that, given an appropriate training dataset, predictive models of peptide activity against specific species can be constructed. This involves training a separate model for each species of interest, each of which, in their case, was based on the DBSCAN algorithm. 57 Their model predicting activity against E. coli ATCC 25922 achieved a balanced accuracy of 0.79, which was greater than that of a number of common AMP prediction tools. 56 Furthermore, models to predict activity against S. aureus ATCC 25923 and P. aeruginosa ATCC 27853 were made publicly available as web-tools.
We follow the methodology of Vishnepolsky et al. to construct useful training datasets for our problem. We utilised the Database of Antimicrobial Activity and Structure of Peptides (DBAASP) as a source of data. DBAASP contains peptide activity measurements against a wide range of species, 58 including those of interest to us. We extracted from DBAASP all peptides with activity measured against S. aureus ATCC 29213, S. aureus ATCC 25923 or P. aeruginosa ATCC 27853 subject to the following conditions: (i) peptide length in the range [6, 18], (ii) without intrachain bonds, (iii) without non-standard amino acids and (iv) MIC measured in μg mL⁻¹ or μM. Condition (i) was imposed as that is the range of peptide lengths in our external test set. Conditions (ii) and (iii) were imposed following the recommendation of Vishnepolsky et al. and condition (iv) was imposed as conversion from μM to μg mL⁻¹ is possible by estimating the molecular weight of a given sequence. Since no web-tool to predict activity against S. aureus ATCC 29213 was available, we could not directly compare our results. Instead, we collected data for peptides active against S. aureus ATCC 25923. This allowed us to compare our models with the state of the art provided by Vishnepolsky et al.
Three separate datasets of peptides with activity measured against S. aureus ATCC 29213, S. aureus ATCC 25923 and P. aeruginosa ATCC 27853 were created using the data collected from DBAASP. We refer to these datasets as SA 29213, SA 25923 and PA 27853, respectively. The peptides in the respective datasets were labelled according to their activity against the specific strain. Each dataset is constructed from highly active peptides (MIC ≤ 25 μg mL⁻¹) and inactive peptides (MIC ≥ 100 μg mL⁻¹). A peptide with 25 μg mL⁻¹ < MIC < 100 μg mL⁻¹ was not included in our training datasets. This large interval allows us to account for experimental errors, which in turn increases the confidence in our class labels. In the case that a peptide was associated with multiple activity measurements, the median value was taken to represent its activity. As shown in Table 1, the three training datasets are all relatively small and contain slightly more active peptides than inactive peptides.
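The labelling rule for the training datasets can be sketched as follows. The thresholds and the use of the median come from the text above; the function names are hypothetical, and the molecular weight is taken as a given input rather than estimated from the sequence.

```python
from statistics import median


def um_to_ug_per_ml(conc_um: float, mol_weight: float) -> float:
    """Convert a concentration in uM to ug/mL given a molecular weight in g/mol."""
    return conc_um * mol_weight / 1000.0


def label_training_peptide(mics_ug_per_ml):
    """Return 1 (active), 0 (inactive) or None (excluded from training)."""
    mic = median(mics_ug_per_ml)  # a single value represents repeated measurements
    if mic <= 25:
        return 1
    if mic >= 100:
        return 0
    return None  # 25 < MIC < 100: too ambiguous to label confidently
```

For instance, a peptide measured at 10 μM with an assumed molecular weight of 2000 g/mol corresponds to 20 μg mL⁻¹ and would be labelled as active.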

Microbiological experiments
A previously established dataset of 16 short-length peptides (18 amino acids or shorter in length) was used to test the ability of the developed ML algorithms in predicting antimicrobial activity against S. aureus and P. aeruginosa. 7 These peptides were commercially synthesised by Mimotopes (Mulgrave, Victoria, Australia) via solid phase Fmoc synthesis and were purified by reverse phase high performance liquid chromatography (RP-HPLC) to >95% purity. The efficacy of these peptides was already experimentally validated using an established MIC assay with the broth microdilution method approved by the Clinical and Laboratory Standards Institute. Full details of the previously conducted microbiological experiments can be found in the previous study. 7

Results
In Section 3.1, we discuss the ability of our models to identify general antimicrobial activity. We observe that our proposed models consistently outperform the baselines and, in some cases, the models proposed in the literature. One shortcoming of any computational model that identifies general antimicrobial activity is that it cannot be used to identify activity against specific species. We address this shortcoming in Section 3.2, by training our models to identify activity against S. aureus ATCC 29213 and P. aeruginosa ATCC 27853 on an experimentally validated dataset of 16 peptides, for which the proposed approach produces accurate models.
Identifying general antimicrobial activity
3.1.1 Nested cross-validation. The performance of the AMP classifiers on the considered datasets is reported in Table 2. The results are averaged over the multiple test sets generated by nested cross-validation. On both datasets, the proposed models achieve a greater average accuracy, AUC and MCC than the baselines, with the local alignment score achieving the best values in all cases. The Welch t-test, 59 at a significance level of 0.05, is used to compare the test set AUC of our proposed models against the baseline SVM with Gapped k-mer kernel across the outer folds of nested cross-validation. Adjusting for the testing of multiple hypotheses with the Bonferroni correction, we observe a significant difference between the mean AUC of the local alignment score and that of the baseline SVM with Gapped k-mer kernel on the AMPScan dataset. All other comparisons, including those on the DeepAMP dataset, are not significant. The performance of all models is greater on the AMPScan dataset than on the DeepAMP dataset. This may, in part, be due to the fact that DeepAMP contains generally shorter peptides. Shorter peptides provide less information to the models, which can limit their discriminative ability.
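The statistical comparison above can be sketched with the standard library alone. The snippet computes the Welch t-statistic and the Welch-Satterthwaite degrees of freedom for two samples of per-fold AUC values, together with the Bonferroni-adjusted significance level. The number of hypotheses is left as a parameter because it is not stated in the text, and the p-value itself would additionally require the t-distribution (for example via SciPy's `scipy.stats.ttest_ind` with `equal_var=False`).

```python
import math
from statistics import mean, variance


def welch_t(a, b):
    """Return the Welch t-statistic and Welch-Satterthwaite degrees of freedom."""
    va = variance(a) / len(a)  # squared standard error of the mean of a
    vb = variance(b) / len(b)
    t = (mean(a) - mean(b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


def bonferroni_alpha(alpha: float, n_tests: int) -> float:
    """Per-test significance level after the Bonferroni correction."""
    return alpha / n_tests
```

With four comparisons, for example, each individual test would need a p-value below 0.0125 to be declared significant at an overall level of 0.05.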
3.1.2Predened test set.Table 3 reports the results of all models on the predened test set associated with each dataset.For the sake of completeness, we also include the performance of the neural network-based classiers proposed by the authors of each dataset.As for the nested cross-validation results, we observe that all models perform better on the AMPScan dataset than the DeepAMP dataset.Considering the former, the local alignment score achieves the largest accuracy, AUC and MCC, and is followed closely by the literature model.On the DeepAMP dataset, the performance is similar among all methods.The local alignment score achieves the best AUC but the Levenshtein distance achieves the best accuracy and MCC.However, the Levenshtein distance outperforms the literature model against all metrics.On both datasets, the baselines are the least predictive models.It is encouraging to observe that the sequence alignment functions can produce classiers that match, and also outperform, the neural network-based classiers.

Identifying species-specic activity
In this section, we highlight the ability of our models to identify AMPs that are active against specific species, particularly S. aureus ATCC 29213 and P. aeruginosa ATCC 27853. Table 4 displays the accuracy on the external test set for models trained on the AMPScan and DeepAMP datasets. Once again, we also include the performance of the neural-network-based classifiers proposed by the authors of the AMPScan and DeepAMP datasets. We observe that the performance of all models is very poor. We noticed in our investigations that the majority of models predicted active for a large proportion of the peptides. This general poor performance is to be expected. We attribute it to the fact that these models have been trained to recognise if a peptide exhibits antimicrobial activity against any type of species. It is therefore unreasonable to assume that they are able to discriminate activity against a specific species.

Table 5 displays the accuracy on the external test set for models trained on the species-specific datasets. We also present the performance of the web-tools provided by Vishnepolsky et al. 56 As mentioned in Section 2.3.3, at the time of publication, there is no web-tool to predict activity against S. aureus ATCC 29213. Hence, we also provide results for models trained on SA 25923 and tasked with predicting activity against S. aureus ATCC 29213. There is clearly a general improvement over the models trained on the AMPScan and DeepAMP datasets, indicating that the models have much greater discriminative power. On the SA 29213 dataset, the baseline SVM with amino acid composition kernel is the most predictive with respect to all the metrics. All remaining models achieve the same accuracy and MCC. Considering the SA 25923 dataset, the Levenshtein distance produces a model with the same accuracy and MCC as the web-tool but with a larger AUC. It also achieves the same accuracy and AUC as the baseline SVM with amino acid composition kernel, but with a larger MCC. It is interesting to note that the models trained on SA 25923 can still make accurate predictions on S. aureus ATCC 29213. Whilst these are two different strains, the findings suggest that the antimicrobial susceptibility to the AMPs is similar for both strains, implying similar mechanisms work in the same species. Considering the PA 27853 dataset, the baseline SVM with amino acid composition kernel performs the best against all metrics. We find that the local alignment score, Levenshtein distance and baseline SVM with Gapped k-mer kernel produce equally accurate models, all of which are more accurate than the web-tool. However, the AUC of the local alignment score and Levenshtein distance are considerably higher than that of the baseline SVM with Gapped k-mer kernel. Whilst it is difficult to draw any strong conclusions on such a small dataset, it is still encouraging to observe that our models achieve similar accuracy to both the baseline models and web-tools.

Conclusions
We have assessed the capabilities of sequence alignment functions coupled with the Kreĭn-SVM as AMP classification models. Our investigations indicate that the proposed methodology produces accurate classifiers of both general and species-specific antimicrobial activity of AMPs. The utility of our methodology is twofold. Firstly, since sequence alignment algorithms operate directly on amino acid sequences, these methods do not explicitly require the use of peptide features. This removes the need for the practitioner to decide which features to use, which is often a detailed and time-consuming process. Secondly, in all of our experiments, we used the local alignment score with its default hyperparameters. These promising results prompt the question of whether more accurate models could be attained by also tuning the various hyperparameters of the local alignment score, such as the choice of scoring function, which we will explore in future work.
As the chemical space of natural peptides is extremely large, the development of highly accurate classifiers will help accelerate the discovery and development of novel de novo AMPs. In addition, the promising results generated from this study open a number of possible avenues for further work. Our identification of activity against specific species could be improved using a larger external test set, allowing us to draw stronger conclusions. Furthermore, the methodology we have presented is not specific to the classification of AMPs. We suspect that the Kreĭn-SVM coupled with sequence alignment functions could be applied to other biological-sequence classification tasks. Our computational findings demonstrate not only the feasibility of the proposed approach but more generally the utility of the Kreĭn-SVM as a classification algorithm. Its use of indefinite kernel functions provides a means for practitioners to learn from domain-specific similarity functions without the concern of verifying the positive-definite assumption. This is beneficial since it is often well beyond the expertise of the practitioner to verify this assumption. Whilst we have explored its use when combined with sequence alignment functions, there exist many more indefinite kernel functions with which the Kreĭn-SVM could be combined. Furthermore, the theoretical insight of separately regularising the decomposition components of a function in a Kreĭn space could be applied to develop other indefinite kernel-based learning algorithms. A notable example that relates to the current study would be to extend the One-Class SVM 60 to incorporate indefinite kernel functions. The One-Class SVM is a kernel-based learning algorithm that performs anomaly detection. 61 It has previously been applied to identify the domain of applicability of virtual screening models. 62 An indefinite kernel extension of the One-Class SVM could be directly applied to estimate the domain of applicability of our models.
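To make the positive-definite assumption mentioned above concrete, the sketch below tests whether a small symmetric similarity (Gram) matrix is strictly positive-definite by attempting a Cholesky factorisation. This is an illustration under our own assumptions: the function name and tolerance are hypothetical, the matrices are toy examples rather than matrices computed from peptide data, and a check on one finite sample can only refute, never prove, that a similarity function is a valid kernel. A matrix that is merely positive semi-definite is also rejected by this strict test.

```python
import math


def is_positive_definite(K, tol=1e-12):
    """Return True if the symmetric matrix K admits a Cholesky factorisation.

    K is a list of lists of floats. The factorisation K = L L^T with a
    strictly positive diagonal exists exactly when K is positive-definite.
    """
    n = len(K)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            acc = K[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                if acc <= tol:
                    # A non-positive pivot means K is not positive-definite.
                    return False
                L[i][j] = math.sqrt(acc)
            else:
                L[i][j] = acc / L[j][j]
    return True


# Toy Gram matrices (hypothetical values).
K_valid = [[2.0, 1.0], [1.0, 2.0]]       # eigenvalues 1 and 3
K_indefinite = [[1.0, 2.0], [2.0, 1.0]]  # eigenvalues -1 and 3

print(is_positive_definite(K_valid))
print(is_positive_definite(K_indefinite))
```

A standard SVM presupposes that every such Gram matrix passes a test of this kind, whereas the Kreĭn-SVM is designed to accept similarity functions whose Gram matrices may fail it.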

2.1 SVM and Kreĭn-SVM
2.1.1 SVM. The SVM is an ML algorithm used for classification. The result of training an SVM is a hyperplane whose distance to the closest training instance, in either class, is maximal. Furthermore, instances from each class are required to reside on separate sides of the hyperplane. New instances are classified based solely on which side of the hyperplane they are located. The distance from the hyperplane to the closest training instance is known as the margin. The hyperplane that maximises the margin is the maximum-margin hyperplane and this is what is produced when training an SVM.
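The geometric notions above can be sketched in a few lines of Python. The example does not train an SVM. It only evaluates the margin of a given hyperplane, written here as the set of points x with w · x + b = 0, and classifies new points by the side on which they fall. The names `w`, `b` and the toy points are our own assumptions for illustration.

```python
import math


def decision_value(w, b, x):
    """Signed, unnormalised distance of x from the hyperplane w . x + b = 0."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def classify(w, b, x):
    """Assign +1 or -1 according to the side of the hyperplane x lies on."""
    return 1 if decision_value(w, b, x) >= 0 else -1


def margin(w, b, points):
    """Distance from the hyperplane to the closest point in `points`."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return min(abs(decision_value(w, b, x)) for x in points) / norm


# Toy two-dimensional training set (hypothetical).
positives = [(2.0, 0.0), (3.0, 1.0)]
negatives = [(0.0, 0.0), (0.0, 1.0)]

# Candidate hyperplane: the vertical line x1 = 1.
w, b = (1.0, 0.0), -1.0

print(margin(w, b, positives + negatives))
print(classify(w, b, (5.0, 5.0)), classify(w, b, (-1.0, 0.0)))
```

For this toy set the closest points of opposite classes are two units apart, so no separating hyperplane can have a margin larger than one, and the line x1 = 1 attains that bound.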

2.2.3 Levenshtein distance. The string edit distance defines a useful notion of distance between a pair of strings. It is informally defined as the minimum number of edit operations required to transform one string into another. The Levenshtein distance is a variant of the string edit distance that allows the operations of substitution, deletion and insertion of characters, and these are defined in Definition 2.4.
Definition 2.4 (elementary edit operations). Let $\Sigma$ be an alphabet. For two characters $a, b \in \Sigma$, we denote by $a \to b$ the elementary edit operation that substitutes $a$ with $b$. Denoting by $\varepsilon$ the null character (the empty string), we can define the elementary edit operations of insertion and deletion as $\varepsilon \to b$ and $a \to \varepsilon$, respectively.
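The informal definition above, the minimum number of substitutions, insertions and deletions needed to turn one string into another, can be computed with the standard dynamic-programming recurrence. The sketch below is a generic textbook implementation and is not necessarily the code used in this study. The peptide-like strings in the usage lines are hypothetical.

```python
def levenshtein(s, t):
    """Minimum number of substitutions, insertions and deletions turning s into t."""
    # prev[j] holds the distance between the current prefix of s and t[:j].
    prev = list(range(len(t) + 1))
    for i, a in enumerate(s, start=1):
        curr = [i]  # transforming s[:i] into the empty string needs i deletions
        for j, b in enumerate(t, start=1):
            substitution = prev[j - 1] + (0 if a == b else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1]


print(levenshtein("kitten", "sitting"))  # classic example with distance 3
print(levenshtein("KLLK", "KLAK"))       # one substitution
print(levenshtein("", "GIG"))            # three insertions
```

The running time is proportional to the product of the two string lengths, and only two rows of the table are kept in memory at any time.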

Definition 2.5 shares some interesting similarities with Definition 2.3. Both involve solving a combinatorial optimisation problem and, indeed, the Levenshtein distance can be realised as a special case of global alignment. More specifically, consider the distance scoring function $p : \Sigma_\gamma \times \Sigma_\gamma \to \{0, 1\}$ defined as
$$p(x, y) = \begin{cases} 0, & \text{if } x = y \\ 1, & \text{otherwise.} \end{cases}$$
For two strings $s \in \Sigma^n$ and $t \in \Sigma^m$, let their optimal global alignment with respect to $p$ be equal to $a^*(s, t) = \{s', t'\}$. Define the set $U$ as

Fig. 1 The distribution of peptide lengths for (a) the AMPScan and (b) DeepAMP datasets.

Fig. 2 The distribution of amino acids for the AMPScan and DeepAMP datasets.
The Levenshtein distance $d_L(s, t)$ between $s$ and $t$ is defined as $d_L(s, t) = \min \ldots$
For any $I_s \in \mathcal{I}_s$ and $I_t \in \mathcal{I}_t$, denote by $a^*(s[I_s], t[I_t])$ the optimal global alignment of $s[I_s]$ and $t[I_t]$ with respect to $p$. The optimal local alignment $a^*_L(s, t)$ of $s$ and $t$ with respect to $p$ is defined as $a^*_L(s, t) = \arg\max \ldots \big(a^*(s[I_s], t[I_t])\big)$.
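The local alignment described above searches over pairs of substrings for the best-scoring global alignment. A common way to obtain the corresponding score is the Smith-Waterman recurrence, sketched below. The scoring scheme here (match +2, mismatch -1, linear gap penalty -1) is a hypothetical choice made purely for illustration. The scoring function and gap penalties used in this study are not restated in this section, and a substitution matrix would normally replace the simple match/mismatch rule.

```python
def local_alignment_score(s, t, match=2, mismatch=-1, gap=-1):
    """Best score of any alignment between a substring of s and a substring of t.

    Uses the Smith-Waterman recurrence with a linear gap penalty. The default
    parameter values are illustrative assumptions only.
    """
    prev = [0] * (len(t) + 1)
    best = 0
    for a in s:
        curr = [0]
        for j, b in enumerate(t, start=1):
            diagonal = prev[j - 1] + (match if a == b else mismatch)
            up = prev[j] + gap         # gap inserted in t
            left = curr[j - 1] + gap   # gap inserted in s
            # Clamping at zero lets an alignment restart anywhere.
            cell = max(0, diagonal, up, left)
            curr.append(cell)
            best = max(best, cell)
        prev = curr
    return best


# Hypothetical sequences sharing the motif "KLLK".
print(local_alignment_score("AAAKLLKAAA", "GGKLLKGG"))
print(local_alignment_score("AAA", "GGG"))
```

Unlike the Levenshtein distance, this quantity is a similarity: larger values indicate more closely related sequences, and flanking regions that do not align do not reduce the score.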

Table 1 Descriptive statistics of the SA 29213, SA 25923 and PA 27853 datasets

Table 2 Quality of the predictions on the AMPScan and DeepAMP datasets. The average accuracy, AUC and MCC are reported (standard deviation in parentheses), computed over the outer fold test sets of the nested cross-validation procedure. Results are presented for the Kreĭn-SVM with local alignment score (LA-KSVM), the Kreĭn-SVM with Levenshtein distance (LEV-KSVM), the SVM with Gapped k-mer kernel (GKM-SVM) and the SVM with amino acid composition kernel (AAC-SVM)

Table 3 Quality of the predictions on the AMPScan and DeepAMP datasets. The accuracy, AUC and MCC are reported, computed on the predefined test sets. Results are presented for the Kreĭn-SVM with local alignment score (LA-KSVM), the Kreĭn-SVM with Levenshtein distance (LEV-KSVM), the SVM with Gapped k-mer kernel (GKM-SVM) and the SVM with amino acid composition kernel (AAC-SVM). Results from the respective neural network-based classifiers 25,26 proposed by the authors of each dataset are also presented, denoted by Literature

Table 4 Predictive accuracy of the Kreĭn-SVM with local alignment score (LA-KSVM), the Kreĭn-SVM with Levenshtein distance (LEV-KSVM) and the SVM with Gapped k-mer kernel (GKM-SVM) on the species-specific test sets of 16 peptides. Results from the respective neural network-based classifiers 25,26 proposed by the authors of each dataset are also presented, denoted by Literature. The dataset column indicates which dataset a model was trained on. The heading S. aureus indicates the model was predicting activity against S. aureus ATCC 29213 and the heading P. aeruginosa indicates the model was predicting activity against P. aeruginosa ATCC 27853

Table 5 Quality of the predictions of the Kreĭn-SVM with local alignment score (LA-KSVM), the Kreĭn-SVM with Levenshtein distance (LEV-KSVM), the SVM with Gapped k-mer kernel (GKM-SVM) and the SVM with amino acid composition kernel (AAC-SVM) on the test sets of 16 peptides. Results from the DBAASP web-tools 56,58 are also presented. The headings in the third to fifth columns indicate which dataset the models were trained on. The models trained on both SA 29213 and SA 25923 were tasked with predicting activity against S. aureus ATCC 29213. The models trained on PA 27853 were tasked with predicting activity against P. aeruginosa ATCC 27853. Numbers highlighted in bold indicate the largest AUC achieved on the respective dataset