Prediction of the taxonomical classification of the Ranunculaceae family using a machine learning method†
Abstract
Ranunculaceae is a botanical source of various pharmaceutically active compounds and has been commonly used in traditional Chinese medicine. Increasing interest in the pharmaceutical resources of Ranunculaceae has prompted taxonomical study of this family, which may provide new insight into its diversification, relationships and phylogenetic position, and may further help to identify new medicinal resources and promising compounds. In this study, we used machine learning methods to explore the classification of the medicinal Ranunculaceae family. A total of 204 species representing 17 genera of the Ranunculaceae family were collected from the TCMID, together with their 1280 active compounds, which were encoded as structure-based fingerprints. After constructing species-compound and genus-compound matrices, we identified CNNs as the best machine learning method and Ext fingerprints as the best fingerprint type, using accuracy (ACC) and F-score as the criteria. We found that taxonomical classification within the Ranunculaceae family could be accurately predicted, especially at the genus level, with a top ACC of 0.86 and an F-score of 0.85. The top compound features that were important for the classification of the 17 genera were also identified, and some genera with high medicinal value were thereby associated with characteristic cis and/or trans features. To our knowledge, this is the first time that genera have been found to be associated with the structural features of compounds.
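As an illustration of the workflow summarized above, the following minimal Python sketch builds a species-compound feature matrix from binary fingerprints and computes ACC and F-score for a set of predicted genus labels. It is a sketch under stated assumptions, not the pipeline used in this study: the toy fingerprints, the species and genus names, the summing of fingerprint bits per species, and the macro-averaging of the F-score are all hypothetical choices made for the example.

```python
# Toy data (hypothetical): 4-bit fingerprints for three compounds and
# the compounds reported for two species.
fingerprints = {
    "c1": [1, 0, 1, 0],
    "c2": [0, 1, 1, 0],
    "c3": [0, 0, 0, 1],
}
species_compounds = {"sp_a": ["c1", "c2"], "sp_b": ["c3"]}


def species_feature_matrix(species_compounds, fingerprints, n_bits):
    """Sum fingerprint bits over each species' compounds (assumed aggregation)."""
    matrix = {}
    for species, compounds in species_compounds.items():
        row = [0] * n_bits
        for compound in compounds:
            for i, bit in enumerate(fingerprints[compound]):
                row[i] += bit
        matrix[species] = row
    return matrix


def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label equals the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f_score(y_true, y_pred):
    """Unweighted mean of per-class F1 scores (macro averaging is an assumption)."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


matrix = species_feature_matrix(species_compounds, fingerprints, n_bits=4)

# Hypothetical true and predicted genus labels for four species.
y_true = ["genus_A", "genus_A", "genus_B", "genus_B"]
y_pred = ["genus_A", "genus_B", "genus_B", "genus_B"]

print(matrix)
print("ACC:", accuracy(y_true, y_pred))
print("F-score:", round(macro_f_score(y_true, y_pred), 4))
```

A genus-compound matrix could be formed in the same way by grouping compounds by genus instead of by species. In practice the rows of such a matrix would be passed to a classifier, and the predicted labels scored as shown.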