Atomic partial charge predictions for furanoses by random forest regression with atom type symmetry function

Furanoses that are components for many important biomolecules have complicated conformational spaces due to the flexible ring and exo-cyclic moieties. Machine learning algorithms, which require descriptors as structural inputs, can be used to efficiently compute conformational adaptive (CA) charges to capture the electrostatic potential variations caused by the conformational changes in the molecular mechanics (MM) calculations. In the present study, we introduced atom type symmetry function (ATSF) developed based on atom centered symmetry function (ACSF) for describing conformations for furanoses, in which atoms were categorized by atom types defined by their properties or connectivity in classic molecular mechanics (MM) force field parameters to generate a suitable coordinate size. Random forest regression (RFR) models with ATSF showed improvements for predicting CA charges and dipole moments for furanoses compared to those with ACSF and atom name symmetry functions where atoms were categorized by their unique atom names. The CA charges predicted by RFR models with ATSF showed more comparable reproductions of the carbohydrate–water and carbohydrate–protein interactions computed with RESP charges individually derived from QM calculations than the ensemble-averaged atomic charge sets commonly employed in molecular mechanics force fields, suggesting that the predicted CA charges were capable of including electrostatic variations in their dynamic charge values. Improvements by ATSF showed that categorizing atoms by atom types introduced chemical structural perceptions to descriptors and produced a suitable coordinate size in ATSF to capture key structural features for furanoses. This categorizing scheme also allows ATSF to be readily adopted by other biomolecules thanks to the broad implementations of MM force fields.


Introduction
Furanoses are essential components for the backbones of nucleic acids and complex polysaccharides frequently found in organisms ranging from bacteria to protozoa, fungi to plants. 1 They have complicated conformational spaces as their vemembered ring can adopt multiple stable conformations in addition to the spinning of their abundant exo-cyclic groups in solution. 2 These conformational variations lead to heterogeneous intramolecular properties, such as electrostatic potentials, which affect their recognitions and interactions with other biomolecules. 3 These variations also made it difficult for classical molecular mechanics (MM) force elds to adequately represent the electrostatic properties for furanoses, as static atomic partial charge models are commonly employed. 4 These models, however computationally efficient, lack the accuracy to represent the intrinsic electrostatic potential variations. Efforts have been devoted to develop charge models that are capable of adapting conformational variations. [4][5][6][7][8][9][10] Approaches have been proposed and developed, yet, adoptions of these approaches depend on their applicability and ease of use. It is also unfeasible to derive conformational adaptive (CA) charges during MM calculations directly from electrostatic potentials obtained from resource-hogging quantum mechanics (QM) calculations. 5,11,12 Therefore, it is desirable to efficiently compute QM-quality CA charges that can be used within the classical MM framework.
Machine learning algorithms have been implemented in calculating electrostatic potentials and provide a promising alternative approach for computing CA charges. [13][14][15] The accuracy and efficiency of machine learning algorithms critically depend on the descriptors that are used to represent molecular structures. 16,17 Descriptors for machine learning algorithms, unlike cartesian coordinates, are required to be invariant under permutations of atoms, as well as translations or rotations of the molecule, so as to represent any conformation in a unique set of coordinates. 16 These descriptors also need to describe the key structural features of molecules with a sufficient size of coordinates. The atom centered symmetry function 18 (ACSF) introduced by Behler and Parrinello in 2007 has become a prominent descriptor for machine learning algorithms and many successful implementations have been reported. [19][20][21][22][23][24] ACSF categorizes atoms by the number of atom types in the molecule, which determines the size of its coordinates. Thus, when describing furanoses and other biomolecules that usually possess complicated conformational spaces but limited types of chemical elements, improvements may be needed.
In the present study, we introduced atom type symmetry function (ATSF), that categorized atoms by their atom types dened in MM force elds 3,25-27 and provided more detailed structural descriptions, to predict CA charges with properly trained random forest regression (RFR) models. 28 Atom type is a well-established and crucial concept embedded in common MM force elds, in which atoms are categorized beyond chemical elements and by their properties or connectivity. In the furanose-specic GLYCAM force eld, 3 atoms for furanoses ( Fig. 1) that belong to three chemical elements were further categorized into eight different atom types and the size of coordinates for ATSF increased by more than three times comparing to that for ACSF. The Pearson correlation coefficients for predicted charge values by RFR models with ATSF with reference to RESP charges derived from QM calculations were all above 0.9 and increased averagely by 9% and 4%, respectively, comparing to ACSF and atom name symmetry functions (ANSF). In addition, the predictions of dipole moments were improved by 43% and 48% by charges predicted with ATSF comparing to those with ACSF and ANSF, respectively. Furthermore, the electrostatic related interactions for furanosides, carbohydrate-water and carbohydrate-protein interactions, computed with the CA charges predicted by RFR models with ATSF reduced the average error by more than half for that calculated with the static ensemble-averaged charge models and individual RESP charges derived from QM calculations, which indicated that the predicted CA charges was capable of including electrostatic variations in their dynamic charge values.

Training and testing dataset formation
The training and testing datasets for RFR models were formed by combining the samplings for the endo-and exo-cyclic conformations for furanosides, in order to sufficiently cover their conformational spaces. The conformations for the vemembered ring are determined by the endo-cyclic rotations and can be described by the pseudorotational itinerary, 29,30 where the phase angle (P) and puckering amplitude (s m ) can be calculated by the ve ring torsion angles (Fig. 2).
In order to thoroughly sample the conformational spaces for the furanose ring, s m was iterated from 3 to 45 at a 3 interval, and P was from 0 to 360 at a 6 interval. A total of 900 ring conformations were generated for each furanoside or furanose. Unlike the intertwining endo-cyclic torsions, all exo-cyclic rotations were independent from each other, so it is hardly to systematically cover exo-cyclic conformational spaces. So, 30 combinations of randomly assigned values for all exo-cyclic dihedral angles were created for each ring conformation, in which 24 structures were randomly selected to construct the training dataset, and the rest were selected to construct the testing dataset.

Quantum mechanics (QM) calculations
All of the QM calculations were performed under the same protocol as that in the development of furanose-specic GLY-CAM force eld (ref) with the Gaussian 16 soware package 31 to maintain equal comparisons. Structural optimizations were performed at the HF/6-31G* level of theory, with ve ring torsion angles and all exo-cyclic torsion angles restrained (Fig. 1). The electrostatic potentials were calculated on these optimized structures at the B3LYP/cc-pVTZ level of theory. The atomic partial charges were derived by employing the restrained electrostatic potential (RESP) charge tting methodology with a weak hyperbolic charge restraint weight of 0.0005. 3,32 The charge values for aliphatic hydrogen atoms were assigned to 0 by following GLYCAM force eld parameter development philosophy. 19 Atom type symmetry function ATSF was constructed under the framework of ACSF, whose coordinates were calculated from cartesian coordinates of atoms. 18 The cutoff function in ATSF was that same as that in ACSF: with R c set to 99Å to include all atoms for molecules included in this study. Radial components of atom i were calculated via a sum of Gaussians, in which R s and h are both set to 1.0. Atom j is an atom in atom types J. Please note that the atom i could be in atom type J. The assembly of G radial i with different atom types constructs the radial components in ATSF for atom i.
Angular components for atom i were constructed as: with the parameters of l ¼ 1.0, z ¼ 1.0. Atom j and k are atoms in atom types J and K, respectively. The atom i could also be in atom types J and K. The assembly of G angular i with different combinations of atom types constructs the angular components in ATSF for atom i. For each atom i in the molecule, the ATSF coordinates can be assembled as: where AT stands for atom type. The training and testing sets were generated by S ¼ {(X 1 ,y 1 ), (X 2 ,y 2 ),/, (X n ,y n )}, where y 1 , y 2 , ., y n are the RESP derived charges for the corresponding atoms.
The charge predictions were achieved via multiple random forest regression (RFR) models, the sum of the predicted charges for an individual molecule is not necessarily 0. The corrections were achieved by spreading the discrepancy based on the standard derivations for RFR predictions. This procedure was adopted from ref. 33.

Random forest regression
RFR model was trained for atoms in each element using the scikit-learn library (version 0.18.1) 34 with the following parameters: number of trees ¼ 200, maximum depth ¼ 6, minimum number of samples to split ¼ 6, and minimum number of samples in leaves ¼ 6.

Molecular dynamics (MD) simulations
The initial coordinates for furanosides 1-4 (a and b) were obtained from GLYCAM website (http://www.glycam.org). All systems were solvated with TIP3P water 35 using a 12Å buffer in a cubic box, using the LEaP module in the AMBER16 soware package. 36 Force eld valence parameters were taken from furanose-specic parameters in GLYCAM. 3 The energy minimizations for these solvated furanoses were performed separately under NVT condition (500 steps steepest descent, followed by 24 500 steps of conjugate-gradient minimization). Subsequently, each system was heated to 300 K over a period of 50 ps, followed by equilibration at 300 K for a further 0.5 ns using NPT condition, with the Berendsen thermostat and barostat 37 for temperature and pressure control, respectively. SHAKE algorithm 38 was employed to constrain all covalent bonds involving hydrogen atoms, allowing a simulation time step of 2 fs throughout the simulations. Aer the equilibration, production simulations were carried out with the GPU implementation 39 of the PMEMD.MPI module and trajectory frames collected at every 1 ps from the total of 300 ns. A non-bonded cut-off of 8Å was applied to van der Waals interactions, with long-range electrostatics treated with the particle mesh Ewald approximation.

Hydration free energy and protein-carbohydrate interaction energy calculations
Hydration free energies for 1-4 (a and b) and protein-carbohydrate interaction energies for 3 ATP-binding cassette (ABC) transporters with furanoses as ligands (PDB ID: 2VK2, a-D-Galf-OH and a-D-Galf-OH as the ligands; PDB ID: 3KSM, b-D-Ribf-OH as the ligand) were calculated with molecular mechanics-generalized born surface area (MM-GBSA) method using single trajectory approach. Monosaccharides in all systems were taken as the ligand in the calculation. Solvent molecules and proteins were taken as the receptors in hydration free energy and proteincarbohydrate interaction calculations, respectively. Each MM-GBSA calculation was performed on the 10 000 evenly extracted structures of each solvated system with 3 different charge sets: RESP charges individually derived from QM calculations, predicted charges by RFR models with ATSF, and the ensembleaveraged atomic charges from GLYCAM force eld.

Results and discussion
Different atom categorizing schemes in symmetry functions ATSF in the present study employed atom types dened in furanose-specic GLYCAM force eld 3 (Fig. 3) to divide atoms into more groups beyond chemical elements. When categorizing atoms by only three chemical elements, ACSF contains 9 coordinates. The size of ATSF increased to 41 coordinates when categorizing atoms by a total of eight different atom types. The descriptor needs to be sufficiently large to ensure an unambiguous distinction of different conformations. 16 In addition, categorizing atoms by their atom types introduced structural perceptions to the descriptor by adding information of property or connectivity for atoms. For complete comparisons, the most subtle categorizing scheme was also employed to generate atom name symmetry function (ANSF), where atoms were divided by their atom names (Fig. 3) and each single atom was in a unique category (Fig. 3). The size of ANSF for furanoside was dramatically increased to 276 coordinates and undoubtedly made this scheme unpractical for efficient calculations. Remarkably, ANSF abolished all chemical or structural information from the descriptor.

Performances for atom type symmetry function
The performances of ATSF, ACSF, and ANSF were, rstly, evaluated individually by comparing the predicted charge values to the corresponding expected values, aka RESP values. 32 Data in panel C of Fig. 4 appeared to be less scattered than those in panel B and C, which suggested a better correlation achieved by ATSF than ACSF and ANSF. The Pearson correlation coefficients for the CA charges predicted with ATSF reference to RESP charges derived from QM calculations increased averagely by 9% and 4%, respectively, comparing to ACSF and ANSF. The Pearson coefficients for all atoms were larger than 0.9 for those predicted with ATSF and systematically higher than the other two descriptors. This stronger correlation indicated RFR models with ATSF were able to produce more accurate CA charges than ACSF and ANSF. The results of linear ttings between predicted and RESP-t charges were shown in Table 1. The slopes and intercepts to Y-axis for predictions with ATSF were close to 1 and 0, respectively, implied charges predicted values were not overestimated or underestimated. It is worth noting that RFR models with ANSF, which has a substantially larger size, did not produce higher quality predictions comparing to ATSF. This suggested that the chemical perceptions in categorizing scheme is crucial and increasing the size of coordinates without chemical perceptions would not necessarily guarantee more accurate predictions.
RFR models with ATSF have shown great potentials in predicting atomic partial charge values from different conformations of furanosides. It is also essential to demonstrate their advances in representing the electrostatic potentials of furanosides by reproducing molecular dipole moments, [40][41][42][43][44] which is the rst order of multipole expansions of electrostatic potentials and strongly depend on the conformation of the molecule. The average absolute differences of dipole moments between QM calculated values and the corresponding MM calculated values with different charge models were shown in Table 2. The differences of dipole moments computed from atomic partial charges predicted with ATSF were 43% and 48% smaller than those computed with ACSF and ANSF, respectively. It is worth noting that RESP charge models has the lowest dipole moment differences, which suggested that this charge model is appropriate to represent the electrostatic potential variations for Fig. 3 Atom types (left) and atom names (right) for furanosides in GLYCAM force field. In GLYCAM force field, 3 "Cf" and "Cg" are for endo-and exo-cyclic carbon atoms, respectively. "Of" stands for the endo-cyclic oxygen atoms; "Os" and "Oh" are for exo-cyclic oxygen atoms in ether and hydroxyl groups, respectively. "H1" and "H2" stand for hydrogen atoms attached to a carbon atom that is bonded with one and two electronwithdrawing atoms, respectively; "Ho" is for the hydrogen atoms in hydroxyl groups.
furanosides and be utilized as the references for RFR model training.
So far, RFR models with ATSF demonstrated their capabilities of predicting atomic partial charges with QM quality. To further conrm the validities of ATSF and its CA charges, the electrostatic-related interactions, carbohydrate-water and carbohydrate-protein interactions, for furanoses computed with predicted charges were compared to those with RESP charges.

Performances on carbohydrate-water interaction energy calculations
Carbohydrate-water interactions, quantied by their hydration free energies, substantially depend on their electrostatic interactions. Thus, the quality of the atomic charges can be evaluated by their computed hydration free energies. 33,45 In terms of conformation adaptive charges, the validity can be tested by comparing their computed hydration free energies to those computed with individual RESP-t charges. Moreover, the hydration free energy from a single solute conformation could introduce errors. 46 So, the averaged hydration free energies  computed with conformational adaptive, RESP-t, and ensemble-averaged charge sets for 100 000 structures of each furanoside in 1-4 (a and b) extracted from explicit solvent MD simulations were compared ( Table 3). The hydration free energies calculated with CA charges predicted with ATSF are comparable to those computed with RESP-t charges. The difference is 0.4 kcal mol À1 , which is signicantly less than that computed from ensemble-averaged charge sets (1.0 kcal mol À1 ). The differences among these calculated hydration free energies are signicant (p-value < 0.0001), because of the large amount of structures employed in hydration free energy calculations, although the standard deviations are mostly over 3.0 kcal mol À1 . It is worth noting that the hydration free energies do not include the entropic penalties, therefore, the values may be more negative than the experimental measured values.

Performance on carbohydrate-protein interaction energy calculations
Hydrogen bonding interaction is one of most popular and crucial hydrophilic interactions between carbohydrate molecules and proteins, [47][48][49] due to the richness of hydroxyl groups presence in the exo-cyclic moieties. Accurately representing electrostatic potentials of carbohydrate molecules is crucial for correctly modeling the strength of hydrogen bonds between carbohydrate molecules and proteins. Yet, the static charge model lacks the accuracy for representing the electrostatic variations due to the changes from both endo-and exo-cyclic conformations of furanoses while interacting with proteins. Thus, the CA charge sets could improve the accuracy for carbohydrate-protein interaction energy calculations.
The computed MM-GBSA energies for three ABC transporter complexes with furanoses as ligands were listed in Table 4. The h|error|i for MM-GBSA energies with CA charges predicted by RFR models with ATSF, comparing to that calculated with RESP charges derived from QM calculations, was only 0.3 kcal mol À1 , while that with the ensemble-averaged charges was 0.9 kcal mol À1 .
The CA charge set predicted by RFR models with ATSF, comparing to the static charge model, showed improvements for including the electrostatic potential variations seen by individually derived charge values from QM calculations in their dynamic charge values. Similar to hydration free energies, these values do not include the entropic penalties, therefore, do not reect the results measured from experiments.

Conclusions
Atom type symmetry function (ATSF) categorized atoms by their atom types dened by the properties and connectivity of atoms in MM force eld, beyond chemical elements in ACSF, and formed a more detailed structural description for furanoses that have complicated conformational spaces but limited chemical elements. Hence, the RFR models with ATSF produced more accurate predictions of CA charges and dipole moments than those with ACSF, which suggested ATSF obtained improvements in representing structural information for furanoses. The CA charge predicted by RFR models with ATSF, comparing to the static ensemble-averaged charges, employed in computing carbohydrate-water and carbohydrate-protein interaction energies showed a better agreement to those computed with individually derived RESP charges, which also demonstrated   that CA charges were able to include the electrostatic potentials variations into the dynamic charge values. Improvements achieved by ATSF in representing structural information for furanoses suggested that introducing structural perceptions to the descriptor and increasing the size of the coordinates could improve the performance of ACSF in describing furanoses. Furthermore, ATSF outperforming ANSF that had a signicant larger size of coordinates but removed all chemical or structural perceptions of atoms suggested that categorizing atoms by atom types generated a suitable size of coordinates that represented the key structural features for furanoses. Additionally, this categorizing scheme endued ATSF with the exceeding potent transferability to other biomolecules thanks to the broad implementations of MM force elds for biomolecules.

Conflicts of interest
There are no conicts to declare.