Advancing energy storage through solubility prediction: leveraging the potential of deep learning †

Solubility prediction plays a crucial role in energy storage applications, such as redox flow batteries, because it directly affects the efficiency and reliability. Researchers have developed various methods that utilize quantum calculations and descriptors to predict the aqueous solubilities of organic molecules. Notably, machine learning models based on descriptors have shown promise for solubility prediction. As deep learning tools, graph neural networks (GNNs) have emerged to capture complex structure–property relationships for material property prediction. Specifically, MolGAT, a type of GNN model, was designed to incorporate n-dimensional edge attributes, enabling the modeling of intricacies in molecular graphs and enhancing the prediction capabilities. In a previous study, MolGAT successfully screened 23467 promising redox-active molecules from a database of over 500000 compounds, based on redox potential predictions. This study focused on applying the MolGAT model to predict the aqueous solubility (log S) of a broad range of organic compounds, including those previously screened for redox activity. The model was trained on a diverse sample of 8494 organic molecules from AqSolDB and benchmarked against literature data, demonstrating superior accuracy compared with other state-of-the-art graph-based and descriptor-based models. Subsequently, the trained MolGAT model was employed to screen redox-active organic compounds identified in the first phase of high-throughput virtual screening, targeting favorable solubility in energy storage applications. The second round of screening, which considered solubility, yielded 12332 promising redox-active and soluble organic molecules suitable for use in aqueous redox flow batteries.
Thus, the two-phase high-throughput virtual screening approach utilizing MolGAT, specifically trained for redox potential and solubility, is an effective strategy for selecting suitable intrinsically soluble redox-active molecules from extensive databases, potentially advancing energy storage through reliable material development. This indicates that the model is reliable for predicting the solubility of various molecules and provides valuable insights for energy storage, pharmaceutical, environmental, and chemical applications.


Introduction
The continued use of fossil fuels has led to rising levels of carbon dioxide (CO2) in the atmosphere. As a result, renewable energy sources such as solar and wind power have been investigated as cleaner alternatives to fossil fuels. 1 However, these renewable sources are intermittent, so the energy they produce must be stored to balance energy supply and demand. To address this challenge, researchers are developing low-cost and scalable energy storage systems (ESS) using rechargeable batteries that can enable large-scale storage of renewable energy. 2,3 Lithium-ion, redox flow, lead-acid, sodium-sulfur (Na-S), and nickel-metal hydride (Ni-MH) systems are rechargeable battery systems that offer an efficient way to store energy. Although lithium-ion batteries are commonly used in applications, their use for grid-connected storage is limited by factors such as the limited availability of raw materials, the concentration of resources in specific geographic locations, and safety issues. 4 Many efforts are being made to improve the stability and cost-effectiveness of production and the environmental sustainability of materials used in lithium-ion batteries. 5 Redox flow batteries (RFBs) are promising energy storage alternatives for grid-scale applications due to their modular architecture, better scalability, low maintenance cost, and customizable operation. 6 Aqueous redox flow batteries (ARFBs) are a form of RFB that chemically store energy using two distinct redox-active species with different reduction potentials.
4 The negolyte and posolyte are pumped through an electrochemical cell composed of two electrodes (often carbon) isolated by an ion-selective membrane. The volume of the electrolyte storage tanks, quantities of redox-active chemicals, and differences in their reduction potentials affect the ARFB energy storage capacity. The power of an electrochemical cell stack is determined by its active species. This design allows the independent scaling of energy and power, which is not possible in enclosed batteries. ARFBs have been identified as highly promising alternatives for grid-scale energy storage owing to their scalability and unique design, which allows distinct control over power and energy output. 7 The most important components of ARFBs are the redox-active species, and their redox potential and solubility determine the overall energy density. 1,8 These redox-active organic compounds are integral to the functioning of redox flow batteries and serve as crucial components of electrolyte solutions. These compounds facilitate charge transfer between the electrodes, enabling the charge-discharge cycle of the cell. Unlike traditional metal-based electrodes, redox-active organic compounds offer several advantages, including higher energy density, a longer lifespan, and a lower cost. 9,10 Consequently, they have emerged as key elements in advanced battery technologies, providing a viable alternative to conventional rechargeable batteries for grid-stabilization applications.
11,12 Solubility prediction is essential in various fields, including ARFBs, because the poor solubility of organic molecules can lead to inefficient and unreliable systems. Hence, an accurate prediction of the solubility of compounds is vital for the design of efficient and reliable energy storage applications. The aqueous solubility of organic compounds is an important attribute to explore since it has a direct impact on the power density, energy capacity, and energy density of aqueous redox flow batteries. The variables that influence the solubility of organic molecules in water include electrostatic interactions with water, solvent reorganization, delocalized charge density in aromatic rings, entropic contributions, and inter- and intramolecular hydrogen bonding. The thermodynamics of organic molecule aqueous solvation is a complex process involving many distinct sorts of interactions. 7 Understanding and optimising organic compound aqueous solubility is therefore crucial for increasing the performance and efficiency of these next-generation energy storage devices. 7 It also plays a significant role in chemical design processes, environmental studies, and drug development applications. 13,14 In this context, the aqueous solubility of organic molecules has gained significant attention as a crucial property affecting various physical phenomena. 15 Several methods can be used to predict the water solubility of organic compounds based on their chemical structure. Despite efforts to develop different models for precisely calculating water solubility, determining the solubility of organic compounds remains a challenging and time-consuming task. 14,16 Scholars have investigated four approaches for solving this problem. First, empirical methods such as the generalized solubility equation (GSE) are used. 17 Second, quantitative approaches based on structure-property relationships (QSPR) and cheminformatic descriptors were utilized.
18 Third, physics-based methods, such as Monte Carlo simulations, molecular dynamics (MD), and first-principles simulations, have been used to reliably predict the reaction energy. 19,20 Finally, data-driven techniques have been employed to address the challenges of solubility prediction. 21,22 Machine learning (ML) has been recognized as an important data-driven method in a wide range of scientific domains, including materials science. 23 ML has been successfully used to predict solubility using molecular property descriptors known as extended connectivity fingerprints (ECFP). 18 Delaney 24 developed a method for predicting solubility that utilized a dataset of organic compounds with known solubility values. This approach relies on molecular descriptors derived from the molecular structure, which is valuable when experimental data are limited. Delaney concluded that parameters such as the fraction of atoms in the ring, molecular weight, and number of rotatable bonds play crucial roles in predicting aqueous solubility. In another study by Delago, 25 quantum chemical descriptors and statistical methods were employed to develop solubility prediction models. The results showed the effectiveness of the descriptors in predicting solubility, highlighting their significance in terms of health implications.
7][28][29][30][31][32] GNNs provide an approach to modelling molecules by representing them as graphical structures. In this representation, atoms act as nodes and bonds serve as edges, allowing GNNs to capture the connectivity and structural relationships within molecules. By harnessing this graph-based representation, GNNs have been shown to be able to predict the properties and characteristics of molecules. Several studies have demonstrated the potential of GNNs for tackling physical problems, including condensed matter physics and high-energy particle physics. For example, Thais et al. 33 and Shlomi et al. 32 employed GNNs to classify the particle interactions in high-energy particle physics. Additionally, Sanchez-Gonzalez et al. 26 and Jaensch et al. 34 demonstrated how GNNs can be used to simulate complex physics systems. Furthermore, GNNs excel in predicting material properties. Ward et al. 35 demonstrated how machine learning algorithms accurately predict the characteristics of both crystalline and amorphous materials. Similarly, Dai et al. 36 demonstrated the power of a GNN model across various microstructures. Recently, Zhang et al. 37 proposed an approach using a GNN with a representation that outperformed traditional machine learning models for predicting material properties. The success stories of applying GNNs in various domains highlight their potential and effectiveness in advancing research and understanding in physics, material science, and other fields. Raissi et al. 38 introduced an approach called physics-informed neural networks (PINNs), which combines deep learning techniques with partial differential equations (PDEs). This innovative method has been used to address various challenges, including fluid dynamics and electromagnetism.
GNNs are capable of learning and generalizing from molecular graph-structured data, making them well-suited for predicting molecular properties such as redox potential, as demonstrated by Chaka et al. 39 This has important implications for using the GNN model for predicting the aqueous solubility of new materials for energy storage applications. Consequently, the application of GNNs to predict organic molecule solubility is crucial for the development of redox flow batteries with high efficiency and reliability, as well as numerous other uses in various industries. In this study, we used MolGAT 39 together with a commonly used molecular graph format to train the model on the AqSolDB 15 dataset to estimate the intrinsic water solubility of the organic molecules. The aim was not to develop a new modelling framework but rather to predict the solubility of, and classify, the compounds that had already been screened as potentially beneficial redox-active materials in our previous work. We make several key contributions to this goal. First, we performed a thorough comparison of all commonly used atomic feature representations to identify the most appropriate one, and then excluded the atom type from the feature list, following a detailed analysis of the model's errors, so that the model better learns the underlying molecular structures. Second, we benchmarked the performance of the model against the experimental solubility data from the literature. Third, virtual screening using the trained model was performed by predicting the intrinsic aqueous solubility of the promising redox-active compounds from our previous work. Hence, this effort will increase the usefulness of our screened compounds, which have redox-active characteristics and adequate solubility for energy storage applications, including redox flow batteries.

Methodology
In deep learning, fingerprint-based models provide a compact representation of material structures, thereby facilitating simpler training and data management. The extended connectivity fingerprint (ECFP), also referred to as the Morgan fingerprint (descriptor), is one of the most commonly utilized representations in this field. 40 However, graph-based models are critical for predicting complex structure-property relationships for both condensed matter systems and molecules. 42 Graphs are a natural way to represent data in various problem areas. These graphs consist of sets of elements that have relationships and interactions with each other. This natural representation proves to be powerful in capturing the essence of the data. GNNs excel at learning functions that operate on graphs by including strong relational inductive biases. By incorporating domain knowledge concerning nodes (atoms) into the underlying topology of the artificial neural network model architecture, GNNs provide a considerable advantage. 43 The use of GNNs has grown rapidly in recent years, with numerous applications in a wide range of scientific and technical disciplines. 33,44,45 A critical component of the design of GNNs is the message-passing system, which allows information to be sent along the graph's edges. This framework has been successfully applied to various graph types including molecular and crystal graphs. Notably, in the case of molecular graphs, message passing neural networks (MPNNs) 46 with edge features that capture bond information are employed. Furthermore, the application of GNNs to crystal structures has been investigated, for example in MEGNet, 41 which combines geometric information and leverages global parameters, such as temperature, which is especially important for solid-state crystalline systems.
The MolGAT model, introduced by Chaka et al., 39 is a type of GNN that learns molecular structures, bond attributes, and atomic properties using attention-based message passing techniques. When compared to other graph-based models, this model has demonstrated promising performance in predicting the redox potential of organic molecules. 39 In this particular study, the MolGAT model was trained using the AqSolDB dataset to accurately predict the aqueous solubility of various organic molecules. The mathematical representation of this model is as follows:

$$h_i^{(l+1)} = \sigma\Big(\sum_{j \in N(i) \cup \{i\}} \alpha_{ij}^{(l)} \, \Theta^{(l)} h_j^{(l)}\Big) \qquad (1)$$

In eqn (1), the updated feature vector for node i in a particular layer of MolGAT is represented as $h_i^{(l+1)}$. It is computed from the feature vectors $h_j^{(l)}$ of the previous layer. In this equation, we sum over all nodes j connected to node i in the graph, including node i itself, and the set of neighbours of node i is denoted as N(i). The matrix $\Theta^{(l)}$ is the weight matrix of layer l in our model. This weight matrix is learned during training and determines how the information from neighbouring nodes is combined to update the feature vector of node i. Finally, we apply an element-wise nonlinear activation function such as the sigmoid or ReLU, denoted as $\sigma(\cdot)$. This activation function provides nonlinearity and allows the model to capture interactions between nodes in the molecular graph.
In eqn (1), the term $\alpha_{ij}^{(l)}$ refers to a multi-head attention mechanism that is crucial for dynamically integrating inputs from neighboring nodes and edge characteristics. During the training process, the attention coefficients are learned, allowing the model to focus on the important edge attributes. This significantly enhances the capacity of the model to capture meaningful information from molecular graphs. This attention method works on the basis that the atoms (nodes) do not contribute equally, allowing it to simultaneously pay attention to many graph aspects. As a result, the MolGAT model provides a complete representation of the molecular graph, making it useful for predicting molecular properties. This attention mechanism can be mathematically represented as follows:

$$\alpha_{ij}^{(l)} = \frac{\exp\Big(\mathrm{LeakyReLU}\big(W_a^{\top}\big[\Theta^{(l)} h_i^{(l)} \,\|\, \Theta^{(l)} h_j^{(l)} \,\|\, \Theta^{(l)} e_{ij}^{(l)}\big]\big)\Big)}{\sum_{k \in N(i) \cup \{i\}} \exp\Big(\mathrm{LeakyReLU}\big(W_a^{\top}\big[\Theta^{(l)} h_i^{(l)} \,\|\, \Theta^{(l)} h_k^{(l)} \,\|\, \Theta^{(l)} e_{ik}^{(l)}\big]\big)\Big)} \qquad (2)$$

The attention coefficient $\alpha_{ij}^{(l)}$ in eqn (2) provides the weighting factor for a particular edge (i,j) connecting atoms i and j in the molecular graph at a specific layer l. It compares the importance of the edge between atoms i and j with that of the other edges (i,k) connected to atom i, so that the resulting coefficient represents the relative significance of the edge between atoms i and j. A learnable weight matrix $W_a$ is used in this calculation, along with a Leaky Rectified Linear Unit (LeakyReLU) activation function. Attention scores are obtained by taking the dot product between the weight matrix $W_a$ and the concatenation of the transformed node and edge feature vectors ($\Theta^{(l)} h_i^{(l)} \,\|\, \Theta^{(l)} h_j^{(l)} \,\|\, \Theta^{(l)} e_{ij}^{(l)}$).
The transformation process involves applying the linear transformation defined by the diagonal matrix of learnable parameters $\Theta^{(l)}$ at layer l to the node feature vectors ($h_i^{(l)}$ and $h_j^{(l)}$) and the edge feature vector ($e_{ij}^{(l)}$). These transformed feature vectors are concatenated (denoted by the symbol $\|$), and an exponential function (exp) is applied to the resulting scores, which are then normalized over all edges connected to atom i.
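As an illustration of the update described by eqn (1) and eqn (2), the following is a minimal single-head sketch in plain Python, not the MolGAT implementation itself. The function name `attention_layer`, the separate edge transform `theta_e`, the zero edge vector used for self-loops, the LeakyReLU slope of 0.2, and the ReLU output activation are all assumptions made for this example.

```python
import math


def leaky_relu(x, slope=0.2):
    # Slope 0.2 is an assumed value for illustration.
    return x if x > 0 else slope * x


def matvec(matrix, vector):
    # Plain matrix-vector product on nested lists.
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def attention_layer(h, edge_attr, theta, theta_e, w_a):
    """One single-head attention update over a small graph (hypothetical helper).

    h         : list of node feature vectors
    edge_attr : dict mapping (i, j) to the feature vector of the edge from j into i
    theta     : weight matrix applied to node features
    theta_e   : weight matrix applied to edge features (assumed separate here)
    w_a       : attention weight vector over [theta h_i || theta h_j || theta_e e_ij]
    """
    edge_dim = len(theta_e[0])
    zh = [matvec(theta, x) for x in h]
    new_h, alphas = [], []
    for i in range(len(h)):
        # N(i) plus node i itself, as in the sum of eqn (1).
        neighbours = sorted({j for (t, j) in edge_attr if t == i} | {i})
        scores = []
        for j in neighbours:
            # Self-loops have no bond; a zero edge vector is an assumption.
            e_ij = edge_attr.get((i, j), [0.0] * edge_dim)
            concat = zh[i] + zh[j] + matvec(theta_e, e_ij)
            scores.append(leaky_relu(sum(w * c for w, c in zip(w_a, concat))))
        # Softmax over the edges of node i (shifted by the max for stability).
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        alpha = [w / total for w in weights]
        # Attention-weighted sum of transformed neighbour features, then ReLU.
        out = [sum(a * zh[j][d] for a, j in zip(alpha, neighbours))
               for d in range(len(zh[i]))]
        new_h.append([max(0.0, v) for v in out])
        alphas.append(dict(zip(neighbours, alpha)))
    return new_h, alphas
```

Calling `attention_layer` on a three-atom chain returns the updated node features together with one dictionary of attention coefficients per node; each dictionary sums to one, which is the normalization expressed by the denominator of eqn (2).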

Dataset and preprocessing
To train our MolGAT model, we used the extensive and diverse AqSolDB dataset described in ref. 15. AqSolDB is a valuable resource that is freely accessible to researchers, and encompasses 9982 distinct compounds. To provide robustness in the aqueous solubility data, this dataset comprises a wide range of solubility values and 2D descriptors acquired from multiple sources. We did not use the provided 2D descriptor information because our deep learning model runs on graphs. Instead, we used the RDKit and PyTorch Geometric (PyG) tools to convert the SMILES string representations, as indicated in Table S1 and Fig. S1 (ESI †), to a graph format.
The data preprocessing of AqSolDB involved transforming the SMILES strings representing chemical structures into molecular graphs compatible with PyG, which is a powerful library for deep learning on graphs. Initially, SMILES strings were parsed using RDKit to extract essential information such as atom types, bond types, and connectivity patterns. Subsequently, the extracted molecular representations were converted to a graph format suitable for PyG. This involves mapping atoms to nodes and the bonds between atoms to edges in the molecular graphs. Additionally, node features and edge attributes were incorporated to capture pertinent information about the molecular structure and properties, resulting in 30 atomic and 12 edge features, as shown in Table 1. With the construction of molecular graphs and the inclusion of relevant features, the data were prepared for further processing and model training using PyG. This comprehensive preprocessing pipeline enabled the application of graph neural networks and subsequent solubility predictions using the transformed molecular data.
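The atom-to-node and bond-to-edge mapping can be sketched without any chemistry library. The example below is only illustrative: it assumes the bonds have already been extracted (in the study this is done with RDKit), and the helper name `bonds_to_edge_index` and the toy one-element bond features are hypothetical. It follows the PyG convention of storing each undirected bond as two directed edges in a two-row `edge_index` list.

```python
def bonds_to_edge_index(bonds):
    """Turn undirected bonds into a PyG-style directed edge list (hypothetical helper).

    bonds : list of (begin_atom, end_atom, bond_features) tuples
    Returns (edge_index, edge_attr), where edge_index is [sources, targets]
    and edge_attr holds one feature vector per directed edge.
    """
    sources, targets, edge_attr = [], [], []
    for begin, end, features in bonds:
        # Each bond contributes one edge in each direction with the same features.
        sources += [begin, end]
        targets += [end, begin]
        edge_attr += [list(features), list(features)]
    return [sources, targets], edge_attr


# Ethanol (SMILES "CCO"): atoms 0 = C, 1 = C, 2 = O, joined by two single bonds.
# The single bond feature [1.0] is a toy stand-in for the 12 edge features.
ethanol_bonds = [(0, 1, [1.0]), (1, 2, [1.0])]
edge_index, edge_attr = bonds_to_edge_index(ethanol_bonds)
```

For ethanol this gives four directed edges from two bonds, so a model that passes messages along `edge_index` sees every bond from both of its atoms.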

Data splitting strategy and hyperparametrization
The AqSolDB dataset, which provides significant data for solubility prediction, was split into two subsets: training and testing. Partitioning was accomplished using a stratified random sampling procedure. The training subset consisted of 6795 (80%) entries and was used to train the MolGAT model. The testing subset received the remaining 1699 (20%) entries, which were not used during training. Instead, they were used to assess the performance and generalization capabilities of the trained model. In addition to splitting the AqSolDB dataset for training and testing, the hyperparameter setup of the solubility prediction model was optimized to improve its performance. This process involved fine-tuning the hyperparameters using Optuna, a hyperparameter optimization framework, and leveraging PyTorch Lightning to streamline the optimization procedure. By systematically adjusting the hyperparameters, we aimed to identify the optimal configuration that would yield the best results for solubility prediction. Various hyperparameters were explored, including the number of fully-connected layers (num_fc_layer), convolution layers (num_conv_layers), attention heads (num_heads), hidden dimensions (hidden_dim), batch size, dropout rate, and training epochs. The best configuration was selected once the optimization was complete. The optimized configuration included setting the total number of fully-connected layers to four, the number of graph convolution layers to three, the number of attention heads to four, the number of hidden dimensions to 192, the batch size to 64, the dropout rate to 0.1, and training for 251 epochs. The test results were examined to assess the performance of the model with the best hyperparameter settings.
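A stratified 80/20 split can be sketched as follows. The text does not state which variable was used for stratification, so this example assumes strata defined by log S ranges; the function name `stratified_split`, the bin edges, and the seed are hypothetical choices for illustration only.

```python
import random
from collections import defaultdict


def stratified_split(log_s, edges=(-6.0, -4.0, -2.0), test_frac=0.2, seed=0):
    """Split sample indices into train/test, preserving the log S distribution.

    Stratifying on log S bins is an assumption; the study only states that
    stratified random sampling was used.
    """
    rng = random.Random(seed)
    strata = defaultdict(list)
    for index, value in enumerate(log_s):
        # The stratum id is the number of bin edges lying below the value.
        strata[sum(value > edge for edge in edges)].append(index)
    train, test = [], []
    for key in sorted(strata):
        members = strata[key]
        rng.shuffle(members)
        n_test = round(len(members) * test_frac)
        test += members[:n_test]
        train += members[n_test:]
    return train, test
```

Because the test fraction is applied within each stratum, sparsely populated solubility ranges are represented in both subsets in roughly the same proportion as in the full dataset.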

Model training workflow
The workflow of the proposed approach can be divided into four phases. First, the SMILES notation representing the molecular structure was processed to convert it into a graph representation suitable for the MolGAT model. This involves the extraction of relevant features and the encoding of molecular graphs. Second, the MolGAT model was trained using a training dataset in which predictions were made based on the input molecular features and corresponding experimental log S values. During this step, the model parameters were optimized through iterative training. Third, the prediction performance of the trained MolGAT model was benchmarked using an external dataset containing experimental log S values for molecules that were not used in the training phase. The model predicts log S values for these compounds, which are then compared to the experimental values to assess the reliability and generalization capacity of the model. Finally, high-throughput screening was performed using the trained and optimized MolGAT model. This requires running the model on a large number of molecules to estimate their log S values. This approach can help effectively screen enormous chemical spaces and identify promising chemicals for further experimentation or drug discovery using the trained model.
To predict solubility using a graph-based approach, molecular graph attention networks (MolGAT) were employed to learn molecular embeddings and capture both local and global features of the molecules. This model consists of multiple graph attention convolution layers (MolGATConv1, MolGATConv2, and MolGATConv3) and three fully connected linear layers (Linear1, Linear2, and Linear3). The MolGAT model aggregates global attributes using a readout function, resulting in a comprehensive molecular representation. The final linear layer (Linear3) outputs the prediction of the aqueous solubility based on the learned molecular features. This enabled the model to make accurate predictions regarding the solubility of new organic molecules based on their molecular graph representations. Overall, the model had 699 793 parameters and generates a scalar output from the final fully connected output (fc_out) layer, as shown in Table 2. Table 2 also provides a concise summary of the graph neural network layers, layer names, layer output shapes, and their corresponding trainable parameters for predicting the aqueous solubility of organic molecules using the MolGAT model.

Model training performance
Unlike descriptor-based models, the MolGAT model is a graph-based neural network architecture that predicts molecular properties by considering molecular structures in addition to atomic and bond attributes. This MolGAT model was trained using the AqSolDB dataset to predict the aqueous solubility of organic compounds. The Adam optimizer was used to handle the noisy, high-dimensional optimization of the graph neural network weights. The log S values were standardized to improve the performance of the MolGAT model by calculating the mean and standard deviation of the AqSolDB data. The transformation equation (data − mean)/std was applied to the target variable (log S) to ensure that it has a mean of zero and standard deviation of one. After training, the predictions were rescaled back using the inverse transform equation: (data × std) + mean.
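The standardization and its inverse can be written in a few lines. This is a minimal sketch with hypothetical helper names (`fit_scaler`, `standardize`, `rescale`); the use of the population standard deviation is an assumption, since the text does not say which estimator was used.

```python
import statistics


def fit_scaler(values):
    # Population standard deviation is an assumed choice for this sketch.
    return statistics.fmean(values), statistics.pstdev(values)


def standardize(values, mean, std):
    # (data - mean) / std, applied to the log S targets before training.
    return [(v - mean) / std for v in values]


def rescale(values, mean, std):
    # (data * std) + mean, applied to model outputs after training.
    return [v * std + mean for v in values]


# Toy log S values, invented for illustration only.
log_s = [-1.2, -3.4, -0.5, -6.1, -2.8]
mean, std = fit_scaler(log_s)
scaled = standardize(log_s, mean, std)
restored = rescale(scaled, mean, std)
```

Here `scaled` has zero mean and unit standard deviation, and `restored` reproduces the original values up to floating-point rounding, so predictions made in the standardized space can be reported in log S units.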
The model achieved a mean absolute error (MAE) of 0.322, root mean squared error (RMSE) of 0.432, and R 2 value of 0.98    phase of the MolGAT model.This helps to understand the significance and reliability of atoms and bonds, providing insights into the input graph components that contribute the most to the model's predictions.The plot was generated by calculating the gradients of the model's output for the input graph elements, such as nodes and edges.These gradients capture how the output of the model changes when small perturbations are applied to input elements.By visualizing these gradients, the saliency map effectively highlights the elements that have the greatest impact on a model's predictions.This emphasizes the significance of incorporating molecular structures into GNN models, showing their superior ability to learn from the structure of a molecular graph when compared to other descriptor-based models.
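The idea behind a gradient-based saliency map can be illustrated numerically. The sketch below approximates the gradient of a model output with respect to each input element by central finite differences, which is a stand-in for the gradient computation described above and not the procedure used in the study; the function name `saliency`, the step size, and the toy linear model are assumptions.

```python
def saliency(model, features, eps=1e-4):
    """Absolute sensitivity of a scalar model output to each input element.

    Uses central finite differences as a numerical stand-in for the
    gradients described in the text (an assumption of this sketch).
    """
    scores = []
    for k in range(len(features)):
        up = list(features)
        down = list(features)
        up[k] += eps      # small positive perturbation of element k
        down[k] -= eps    # small negative perturbation of element k
        scores.append(abs(model(up) - model(down)) / (2 * eps))
    return scores


def toy_model(x):
    # Hypothetical linear "model": the third input has no influence.
    return 2.0 * x[0] - 0.5 * x[1] + 0.0 * x[2]


scores = saliency(toy_model, [1.0, 1.0, 1.0])
```

For this toy model the scores are close to 2.0, 0.5, and 0.0, so the first input would be highlighted most strongly, in the same way that atoms or bonds with large gradients stand out in a saliency map.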
We also evaluated the performance of the MolGAT model by quantifying its effectiveness relative to other models using three evaluation metrics: R2, MAE, and MSE. These metrics offer a comprehensive assessment of the predictive capabilities of the model and provide insights into the overall level of observed errors, as illustrated in Table 3. The parity plots of the benchmarked models can be found in ESI † from

Benchmarking solubility prediction
The aqueous solubility prediction capabilities of the MolGAT model were validated by comparing the predicted log S values with experimental log S values from the literature. 47 The source provides four solubility datasets based on different solvents: acetone, benzene, ethanol, and water. In this benchmarking, we used only the water solubility dataset, which consists of 900 molecules.
After encoding the molecules from this dataset in molecular form with the assistance of RDKit library tools, we obtained a set of molecules that do not overlap with the AqSolDB data. A set of randomly selected organic molecules with experimental water solubility was drawn from this external validation data, and the comparison between the experimental log S values (log S expt (mol L−1)) and the predicted log S values (log S pred (mol L−1)) generated by the MolGAT model for each molecule is shown in Table 4, together with the corresponding SMILES representations. The water solubilities estimated by the MolGAT model were very close to the values reported in the literature source, 47 with an MAE of 0.50 and an MSE of 0.42. The resulting Δ log S (mol L−1) values indicate that the MolGAT model has a strong generalization ability when predicting the aqueous solubility of organic compounds beyond those included in the training dataset, which is an essential requirement for performing high-throughput screening tasks.
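For reference, the error metrics used in the benchmarking (MAE, MSE, RMSE, and R2) can be computed as in the sketch below. The helper name `regression_metrics` and the four toy value pairs are invented for illustration and are not taken from Table 4.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return MAE, MSE, RMSE and R2 for paired experimental/predicted values."""
    n = len(y_true)
    errors = [pred - true for true, pred in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)   # total sum of squares
    ss_res = sum(e * e for e in errors)                  # residual sum of squares
    return {"MAE": mae, "MSE": mse, "RMSE": math.sqrt(mse), "R2": 1 - ss_res / ss_tot}


# Toy values (not from the paper): experimental vs. predicted log S.
log_s_expt = [-1.0, -2.0, -3.0, -4.0]
log_s_pred = [-1.5, -2.0, -2.5, -4.0]
metrics = regression_metrics(log_s_expt, log_s_pred)
```

Note that MAE is reported in log S units while MSE is in squared units, so the two numbers are not directly comparable; RMSE brings the squared error back to log S units.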

Solubility prediction comparison
To comprehensively assess the predictive capabilities of the MolGAT model, an analysis was conducted to compare its performance with those of other established prediction models. This evaluation involved examining and comparing the outcomes obtained from our model against validation data with experimental log S findings documented in the relevant literature. 47 By conducting a comparative assessment of MolGAT with various descriptor- and graph-based machine learning models, we can identify the best-performing model based on the respective log S pred values. All of the models demonstrate reasonable predictive performance when compared to the experimental values (log S expt), as indicated in Table 5.
In the comparison provided in Table 5, the water solubility values of various compounds were evaluated using both the experimental data and predictions from the MolGAT, attentive fingerprint (AttentiveFP), message passing neural network (MPNN), and graph attention network (GAT) models. The compounds are identified by their SMILES notation, and the predicted values are presented in the model columns for log S pred, whereas the experimental water solubility values are displayed in the log S expt column. This comparative analysis has two primary objectives. First, we evaluated the predictive performance of our model. This involves assessing the ability to accurately predict outcomes based on the given data. Second, we investigated the influence of utilizing diverse datasets that differ from the training set. By doing so, we aimed to understand how our model performs when applied to different scenarios, specifically in the context of high-throughput screening. In general, the MolGAT model demonstrated good agreement with the experimental results. In most cases, the predicted log S values closely aligned with the observed log S values, indicating the reliability of the model in making accurate predictions. These findings provide confidence in the effectiveness of our model for filtering potential redox-active molecules with desirable aqueous solubility, as previously explored in our research.

Funneling molecules with solubility prediction
To identify redox-active molecules suitable for energy storage applications, a two-step virtual screening process was implemented, resembling a funneling approach. The objective is to narrow down the vast chemical space to a select group of molecules that exhibit both redox activity and favorable solubility properties. The initial phase involved utilizing a MolGAT model trained on the RedDB 48 dataset, which yielded 23 467 redox-active molecules. This model predicted the redox potential of a large set of organic compounds based on their molecular graph representations. The aim was to identify potential candidates for aqueous redox flow batteries, thereby refining the search space. Building on these results, the second phase focuses on refining the selection further.
In the second phase, the MolGAT model specified in Section 2.3 was trained specifically for solubility prediction using the AqSolDB dataset (Section 2.1) and the features listed in Table 1. This trained model was then applied to the 23 467 redox-active molecules identified in the first phase to assess their solubility properties. As a result, 12 332 promising molecules with both redox activity and favorable solubility characteristics were identified. By employing this funnelling approach, the virtual screening process effectively filtered out a significant portion of the initial pool, leaving behind a smaller yet highly relevant subset of redox-active molecules with desirable solubility properties. The refined selection serves as a valuable starting point for further experimental investigations and potential utilization in energy storage devices such as redox flow batteries.
Table 6 provides comprehensive information about a collection of screened molecules and their corresponding properties.The table includes columns for SMILES string notations, predicted redox potential, predicted log S values derived from the MolGAT model prediction, and an indication of aqueous solubility (isSoluble) to determine if the molecule is soluble or insoluble.These findings are crucial in assessing the solubility characteristics of molecules and their relevance in various applications.
Table 7 shows the number of compounds screened at each stage of the virtual screening process using the MolGAT model. The first column, titled ''Number of Compounds'', shows that 500 000 compounds were initially considered for screening. In the second column, named ''Promising Redox Active Compounds Screened'', 23 467 compounds were identified as promising redox-active molecules. This means that these compounds can undergo chemical reactions involving electron transfer, known as redox reactions. The third column, labeled ''Redox Active and Soluble Compounds Screened'', indicates that further refinement resulted in 12 332 compounds that not only exhibited redox activity but also displayed favorable solubility characteristics. These findings highlight a screening approach aimed at narrowing down the pool of compounds by considering both redox activity and solubility.
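As a quick check on how strongly each stage narrows the pool, the reported counts can be turned into retention fractions. The sketch below uses only the three counts stated in the text; the variable names are illustrative, and because the initial pool is elsewhere described as "over 500 000" compounds, the fractions relative to it are approximate.

```python
# Counts reported for the two screening stages.
initial_pool = 500_000
redox_active = 23_467
redox_active_and_soluble = 12_332

# Fraction of molecules kept by each stage and by the whole funnel.
phase1_retention = redox_active / initial_pool
phase2_retention = redox_active_and_soluble / redox_active
overall_retention = redox_active_and_soluble / initial_pool

print(f"phase 1 keeps {phase1_retention:.1%} of the initial pool")
print(f"phase 2 keeps {phase2_retention:.1%} of the redox-active molecules")
print(f"overall, {overall_retention:.1%} of the initial pool remains")
```

With these counts, the first phase keeps roughly 4.7% of the initial pool, the second phase keeps roughly 52.6% of the redox-active molecules, and about 2.5% of the initial pool remains overall.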
The bin edges [−∞, −6, −4, −2, ∞] were employed to define distinct log S categories for the screened molecules. Specifically, log S values below −6 were categorized as ''insoluble'', values ranging from −6 to −4 were labeled as ''slightly soluble'', those falling between −4 and −2 were designated as ''moderately soluble'', and values exceeding −2 were classified as ''highly soluble''. These categorized ranges represent the different solubility levels of the screened molecules, as shown in Fig. 3a. After binning and removing duplicates, the examination of the solubility (log S) categories revealed that among the 12 332 promising, soluble, and redox-active organic molecules obtained from the two phases of high-throughput screening, 3287 molecules were classified as moderately soluble, whereas 9045 molecules fell into the category of highly soluble compounds, as shown in Fig. 3b. These findings provide important information regarding the distribution and solubility of the screened compounds within the defined log S categories. Furthermore, the polar and non-polar functional groups that may favor or hinder the solubility of the screened promising redox-active molecules were also analyzed for further fine-tuning, as shown in Fig. 4.
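The binning described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' code. The treatment of values that fall exactly on a bin edge is not specified in the text, so the sketch assumes right-closed intervals (a value equal to an edge goes to the lower category). The `is_soluble` rule, which treats the ''moderately soluble'' and ''highly soluble'' categories as soluble, is likewise an assumption, chosen because the 12 332 screened molecules fall entirely into these two categories. The example log S values are made up.

```python
from bisect import bisect_left
from collections import Counter

# Interior bin edges; the outer edges are -infinity and +infinity.
EDGES = [-6.0, -4.0, -2.0]
LABELS = ["insoluble", "slightly soluble", "moderately soluble", "highly soluble"]


def solubility_category(log_s: float) -> str:
    """Map a log S value to its category.

    bisect_left places a value equal to an edge in the lower bin,
    i.e. the intervals are right-closed (an assumption).
    """
    return LABELS[bisect_left(EDGES, log_s)]


def is_soluble(log_s: float) -> bool:
    """Assumed rule: moderately or highly soluble counts as soluble."""
    return solubility_category(log_s) in ("moderately soluble", "highly soluble")


# Made-up log S values used only to exercise the functions.
example_log_s = [-7.2, -5.0, -3.5, -1.0, -0.3]
category_counts = Counter(solubility_category(v) for v in example_log_s)
print(category_counts)
```

A sorted edge list with a binary search keeps the category boundaries in one place, so changing a threshold does not require touching the classification logic.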
The solubility of organic compounds depends on the balance between polar and nonpolar interactions. Polar functional groups generally enhance solubility by forming hydrogen bonds or undergoing ionization, whereas nonpolar functional groups decrease solubility by limiting interactions with polar solvents such as water. Fig. 4a shows the prevalence of polar functional groups, which enhance solubility, in the promising screened molecules that exhibit both solubility and redox activity, and Fig. 4b shows the number of non-polar functional groups, which might hinder solubility. Both distributions resemble those of the polar and non-polar functional groups of AqSolDB in Fig. S5 (ESI †). Furthermore, Fig. 5 showcases randomly sampled screened molecules along with their corresponding log S values, providing additional insight into the solubility characteristics of the dataset.
In this study, we also developed the MolGAT Solubility Predictor, a web application based on the MolGAT model, to predict the solubility of organic compounds. The application can be accessed at https://molgat.streamlit.app. Researchers can input a SMILES string representing the molecular structure and obtain a rapid prediction of the aqueous solubility. The availability of the MolGAT Solubility Predictor facilitates easy access to solubility prediction capabilities and encourages collaboration among researchers in this field.

Conclusions
In this study, the MolGAT model was used to predict aqueous solubility, and its performance was compared with that of common models. The effectiveness of the model for solubility prediction was evaluated by comparing it with other GNN models, such as AttentiveFP, MPNN, D-MPNN, and GAT, and with a descriptor-based model (Random Forest). The model, which was created for material property prediction, uses GNN-based molecular representation learning with n-dimensional edge features and achieved superior performance compared with the other models. To confirm the reliability and generalizability of the model, its predictions were tested against experimental log S values from prior studies, utilizing a large collection of solubility data. Furthermore, we found that removing atom types from the atomic features, which had been included for redox-potential prediction in our previous study, had no effect on the performance of the MolGAT model. This finding not only shows the efficiency of the model but also streamlines the training process by lowering the computational complexity.
Two-step virtual screening was used to identify redox-active molecules with favorable solubility properties for energy storage applications. This progressive screening approach allowed us to narrow the pool of compounds by considering both redox activity and solubility. Table 7 summarizes the number of compounds screened at the different stages using the MolGAT model. In the initial phase, carried out in our previous study, the MolGAT model was trained on the RedDB dataset (ref. 48) and used to screen 23 467 redox-active molecules from an initial pool of over 500 000 compounds. In this study, the screening of redox-active organic compounds was enhanced by considering their aqueous solubility (log S), which yielded 12 332 promising molecules with favorable solubility for redox flow battery applications. This demonstrates that the MolGAT model can significantly reduce the search space of organic compounds when screening promising organic molecules for energy storage applications by high-throughput virtual screening. These carefully screened compounds lay the foundation for further investigation based on computational and experimental approaches.
In conclusion, the advantages of utilizing graph neural networks, particularly the MolGAT model, were demonstrated for predicting the molecular properties of organic compounds by leveraging their molecular graph representations with n-dimensional atomic and bond attributes. In addition, the two-step virtual screening process successfully identified redox-active organic molecules with desirable aqueous solubility, making them promising candidates for energy storage applications, particularly in redox flow batteries. Furthermore, this research may contribute to the advancement of energy storage systems through the development of efficient and reliable redox-active materials applicable to aqueous redox flow batteries. Finally, the trained model can predict the solubility of diverse compounds, providing useful insights for medicinal, environmental, and chemical applications.
on the training dataset. It also generalized reasonably well on the validation dataset, with an MAE of 0.393, an RMSE of 0.540, and an R² of 0.97, as shown in Fig. 1a and b and in the training and validation error plot in Fig. S6 (ESI †). This level of performance suggests that the MolGAT model effectively learns to map molecular graphs to solubility values. The graph-based architecture allows it to leverage both the structural information (atoms and bonds) and the properties of individual atoms. The attention mechanism helps the model to focus on the parts of the molecule that are most relevant for predicting solubility. By obtaining a low error and a high R² on both the training and validation sets, MolGAT demonstrated its ability to predict the aqueous solubility of organic compounds from their molecular structures. The generalization of the model on the test dataset as well as on the experimental dataset from the literature suggests that it does not overfit the training data. The saliency map plot in Fig. 2 shows the impact of the nodes and edges on a molecular graph during the training

Fig. 1 MolGAT model training and validation performance on AqSolDB (a) the parity plot of predicted against target aqueous solubility values, and (b) the model train and test losses.

Fig. 2 Saliency map with log S values for the MolGAT model, showing which nodes and edges of a molecular graph have an impact during training.
Fig. S8-S12. The findings indicate that the MolGAT model achieves a superior R² score and lower values for both MAE and MSE, indicating higher accuracy and reduced prediction errors. The Random Forest (RF) model is a descriptor-based model, whereas the remaining models, namely MPNN (Message Passing Graph Neural Networks), GCM (Graph Convolution Model), AttentiveFP (Attentive FingerPrint), GAT (Graph Attention Network), and MolGAT (Molecular Graph Attention Network), are all graph-based models. For RF, the descriptors were generated as Morgan fingerprints with a radius of four (4) and a length of 2048, while the atomic and bond features for the graph-based models were generated according to the requirements of each individual model.
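For readers who want to reproduce the comparison metrics named above, the following is a minimal sketch of MAE, MSE, RMSE, and R² in plain Python. It assumes that R² denotes the coefficient of determination, 1 − SS_res/SS_tot; the text does not state which definition was used. The function name and the toy values are illustrative and are not data from the study.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return MAE, MSE, RMSE, and R² for paired target and predicted values."""
    n = len(y_true)
    if n == 0 or n != len(y_pred):
        raise ValueError("inputs must be non-empty and of equal length")
    errors = [p - t for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    mean_true = sum(y_true) / n
    ss_res = sum(e * e for e in errors)                  # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)   # total sum of squares
    if ss_tot == 0:
        raise ValueError("R² is undefined when all target values are identical")
    return {
        "mae": mae,
        "mse": mse,
        "rmse": math.sqrt(mse),
        "r2": 1.0 - ss_res / ss_tot,
    }


# Made-up log S values used only to exercise the function.
toy_true = [-1.0, -2.0, -3.0, -4.0]
toy_pred = [-1.5, -2.0, -2.5, -4.0]
metrics = regression_metrics(toy_true, toy_pred)
print(metrics)
```

Because RMSE is the square root of MSE, a ranking of models by one is identical to a ranking by the other; MAE is less sensitive to a few large errors, which is why the two are often reported together.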

Fig. 3 The screened molecules in the first and second phases with MolGAT. (a) Redox-active molecules categorized based on their solubility (log S) from the first-phase screening. (b) Redox-active and soluble molecules screened in both the first and second phases, categorized based on their solubility (log S).

Fig. 4 The distribution of dominant functional groups in promising soluble and redox-active molecules screened using MolGAT: (a) polar functional groups and (b) non-polar functional groups.

Fig. 5 Randomly selected screened molecules with their corresponding predicted log S values using MolGAT.

Table 1
Node and edge features used to encode the molecular graph for training the MolGAT model
Feature       Description                                           Size
Bond type     Single, double, triple, aromatic (one-hot encoding)   4
Conjugation   Is conjugated (binary encoding)                       1
Ring          Is a bond part of a ring (binary encoding)            1
Stereo        Z, E, any, none (one-hot encoding)                    6
Subtotal                                                            12
Total                                                               42
This journal is © the Owner Societies 2023 Phys. Chem. Chem. Phys.

Table 2
Trainable parameters in the MolGAT model for predicting solubility using the AqSolDB data set

Table 3
Descriptor-based and graph-based models performance to predict log S of organic molecules

Table 4
Predicted and experimental water solubility values with the MolGAT model for randomly selected molecules collected from the final external validation dataset (ref. 47)
Columns: SMILES, log S expt (mol L⁻¹), log S pred (mol L⁻¹)

Table 5
Predicted and experimental water solubility (log S) for the final external validation data (ref. 47) with the MolGAT, AttentiveFP, GAT, MPNN, and RF models

Table 7
Number of promising soluble and redox-active molecules screened in two stages for virtual screening with MolGAT model

Table 6
Randomly sampled screened molecules with their predicted log S values by MolGAT