Defining inkjet printing conditions of superconducting cuprate films through machine learning

The design and optimization of new processing approaches for the development of rare earth cuprate (REBCO) high temperature superconductors is required to make their fabrication cost-effective and to promote market implementation. The exploration of a broad range of parameters enabled by these methods is the ideal scenario for a new set of high-throughput experimentation (HTE) and data-driven tools based on machine learning (ML) algorithms, which are envisaged to speed up this optimization in a low-cost and efficient manner compatible with industrialization. In this work, we developed a data-driven methodology that allows us to analyze and optimize the inkjet printing (IJP) deposition process of REBCO precursor solutions. A dataset containing 231 samples was used to build ML models. Linear and tree-based (Random Forest, AdaBoost and Gradient Boosting) regression algorithms were compared, reaching performances above 87%. Model interpretation using Shapley Additive Explanations (SHAP) revealed the most important variables for each study. We could determine that, to ensure homogeneous CSD films of 1 micron thickness without cracks after the pyrolysis, we need average drop volumes of 190–210 pl and numbers of drops between 5000 and 6000, delivering a total volume deposited close to 1 μl.


S1. Framework for Machine Learning implementation
Machine learning is a multidisciplinary field comprising elements of mathematics, computer science and artificial intelligence, as well as the discipline of application (materials science in this case). 1 Its goal is to find patterns in existing data through the implementation of a set of tools (e.g. algorithms) to develop models that provide understanding of diverse phenomena and that can later be used to make informed decisions based on data. 2 For example, they enable the selection of the best processing parameters to enhance specific physical properties in materials.
The diversity of fields of application for machine learning has led to the definition of a basic framework for implementing machine learning tasks, which we describe here for our case: 2

• Data Collection: consists of acquiring the data from the different experiments, as well as from characterization techniques. In our case, it comprises the data from the inkjet printing variables defined in the Methodology section, together with the data obtained from optical microscopy images after deposition; the latter was only used to identify the quality (degree of homogeneity) of the depositions, not as input to the models. In addition, it must be noted that the range of parameters explored always ensured rather uniform depositions.
• Data Preparation: the process where the collected data is treated to correct errors such as missing values and inconsistencies, and where variables are combined or created. 2

• Exploratory data analysis: a step performed before building machine learning models that consists of representing the distribution of all variables and their relations. This provides a deep understanding of the characteristics of the dataset, as well as of the relations between the different features to be used in the models. The detailed analysis of the drop formation (AV, APL, Amine and ADV) and deposition (dx, dy, NoD and TVD) features employed in this study is reported in Sections S2 and S3.
An initial exploration of the relations between variables is usually performed by building a correlation matrix, which represents the linear relationships between pairs of variables through the Pearson correlation coefficient, defined as: 3

\rho_{X,Y} = \mathrm{cov}(X,Y) / (\sigma_X \sigma_Y)    (4)

where \mathrm{cov}(X,Y) is the covariance between the two variables and \sigma_X \sigma_Y is the product of their standard deviations. \rho indicates the magnitude of the correlation and takes values between -1 and 1. Values above 0 indicate that both variables are positively correlated, i.e., both variables increase (or decrease) together; for negative correlations, one variable rises while the other falls. Finally, when \rho = 0, there is no linear correlation between the variables. Pearson correlation coefficients are used to select the independent variables for the machine learning models by removing highly correlated variables, since the information they contain is largely redundant. This also helps to reduce computational costs without much loss of precision, especially on large datasets. 4

• Train/test splitting and cross-validation: a step prior to the development of machine learning models that consists of splitting the data into train and test datasets. This procedure is commonly used in machine learning to evaluate the precision after optimization of the model parameters, ensuring that models perform well on unseen data and thus preventing overfitting. 5 First, we shuffled the data to avoid a bias during splitting due to the order in which the data was introduced in the Excel file used as input. Then, the data was separated into feature or attribute variables (X) and the target or predicted variable (Y). Usually, a feature scaling step is performed by standardizing every variable so that its average value is 0 and its standard deviation is 1. This gives the same weight to all variables and prevents features with larger values from dominating the model. However, feature scaling is only required when algorithms are sensitive to such variations, which is not the case for ensemble methods. Therefore, it was not implemented in our workflow, so that we could retain information on specific values during the interpretation step described later.
Nevertheless, in the dataset repository we provide the same code and models with feature scaling applied, to demonstrate that similar results are obtained in terms of precision and feature importance.
Third, the data was randomly split with an 80/20 ratio into train and test datasets.
A random state was used in both the shuffling and train/test splitting steps to ensure the reproducibility of the model results. The train dataset was divided again, using k-fold cross-validation (CV), into 5 different subsets with the same 80/20 proportion. 6 This allowed us to evaluate and compare the different models on the CV datasets, as well as to select and optimize the parameters of the most adequate algorithm before conducting a final evaluation on the test dataset. 5 These preparation and splitting steps are sketched below.
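A minimal sketch of the steps above, assuming a pandas DataFrame loaded from a hypothetical Excel file with numeric columns named after the features used in this work and the NoD as an example target (the file name and random state are illustrative):

import pandas as pd
from sklearn.model_selection import train_test_split, KFold

df = pd.read_excel("ijp_dataset.xlsx")                           # hypothetical input file
df = df.sample(frac=1, random_state=42).reset_index(drop=True)   # shuffle the rows

corr = df.corr(method="pearson")                                 # Pearson correlation matrix
print(corr.round(2))
# One member of each highly correlated feature pair can be dropped here,
# since the information it contains is largely redundant.

X = df[["AV", "APL", "Amine", "ADV", "dx", "dy"]]                # features
y = df["NoD"]                                                    # target (NoD model)

# 80/20 train/test split with a fixed random state for reproducibility
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# 5-fold cross-validation splitter reused for model comparison and tuning
cv = KFold(n_splits=5, shuffle=True, random_state=42)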
• Algorithm selection: machine learning algorithms were picked based on the low number of observations in the experimental data. In this sense, we used ensemble methods, which work well with small datasets and aim to improve the predictive performance of a base algorithm (decision trees in our case) by following one of two approaches, bagging or boosting, to build our models. On the one hand, we employed the Random Forest (RF) regressor, which follows the bagging approach: a large number of decision trees is built in parallel on random subsamples of the train dataset and their outputs are averaged to obtain the final model. 7 On the other hand, the boosting approach (AdaBoost (AB) and Gradient Boosting (GB) regressors) 8,9 consists of the sequential implementation of decision trees fitted iteratively, aiming to minimize the error at each step and obtain a final model with low bias. The basic difference between AB and GB is that the former updates the weights assigned to each observation of the train dataset at every step, while the latter updates the values of the observations used to construct the next model. 10 Furthermore, the optimization of tree-based models is very simple in comparison with other algorithms due to the low number of parameters to tune. Finally, we compared the precision of the models developed with ensemble methods against linear regression models, which are based on minimizing the residual sum of squares between the values of the target variable in the train dataset and the predicted ones. A comparison of these four regressors is sketched below.
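The following sketch, reusing X_train, y_train and cv from the previous one, compares the four regressors via 5-fold CV; the hyperparameter settings are scikit-learn defaults, not the tuned values reported later:

from sklearn.linear_model import LinearRegression
from sklearn.ensemble import (RandomForestRegressor, AdaBoostRegressor,
                              GradientBoostingRegressor)
from sklearn.model_selection import cross_val_score

models = {
    "Linear": LinearRegression(),
    "RF": RandomForestRegressor(random_state=42),
    "AB": AdaBoostRegressor(random_state=42),
    "GB": GradientBoostingRegressor(random_state=42),
}

# Mean and spread of the R² score over the 5 cross-validation folds
for name, model in models.items():
    scores = cross_val_score(model, X_train, y_train, cv=cv, scoring="r2")
    print(f"{name}: mean CV R2 = {scores.mean():.3f} +/- {scores.std():.3f}")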
• Model interpretation: to interpret the models, we used Shapley Additive Explanations (SHAP), which represent a prediction as the combination of a set of simplified input variables in binary form (0 = not present, 1 = present) by following the transformation: 12,13

g(z') = \phi_0 + \sum_{i=1}^{|F|} \phi_i z'_i

These simplified features are mapped to the original data using a mapping function x = h_x(z'). \phi_0 is the output value with no input variables, and \phi_i is the effect of each variable i on a prediction, also known as the SHAP value. The SHAP value can be calculated using the Shapley values formula for a model with a prediction function f and F variables: 12,13

\phi_i = \sum_{S \subseteq F \setminus \{i\}} \frac{|S|! \, (|F| - |S| - 1)!}{|F|!} \left[ f_{S \cup \{i\}}(x_{S \cup \{i\}}) - f_S(x_S) \right]

where |F| is the total number of variables and S is any subset of variables from F that does not include the i-th variable.

The code and Jupyter Notebooks used for the different steps described above, as well as the figures and models, are freely available at http://dx.doi.org/10.20350/digitalCSIC/14016.
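As an illustration of how such SHAP values can be obtained in practice (a sketch assuming the shap library and the data from the earlier sketches; TreeExplainer is the shap interface for tree ensembles):

import shap
from sklearn.ensemble import RandomForestRegressor

# Fit a tree-based model (illustrative defaults) and explain its predictions
model = RandomForestRegressor(random_state=42).fit(X_train, y_train)
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_test)   # phi_i for every sample and feature

# Global importance view: mean |SHAP value| per feature across all samples
shap.summary_plot(shap_values, X_test)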

S2. Descriptive statistics of the dataset characteristics
The experimental data has been classified into the variables that influence the drop formation, which are mainly related to solution and equipment parameters, and the variables that define the deposition on the substrates. In this section, we analyze the characteristics of the different features.

S2a. Analysis of drop formation variables
The features involved in the drop formation are the Average Voltage (AV), Average Pulse Length (APL) and Amine (Methodology section of the main manuscript). We also analyze the Average Drop Volume (ADV), which results from their combination, although it could also be considered a deposition variable since it has an impact on the deposition result. The distribution of these attributes was analyzed by representing the data in histograms and box plots (Figure S1). The values of AV used in the experiments ranged from 95 to 234 V (Table S1) and the data follows a rather normal distribution centered around 150 V, the voltage value with the highest number of counts (60 counts). In addition, most of the experimental voltages lie between 136 and 155 V and have 10 to 20 counts (Figure S1a). The mean AV value, 149.87 ± 22.01 V, is very close to the median, indicating an evenly distributed variable (Table S1). The box plot shows that voltages below 100 V, as well as those of 220 V and above, are outliers, since they are far from and isolated from the main distribution (diamond symbols in Figure S1b). Therefore, these values will not be significant in building the machine learning models. In addition, most of the voltages used are in the medium range of the working conditions for the nozzles, since the equipment has a maximum applied voltage of 255 V. It is worth noting that the requirement for such voltages is largely determined by the solution, as well as by its concentration and viscosity, which are strongly affected by the percentage of amine added to it. 16 Furthermore, the higher voltage values are attributed to a specific nozzle, as seen in the raw dataset file (http://dx.doi.org/10.20350/digitalCSIC/14016). The experimental APLs employed are distributed in a range from 20 to 30 μs (Table S1), although values between 25 and 26 μs, with 50 to 80 counts, were mainly used (Figure S1a). A few outliers were found below 24 μs and at 30 μs (Figure S1b). This shows that the experimental variation of the APL is not very significant in the dataset. Thus, given the small change in values, it should have a lower relevance than the AV in the drop formation and in the design of the machine learning models.
The percentage of amine was explored within the range from 0 to 2.3 % with a mean value of 1.66 ± 0.41 % (Table S1). The data distribution is highly unbalanced, since most solutions contain 2 % or 1.14 % amine, with 118 and 76 counts, respectively (Figure S1a). The remaining percentages employed in the solutions have 10 or fewer counts. The box plots show that the values of amine employed are in the high range and that no outliers are found for this feature (Figure S1b). Therefore, it can be expected that the percentage of amine will not have a significant contribution to the models.
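As a purely illustrative sketch (assuming the DataFrame df from the earlier sketch), the histograms, box plots and summary statistics of this section can be reproduced as follows:

import matplotlib.pyplot as plt

features = ["AV", "APL", "Amine", "ADV"]
print(df[features].describe())                   # mean, std and quartiles (cf. Table S1)

fig, axes = plt.subplots(2, len(features), figsize=(14, 6))
for j, col in enumerate(features):
    axes[0, j].hist(df[col].dropna(), bins=20)   # histograms (cf. Figure S1a)
    axes[0, j].set_title(col)
    axes[1, j].boxplot(df[col].dropna())         # box plots with outliers (cf. Figure S1b)
plt.tight_layout()
plt.show()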

S2b. Analysis of deposition variables
The variables from the dataset involved in the deposition are the dx, dy, NoD and TVD (Methodology section of the main manuscript). The ADV could also be considered a deposition variable, since it experimentally defines which inkjet printing patterns will be used. Figure S2 shows the distribution of these variables through histograms and box plots, and their descriptive statistics are summarized in Table S1. The box plot in Figure S2b shows that one outlier was measured for the NoD, corresponding to a value of 10000 drops.
Finally, the TVD was calculated by considering the ADV and NoD, as explained in the Methodology section of the main manuscript. This variable follows a rather normal distribution centered around 1 μl and has a mean value of 0.99 ± 0.20 μl (Table S1).
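As an illustrative consistency check (assuming the TVD is simply the product of the ADV and the NoD), an ADV of 200 pl combined with 5000 drops gives 200 pl × 5000 = 1 × 10⁶ pl = 1 μl, in line with the mean TVD reported above.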

S3. Linear relationships between variables
In this section, we present the linear relationships between the different features that are considered important for obtaining homogeneous depositions without liquid movement, which will later lead to crack-free films after the pyrolysis process. Figure S3a and Figure S3b show scatter plots of the AV and APL with the ADV. Figure S4 displays the scatter plot between the dx and dy variables that define the IJP grid size printed on the substrates. We can see some scattering around certain data points, particularly at a dx of 50 μm. In addition, the linear relationship between both variables is strong and negative, with a correlation coefficient of -0.89 (Figure 2). This relationship stems from the experimental requirements that enable us to deposit continuous films on the substrates. Other relations could have been obtained if this restriction was not considered, but they would not provide useful information towards our aim of optimizing the IJP deposition process. Figure S5a and S5b show the scatter plots for the dx and dy with the NoD. We can see that there is a large dispersion in the data, which leads to a weak negative linear coefficient (ρ = -0.39) with the dx and a slightly positive one (ρ = 0.05) with the dy (Figure 2). Figure S5c and S5d show the scatter plots for the AV and APL with the NoD and their linear regressions. In this case, the data is highly spread around the most common values, 136–155 V for the former and 25–26 μs for the latter, which leads to weak, negative linear relations (correlations of -0.11 and -0.22) (Figure 2). Figure S5e and S5f display the scatter plots for the Amine and ADV with the NoD and their linear regressions. The former shows scattering for Amine values of 1.14 and 2 %, while the data is more evenly distributed in the latter.
Both linear regressions are quite strong and negative, with coefficients of -0.47 and -0.62 (Figure 2). It must be considered that variables with low correlation coefficients show a large dispersion, which implies a weak or not statistically significant linear relationship between these variables and the NoD. Figure S6f and S6g present the scatter plots for the dx and dy with the TVD and their linear regressions. The data is scattered over a broad range of values, although the dx has more data points in the range between 50 and 90 μm. The linear regression is rather strong and negative for the dx, while it is weak and positive for the dy. It must be considered that the large dispersion in the data for variables with low correlation coefficients means that the linear relationship between them and the TVD is not statistically significant. How such pairwise coefficients and regressions can be computed is sketched below.
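A minimal sketch of such a pairwise analysis (assuming the DataFrame df from the earlier sketches; scipy's pearsonr returns the coefficient ρ together with a p-value quantifying statistical significance), shown here for the dx–dy pair:

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import pearsonr

x, y_ = df["dx"].to_numpy(float), df["dy"].to_numpy(float)
rho, pval = pearsonr(x, y_)                  # e.g. rho close to -0.89 for dx vs dy
slope, intercept = np.polyfit(x, y_, 1)      # least-squares regression line

plt.scatter(x, y_, alpha=0.5)
xs = np.linspace(x.min(), x.max(), 100)
plt.plot(xs, slope * xs + intercept, "r-")
plt.xlabel("dx (μm)"); plt.ylabel("dy (μm)")
plt.title(f"rho = {rho:.2f} (p = {pval:.1e})")
plt.show()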

S4. Additional information for the model Nº of drops (NoD)
In this section, we present additional information related to the model developed for the variable NoD.
As mentioned in the main manuscript, we chose the RF algorithm due to its reliability in preventing overfitting and capturing non-linear interactions. In addition, the optimization of RF requires tuning very few hyperparameters. We mainly focused on the Nº of estimators, the Minimum Sample Split and the Maximum Depth, and used 5-fold CV to maximize the average R² (i.e., minimize the average root mean squared error, RMSE). We then evaluated these metrics on the validation set (Figure S7). First, we tuned the Nº of estimators by exploring a range from 0 to 500 with steps of 10 (Figure S7a) and obtained a maximum R² (minimum RMSE) around 30 estimators. The Minimum Sample Split was then explored from 2 to 10 with increments of 1, using the previous value for the Nº of estimators; in this case, the optimum value is found for a Minimum Sample Split of 3 (Figure S7b). Finally, we optimized the Maximum Depth within a range from 2 to 20 with steps of 1, obtaining an optimum value of approximately 10 (Figure S7c). Afterwards, we selected the best parameters and retrained the model on the whole train set (including the validation part). Figure S8 shows these results, also comparing the RMSE and R² for the train and test datasets before and after removing the unimportant variables from the model for the NoD. With the full dataset, both metrics are better for the train set than for the test set, with RMSE values of 175.84 and 288.83, respectively (Figure S8a), while the R² values are 98% and 94% (Figure S8b). For the model with only the important variables (reduced dataset), the train and test RMSE for the NoD are slightly lower, 166.56 and 267.31, respectively, whereas the precision remains the same for the train set (98 %) and increases slightly for the test set (95 %). The improvement between validation and test scores is likely caused by the validation metrics being averaged over the 5 validation sets split from the original train set during cross-validation, while for the test set we have only a single metric. A sketch of this sequential tuning procedure is given below.
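A sketch of this sequential, one-hyperparameter-at-a-time search (reusing X_train, y_train and cv from the earlier sketches; the helper function is illustrative, and the scan starts at 10 estimators since n_estimators must be at least 1):

import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

def best_value(param, values, fixed):
    # Return the value of `param` maximizing the mean 5-fold CV R²,
    # with the previously tuned hyperparameters held fixed.
    means = [cross_val_score(RandomForestRegressor(random_state=42,
                                                   **{**fixed, param: v}),
                             X_train, y_train, cv=cv, scoring="r2").mean()
             for v in values]
    return values[int(np.argmax(means))]

fixed = {}
fixed["n_estimators"] = best_value("n_estimators", range(10, 501, 10), fixed)
fixed["min_samples_split"] = best_value("min_samples_split", range(2, 11), fixed)
fixed["max_depth"] = best_value("max_depth", range(2, 21), fixed)
print(fixed)   # per the text: about 30 estimators, a split of 3 and a depth near 10

# Retrain on the whole train set with the best parameters and evaluate on the test set
final_model = RandomForestRegressor(random_state=42, **fixed).fit(X_train, y_train)
print("Test R2:", round(final_model.score(X_test, y_test), 3))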

S5. Additional information for the model Total Volume Deposited (TVD)
In this section, we present additional information related to the model developed for the variable TVD. As in the NoD case, we optimized the model hyperparameters, focusing on the Nº of estimators, the Minimum Sample Split and the Maximum Depth, also carrying out 5-fold cross-validation and maximizing the average R² (minimizing the RMSE) by evaluating the model on the validation set (Figure S11). First, we tuned the Nº of estimators by exploring a range from 0 to 500 with steps of 10 (Figure S11a) and obtained a maximum R² (minimum RMSE) around 120 estimators. The Minimum Sample Split was then explored from 2 to 10 with increments of 1, using the previous value for the Nº of estimators; in this case, the optimum value is found for a Minimum Sample Split of 3 (Figure S11b). Finally, we optimized the Maximum Depth within a range from 2 to 20 with steps of 1, obtaining an optimum value of 10 (Figure S11c).