Random projections and Kernelised Leave One Cluster Out Cross-Validation: Universal baselines and evaluation tools for supervised machine learning for materials properties

With machine learning being a popular topic in current computational materials science literature, creating representations for compounds has become commonplace. These representations are rarely compared, as evaluating their performance - and the performance of the algorithms that they are used with - is non-trivial. With many materials datasets containing bias and skew caused by the research process, leave one cluster out cross validation (LOCO-CV) has been introduced as a way of measuring the performance of an algorithm in predicting previously unseen groups of materials. This raises the question of the impact, and control, of the range of cluster sizes on the LOCO-CV measurement outcomes. We present a thorough comparison between composition-based representations and investigate how kernel approximation functions can be used to better separate data to enhance LOCO-CV applications. We find that domain knowledge does not improve machine learning performance in most tasks tested, with band gap prediction being the notable exception. We also find that the radial basis function improves the linear separability in all 10 chemical datasets tested, and provide a framework for the application of this function in the LOCO-CV process to improve the outcome of LOCO-CV measurements regardless of machine learning algorithm, choice of metric, and choice of compound representation. We recommend kernelised LOCO-CV as a training paradigm for those looking to measure the extrapolatory power of an algorithm on materials data.


Introduction
Recent advances in materials science have seen a plethora of research into the application of machine learning (ML) algorithms. Much of this research has focused on supervised ML methods, such as random forests (RFs) and neural networks. More recently, authors have laid out best practices to help unify and progress this field [1][2][3][4] .
Data representation can play a large role in the performance of ML algorithms; however, the optimum choice of representation is not always apparent. In materials science it is often difficult to choose an appropriate representation due to variability in the ML task and in the nature of the chemistry, composition and structures of the materials studied. Additionally, some properties of a material, such as its crystal structure in the case of crystalline materials, may not be known until its synthesis. Accordingly, many studies derive representations from the ratios of elements in the chemical composition, from domain knowledge-based properties (referred to as features) of these elements, or both, in a process called "featurisation".
Given the ubiquity of featurisation methods such as those presented here in materials applications, it is important to evaluate the statistical advantage of specific feature sets 5 . Section 2.1 overviews different featurisation techniques and how their effectiveness has been previously reported. We expand on this evaluation in section 3.1, in which seven representations are investigated across five case studies from the literature to explore how these representations perform in published ML tasks. These cases thus represent practical applications, rather than constructed tasks. Each of these representations is also compared to a random projection of equal size to establish the performance benefit of domain knowledge over random noise.
Evaluating the generalisability of ML models is a known challenge across data science, and is of particular concern in materials science, where data sets are of limited size compared with other application areas for ML, and are often biased towards historically interesting materials or those closely related to known high-performance materials for certain performance metrics. Typically, models are evaluated on test sets separate from their training data, through a consistent train-test split or N-fold cross validation. However, this does not consider skew in a dataset. In chemical datasets, families of promising materials are often explored more thoroughly than the domain as a whole, which introduces bias and reduces the generalisability of ML models, because the data they are trained and tested on are not sampled in a way representative of the domain of target chemistries to be screened with these models.
Leave one cluster out cross validation (LOCO-CV) was suggested to combat this 6 , using K-means clustering to exclude similar families of materials from the training set to measure the extrapolatory power of an ML algorithm (its ability to predict the performance of materials with chemistries qualitatively different from the training set). The value of such an approach can be seen in the case of predicting new classes of superconductors. One may choose to remove cuprate superconductors from the training set, and if an ML model can then successfully predict the existence of cuprate superconductors without prior knowledge of them, we can conclude that that model is likely to perform better at predicting new classes of superconductors than a model which could not predict the existence of cuprate superconductors. LOCO-CV provides an algorithmic framework to measure the performance of models on predicting new classes of materials by defining these classes as clusters found by the K-means clustering algorithm. Application and implementation of this algorithm is discussed further in section 2.2.1.
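The LOCO-CV procedure described above can be sketched in a few lines of scikit-learn. This is a minimal illustration, not the original implementation; the function name, cluster count, and model defaults here are our own assumptions:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score

def loco_cv(X, y, n_clusters=5, random_state=0):
    """Leave-one-cluster-out CV: cluster the data with K-means, then
    hold out each cluster in turn as the test set."""
    labels = KMeans(n_clusters=n_clusters, n_init=10,
                    random_state=random_state).fit_predict(X)
    scores = []
    for c in range(n_clusters):
        train, test = labels != c, labels == c
        model = RandomForestRegressor(random_state=random_state)
        model.fit(X[train], y[train])
        scores.append(r2_score(y[test], model.predict(X[test])))
    return scores
```

Note that, as discussed below, the representation used for clustering need not be the same as the representation passed to the model; here both use `X` for brevity.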
While differences in cluster sizes in this domain are expected, it has been observed that clusters found with K-means can differ in size by orders of magnitude 7 , which can pose a practical challenge to adoption of this method. With such differences in cluster size, LOCO-CV measurements can represent the performance of an algorithm on a small training set rather than the performance of an algorithm in extrapolation. As representation plays a role in clustering, it is pertinent to investigate the issues of representation and clustering together, even though the representation used in clustering does not need to be the same as that used to train the model (fig. 3). In section 3.2 we investigate how representations can affect measurements made with LOCO-CV. Kernel methods (also known as kernel tricks, or kernel approximation methods) can be used to non-linearly translate data into a data space that can then be linearly separated (fig. 2). We apply kernel methods such as the radial basis function (RBF) to chemical datasets to improve the linear separability of data and reduce variance between cluster sizes, and thus increase the validity of LOCO-CV measurements (figs. 5 and 8), thereby enhancing the assessment of performance found when using different representations as well as assessment of model performance as a whole.
LOCO-CV evaluation is affected by the representation of a compound and, conversely, the choice of compound representation is affected by the methods used to evaluate these representations. Thus, it is pertinent to investigate these two issues simultaneously. We improve the utility of LOCO-CV measurements by using kernel functions to create a more separable data space, and use these measurements to evaluate featurisation methods using practical supervised ML tasks found in the literature. The key contributions of this paper are as follows:
• Further comparison between composition-based feature vectors by comparing performance measured when using different featurisation methods on practical tasks (explained further in section 2.1 before being carried out in section 3.1).
• Examining the effectiveness of random projections as featurisation methods, as a baseline to justify more involved featurisation methods against (explained further in section 2.1.2 before being carried out in section 3.1).
• Novel application of kernel methods to materials data (explained further in section 2.3 before being carried out in section 3.2).
• Studying the effect of kernel approximation functions on the application of K-means clustering to materials data and presenting a workflow to incorporate these methods into the LOCO-CV algorithm (section 3.2).
• We recommend using RBF when clustering for LOCO-CV, as clusterings found after application of RBF are seen to be more even in size than with no kernel method applied, and to give more reliable model convergence. This helps to reduce the risk that performance differences on predicting an unseen cluster of data are caused by the training set size as opposed to the intrinsic inability of a model to perform well on that cluster of data.
• We note that use of the radial basis function (RBF) in clustering for LOCO-CV results in models converging (i.e. learning trends from the data to be able to make predictions with some degree of reliability) more often than when using LOCO-CV without any kernel methods.
• We suggest that random projections are used as a baseline against which to compare engineered feature vectors, noting that commonly used CBFVs have little to no advantage over random projections in most tasks tested here.
• We experiment with the use of random projections in clustering for LOCO-CV, finding them to have no clear advantage over other CBFVs tested in this task.

Common representations used in machine learning in inorganic chemistry
ML algorithms require a consistent definition of a data point in order to analyse trends within a dataset. For example, it would be hard to learn from a dataset in which "a data point" may refer to a phase field, a specific crystal structure, or a composition. One such algorithm is the RF, which is widely used in materials science as well as other domains 8 . RFs are fast to train, readily implemented 9 , and achieve good performance in a plethora of tasks without hyperparameter tuning. We use RFs for our investigations for the reasons outlined above; however, good evaluation methods for fixed-dimensional representations of materials are also important for the plethora of other ML algorithms that use such representations as the basis for predictions.
Representation learning and feature engineering are the two main preprocessing methods for making data more interpretable to ML algorithms. Representation learning is a fast-evolving field that uses deep learning to create representations, while feature engineering involves defining a set of features (or descriptors) for a data point that adequately encapsulates all information needed 10 .
Feature engineering has been used extensively in inorganic chemistry and materials science. However, no set of features has emerged as the clearly dominant representation for a material, likely due to the variety of tasks carried out in these domains, which may require different input representations. Many of these representations use only composition-based information (rather than structural), as this allows screening of materials without the need for DFT calculations or synthesis, greatly reducing the costs associated with such screenings.

Fig. 1: (a) Application of an aggregation function to each property of a material results in a fixed-size vector for each aggregation function; these are then concatenated (merged sequentially) to form the final CBFV. Both the properties in the CBFV and the list of aggregation functions can be changed to create variants of CBFVs, which may influence algorithms that use the resulting CBFV. (b) Calculation of the weighted sum of properties of a material. This is equivalent to the matrix multiplication of the fractional representation of that material and its properties. (c) Calculation of a random projection. A random projection (approximately) linearly projects a representation into a different number of dimensions (N). The original M-dimensional representation for our purposes may be a fractional representation of the chemical composition of a material, but this technique can be used for any input data, including in domains outside of chemistry.

Composition-based screening is less powerful than the incorporation of structure, as both structure and composition control properties, but more general, as structural information is not required and is less widely available than composition (structure is not known until the material is realised by synthesis, whereas compositions can be proposed without knowing structure). Composition-based feature vectors (CBFVs), which offer a list of compositional attributes of a material, and one-hot style (also called fractional) encodings of composition 11 are widely used composition-based representations.
Notable CBFVs, including Magpie, Oliynyk and JARVIS 12-14 (differences between which are discussed further in section 3.1), were recently investigated and found to provide benefit over one-hot style representations. This benefit was measured using neural networks predicting numerous properties; however, the benefit became little to none as the dataset size increased above 1000 points 5 .
We further the investigation into the use of CBFVs by examining their applicability in five case studies. Namely, we examine performance using Oliynyk, Magpie, and JARVIS, a variant of random projection of size 200 (discussed more in section 2.1.2) used in the previous review on this topic 5 , as well as one-hot style encodings of composition, and random linear projection of the composition. The performance of RFs using different representations is compared on ML tasks found in the literature, using the relevant datasets for each study [15][16][17][18][19] .
The representations were chosen as they are commonly used, and as these are the non-structural representations investigated for their efficacy in neural networks in previous work 5 . Seeing whether previous results hold for RFs should help gauge whether these results could be used as rule of thumb for many ML algorithms or whether these conclusions should only be applied to neural networks similar to those used in that study.

Can implementation details in CBFVs affect performance?
It is common for a CBFV to be comprised of a list of elemental properties that are combined using several "aggregation functions", for example the weighted average, and standard deviation of various elemental properties in a compound ( fig. 1a). The aggregation functions of a CBFV can vary between implementations 5,12 . Using different numbers of aggregation functions results in representations of different lengths ( fig. 1a), which may affect ML performance depending on the algorithm being used.
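As an illustration, a CBFV built from four aggregation functions (weighted average, maximum, minimum, and standard deviation) can be assembled from an elemental property table as follows. The two-property table here is a toy stand-in, not a real feature set such as Magpie:

```python
import numpy as np

# Toy elemental property lookup (values illustrative only):
# [atomic weight, covalent radius]
PROPS = {"Na": [22.99, 1.66], "Cl": [35.45, 1.02]}

def cbfv(composition):
    """Concatenate aggregation functions applied over the elemental
    properties of a composition, e.g. {"Na": 1, "Cl": 1} for NaCl."""
    elements, amounts = zip(*composition.items())
    P = np.array([PROPS[e] for e in elements])   # (n_elements, n_props)
    w = np.array(amounts, dtype=float)
    w /= w.sum()                                 # fractional weights
    parts = [w @ P,                              # weighted average
             P.max(axis=0), P.min(axis=0), P.std(axis=0)]
    return np.concatenate(parts)                 # fixed-size CBFV

vec = cbfv({"Na": 1, "Cl": 1})  # 2 properties x 4 aggregations = 8 features
```

Adding or removing aggregation functions changes the length of the resulting vector, which is exactly the implementation variability discussed above.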
Problems associated with building statistical models using increasingly large data representations without also increasing the number of data points are well documented, often being described as the curse of dimensionality 20 . Strong correlation between different dimensions (known as co-linearity, or cross correlation between dimensions) can also impact model performance. For example, RFs are affected by co-linearity between dimensions, as the RF's random bagging process is unlikely to select a subset of features that includes none of a set of cross-correlated features. This makes the information in features with such cross-correlates more likely to be available to discriminate with at any branch in a tree, compared with those features without such cross-correlates. It is intuitive that different aggregation functions may be cross-correlated; for example, the maximum atomic weight of an element in a compound is likely to correlate with the average atomic weight of an element in that compound, and thus RFs may be affected by additional aggregation functions. Without investigation, it is unclear what effect different aggregation functions will have on algorithm performance. Interrogation of the repository associated with the previous review of featurisation methods indicates use of the weighted average, sum, range, and variance of each feature 5 . This includes the features of the fractional (one-hot style) representation, which uses only the ratios of each element in a material in its definition. This implementation difference could affect the performance of a model that uses these representations, so we distinguish between the two, using "fractional" to refer to a one-hot style encoding that includes the average, sum, range, and variance of each element, and "CompVec" (for composition vector) to refer to an implementation of one-hot style encoding which contains just the ratios of elements in a compound.
The nature of the fractional representation means that a given compound's representation would contain the same information three times, scaled by different amounts (depending on the number of elements in the compound) in a single vector (four times if elements in a compound are in equal ratios). This can be exemplified by examining a simple composition such as NaCl (table 1). This offers an opportunity to investigate how increasing the dimensionality (the number of dimensions) of a representation while adding no new information affects performance. We leave the investigation of the effect of information added by different aggregation functions on different feature sets to future work. We experiment using both a (CompVec) one-hot style encoding as proposed for use with ElemNet 11 (with no additional aggregation functions), and the one-hot style approach used previously that includes different aggregation functions (fractional) 5 , to see how this increase in dimensionality will affect experiments.
While this increase in dimensionality will be seen to affect the clusterings found with K-means clusterings, for most tasks investigated there was not an appreciable difference between CompVec and fractional representations. In band gap prediction tasks fractional representation outperformed CompVec, however in regression tasks relating to bulk metallic glass formation this trend was reversed ( fig. 4).

Random Vectors as featurisation methods
Each elemental property (for example covalent radius) aims to bring with it some sort of information about that element. That property's inclusion in a feature set aims to improve an ML algorithm's performance in a given problem. Every feature included either means an increase to the dimensionality of a CBFV or the exclusion of an alternative feature. Though the importance of a feature to an ML model can be measured 21,22 , it is hard to take such measures of feature importance out of the context of the model that is trained with it, or the dataset that the model is derived from 23 .
As it is hard to distinguish the effects of the dimensionality of a representation from the effects of the information imbued in it, Murdock et al. introduce a set of vectors, one for each element, each consisting of 200 random numbers to represent nonsensical elemental properties. From these vectors, they derive the CBFV RANDOM_200 to represent a lower bound for feature performance. That is to say, rather than using features that would be expected to give information about an element (covalent radius, atomic number, etc.), they instead assign each element a vector of random numbers. If these random numbers can result in a well-performing model, then whether the chemically derived features that are commonplace in the literature are justified can be called into question. When the aggregation function is a weighted sum (discussed further in section 2.1.1), this has the same effect as a matrix multiplication of the one-hot style encoding of a compound's formula, C (referred to in this paper as CompVec), and a random matrix, R, which can be noted as C · R (fig. 1b). Thus, the weighted sum part of RANDOM_200 can be seen as a matrix multiplication of the random vectors and the fractional encoding of the composition.
This matrix multiplication is similar to that used in a random projection. Random projection is a dimensionality reduction technique that exploits the observation that, in high dimensions, random vectors approach orthogonality 24,25 . When the columns of R are normalised to be unit vectors, C · R becomes an approximately linear projection of C. Another way to closely approximate normalisation of the columns of a random matrix such as R is to sample the values of that matrix from a Gaussian distribution of mean 0 and variance 1/N (∼ N(0, 1/N)), where N is the size of the projection. This is mathematically justified by the Johnson-Lindenstrauss lemma, which states that for a set of N-dimensional data points there exists a linear mapping that will embed these points into an n-dimensional data space while preserving distances between data points within some error value, ε; this value of ε is shown to decrease as n increases 26 . RANDOM_200, which samples from ∼ N(0, 1), also included aggregation functions (namely sum, range, and variance) 5 , as discussed in section 2.1.1. It is unclear what impact this will have; however, preliminary investigations show little difference in performance between sampling from ∼ N(0, 1) and ∼ N(0, 1/N). We investigate the use of random projection as an alternative to more widely used techniques by comparing each technique investigated to a random projection of the same size (fig. 4). This should allow us to note improvements made by the quality of features as opposed to the quantity. We include RANDOM_200 in this investigation, noting the key differences between this and the random projection being that the random numbers are drawn from different distributions (as outlined above) and that RANDOM_200 includes aggregation functions, where a random projection does not.
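A random projection of the kind described above takes only a few lines of NumPy; sampling R from N(0, 1/N) approximately preserves pairwise distances, per the Johnson-Lindenstrauss lemma. This is a sketch, with the function name ours:

```python
import numpy as np

def random_projection(C, n_out, seed=0):
    """Project an (n_samples, M) matrix C down (or up) to n_out dimensions
    using a Gaussian random matrix with entries drawn from N(0, 1/n_out)."""
    rng = np.random.default_rng(seed)
    R = rng.normal(0.0, np.sqrt(1.0 / n_out), size=(C.shape[1], n_out))
    return C @ R
```

For a CompVec input, `C` is the one-hot style composition matrix, so `C @ R` reproduces the weighted-sum construction of section 2.1.2 with random "elemental properties" as the rows of `R`.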

Training methods for materials science
Performance metrics are usually applied to a test set of data unseen by a model. Where data are scarcer, or computation time is not limiting, N-fold cross validation can be used. This is often referred to as K-fold cross validation, but we use N to avoid confusion with K-means clustering, a more central algorithm to this work. N-fold cross validation randomly splits data into N equal-sized random "folds"; N models are then trained, each on all but one of the folds of data, and evaluated on the fold which is held out. Performance is then averaged. A common criticism of supervised ML in materials science is that the datasets being worked with are inherently biased. Bias in data is a problem more broadly in ML research. In this field, exploration of similar, promising chemistries for particular applications leads to some areas of the chemical data space being more dense with successfully synthesised (or DFT-calculated) materials than others.
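For reference, the N-fold procedure just described looks as follows with scikit-learn; the dataset here is a random placeholder, not one of the case-study datasets:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)
X = rng.random((100, 8))              # placeholder features
y = X.sum(axis=1)                     # placeholder target
fold_scores = []
for train_idx, test_idx in KFold(n_splits=5, shuffle=True,
                                 random_state=0).split(X):
    model = RandomForestRegressor(n_estimators=50, random_state=0)
    model.fit(X[train_idx], y[train_idx])
    fold_scores.append(mean_absolute_error(y[test_idx],
                                           model.predict(X[test_idx])))
mean_mae = float(np.mean(fold_scores))  # performance averaged over N folds
```

LOCO-CV differs from this only in how the folds are chosen: clusters of similar materials rather than random, equal-sized splits.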
This leads to inflated performance metrics as performance can only be measured against other compounds that have already been synthesised (or compounds with relevant DFT calculations), as many such compounds in the test set will have similar chemistry to the training set. This can lead to comparatively poor results when trying to extrapolate to predict properties for chemistries dissimilar to those that the algorithm has been trained on. For example, it could be argued that the entries in ICSD reflect a bias towards the development of both analogues of the chemistry of minerals and chemistries lending themselves to specific types of application performance, rather than an isotropic exploration of chemical space constrained only by the inorganic chemistry of the elements themselves. Such considerations emphasise the importance of discovery synthesis that accesses new regions of chemical space, as the resulting materials can contribute to more robust models. Having robust methods to measure model performance is pertinent for materials discovery to assess likely model effectiveness in extrapolating to unseen areas of the input domain.

Leave one cluster out cross validation (LOCO-CV)
Leave one cluster out cross validation (LOCO-CV) was proposed as a method to measure the extrapolatory power of an algorithm. LOCO-CV alters N-fold cross validation so that each fold contains materials in the same cluster, rather than randomly selected (equal-sized) folds, in order to emulate performance on unseen classes of materials.
Clusterings are selected using the K-means clustering algorithm 27,28 , which infers K clusters without the need for target labels. This is done by grouping data into clusters based on their Euclidean distance to K randomly chosen "centroids". The centroids are then redefined as the mean of all points in a cluster and the data are regrouped based on these new centroids. This process is repeated until the positions of centroids (or the contents of their associated clusters) converge. K-means is quick, robust, and readily implemented 9 . LOCO-CV does, however, leave representation as a hyperparameter to the clustering (i.e., changing the representation will change the clusterings found with K-means clustering), and the stochastic nature of the K-means algorithm can make measurements hard to reproduce without publishing the clusters found. A further consideration in the use of LOCO-CV is that K-means does not guarantee the size of any cluster, nor does it guarantee that clusters would be deemed chemically sensible (this is discussed further in section 2.4). It has been observed that clusters taken on materials data can vary in size by multiple orders of magnitude, which hinders the application of LOCO-CV 7 .
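The iteration described above (assign each point to its nearest centroid, recompute centroids as cluster means, repeat) can be written out directly. This is a bare-bones Lloyd's algorithm for illustration, not the scikit-learn implementation used in practice:

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Plain Lloyd's K-means: alternate assignment and centroid update
    until the cluster assignments stop changing."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)].copy()
    labels = np.full(len(X), -1)
    for _ in range(max_iter):
        # Distance of every point to every centroid: (n_samples, k)
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = d.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break                      # assignments converged
        labels = new_labels
        for c in range(k):
            if np.any(labels == c):    # guard against empty clusters
                centroids[c] = X[labels == c].mean(axis=0)
    return labels, centroids
```

Nothing in this loop constrains cluster sizes, which is the root of the uneven-cluster problem discussed here.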
While different sizes of clusters are to be expected in this domain (for example due to research bias in the generation of example materials), should the sizes of the clusters found in LOCO-CV differ by orders of magnitude then LOCO-CV's ability to measure extrapolatory power is hampered. Intuitively if one of ten clusters contains 90% of the materials in the dataset, then a measurement made with this cluster left out may give a measurement of algorithmic performance given a small fraction of the available training data, rather than indicating extrapolatory power. K-means clustering by its nature can only linearly separate clusters in a given data space. Clusters that are more distinct from one another are more likely to be isolated than clusters of data points that overlap with each other. There are other clustering algorithms, such as t-distributed stochastic neighbour embedding 29 , agglomerative clustering 30 , or DBSCAN 31 , that could be explored for LOCO-CV applications on materials datasets. We measure the separability of clusters of compounds in materials science datasets with K-means clustering.

Kernel Methods
While uneven cluster sizes do pose problems for LOCO-CV assessment of the extrapolatory power of ML models, such issues with K-means clustering are not solely found in materials science. K-means clustering attempts to linearly separate clusters (i.e. draw a straight line between them), and some clusters cannot be separated this way (fig. 2). In many cases, applying a non-linear function to every point in the dataset transforms the data in such a way that clusters can be linearly separated 32 . Functions used to preprocess data in this way are called kernel methods (also kernel approximation methods or kernel tricks). Prominent examples include RBF 32 , additive χ² 33 , and skewed χ² 33 . We look at the first of these in more detail to illustrate how such kernel methods affect data points. The RBF can be defined as:

K(x, x′) = exp(−γ ‖x − x′‖²)

where γ is a hyperparameter, which was set to 1 throughout this study (1 is the default for this hyperparameter in the library used). Here x ∈ D, where D is a dataset of materials, each represented by a feature vector in R^n, where n is the dimensionality of the feature set. Examination of this formula lends intuition to the effects seen in its application (fig. 2), but also highlights that this function distorts the geometry of the input data space. Thus, some analyses of the results of this function are inappropriate, such as inferring meaning from changes in distances between specific points. Despite these potential caveats, non-linear transformations (e.g., through application of kernels) are frequently used with linear discrimination (such as K-means clustering) 32 . In this paper we investigate the effect of kernel methods such as RBF on materials science data, specifically studying the use of such methods to improve the suitability of LOCO-CV by addressing the problem of uneven cluster sizes outlined in section 2.2.1.
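One way to realise this preprocessing step with scikit-learn is via the `RBFSampler` approximation of the RBF feature map, followed by K-means in the transformed space. This is a sketch of the idea under our own naming and defaults; an exact kernel computation could be substituted for the sampler:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.kernel_approximation import RBFSampler

def kernelised_clusters(X, k, gamma=1.0, seed=0):
    """Apply an (approximate) RBF feature map, then K-means, so that
    LOCO-CV folds are drawn from clusters found in the kernelised space."""
    X_rbf = RBFSampler(gamma=gamma, random_state=seed).fit_transform(X)
    return KMeans(n_clusters=k, n_init=10,
                  random_state=seed).fit_predict(X_rbf)
```

The cluster labels returned here would then replace the plain K-means labels in the LOCO-CV fold construction.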
We find RBFs reduce the variance of class sizes in a clustering, regardless of input featurisation and note that this results in more reliable model convergence when using these clusterings for LOCO-CV.

Performance metrics in K-means clustering
Without prior knowledge of expected clusters for each data point, results found with K-means clustering are difficult to interpret, though expert inspection can yield insights into what different clusters can represent. Expert inspection of results may be justifiable with fewer than 10 clusters (each of which could have thousands of materials); however, when using K between 2 and 10 (as was originally proposed 6 ), the LOCO-CV algorithm presents 54 different clusters (∑_{n=2}^{10} n = 54), making such expert inspection infeasible. Thus metrics must be used to quantify the success of a clustering.
Where target labels exist, metrics such as mutual information score, homogeneity, and completeness scores can be used. Without labels, Euclidean distance-based measures such as sum-squared distance to cluster centroid or average distance between each point and the other points in its cluster can be used, however this does not intrinsically tell us how much information is in a clustering, just how tightly packed a cluster's members are. The average distance between each point and the other points in its cluster is computationally prohibitive so will not be used in this study.
Euclidean distance-based measurements such as these lack comparability in our use case, as each dataset and each featurisation technique should be considered independent. Identifying trends in these measurements with different numbers of clusters and looking at the effect of kernel methods on Euclidean distance-based measurements are both valid uses. However, as Euclidean space is affected by dimensionality, it is important that conclusions about the effect of different featurisation approaches are not drawn from such measures. While noting these caveats, we use the mean distance of a point in a cluster to the cluster's centroid as a measure of how tight the clusters are in Euclidean space; we label this metric the spread of cluster.
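Concretely, the spread-of-cluster metric amounts to the following (function name ours):

```python
import numpy as np

def spread_of_cluster(X, labels, centroids):
    """Mean Euclidean distance from each point to its assigned
    cluster centroid."""
    distances = np.linalg.norm(X - centroids[labels], axis=1)
    return float(distances.mean())
```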
As the aim of this investigation is to improve the validity of measures taken with LOCO-CV, specifically to address issues with vastly uneven cluster sizes, we also use the standard deviation in cluster sizes as a metric for success (the unevenness in cluster sizes). Materials science datasets may have uneven cluster sizes due to research bias towards exploration of promising materials; identically sized clusters would be unexpected for materials data and were, in practice, never observed in this study. The unevenness of cluster sizes serves as a measure of whether cluster sizes differ by many orders of magnitude, which would affect the validity of measurements taken using LOCO-CV. This does not imply that more even clusters are more chemically sensible groupings of materials, just that they may be more sensible for use with LOCO-CV, as uneven cluster sizes bring into question measurements taken with LOCO-CV (section 2.2.1).
Unlike spread of cluster, it is valid to compare standard deviation in cluster sizes between featurisation techniques, however different datasets would be expected to differ in their ease of clustering. As such, we perform max-min normalisation across different featurisation techniques and numbers of clusters in the same dataset, and use these normalised figures to compare between datasets.
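The unevenness metric and its max-min normalisation amount to the following (function names ours):

```python
import numpy as np

def unevenness(labels):
    """Standard deviation of cluster sizes for a single clustering."""
    _, counts = np.unique(labels, return_counts=True)
    return float(counts.std())

def max_min_normalise(values):
    """Scale a set of unevenness values (e.g. across featurisations and
    cluster counts for one dataset) into [0, 1] so that different
    datasets can be compared."""
    v = np.asarray(values, dtype=float)
    return (v - v.min()) / (v.max() - v.min())
```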

Effect of representation on predictive ability of random forest: Case Studies
We examine five case study publications' datasets to compare the representations used in them with a non-structural CBFV examined in previous work 5 , and with the composition vector (CompVec) suggested for use with ElemNet 11 . Case studies have been selected to incorporate the prediction of a variety of material properties, a range of research groups, and notable works that reflect the state-of-the-art. We use the original datasets to replicate the studies, but use 80-20 train-test splits.
We use a consistent 80-20 train-test split across all datasets to enable us to draw conclusions about which representations work better generally. This should help us to establish whether previous findings (i.e. that domain knowledge is more beneficial in smaller datasets and that this benefit diminishes as dataset size increases beyond 1000) 5 hold true for RFs. LOCO-CV measurements for these experiments are available in the supporting information, and the clusterings found for LOCO-CV are available in the associated git repository 34 .
Representations compared are:
• Oliynyk 13 : Originally designed for prediction of Heusler structured intermetallics 13 .
• JARVIS 14 : JARVIS combines structural descriptors with chemical descriptors to create "classical force-field inspired descriptors" (CFID). Structural descriptors include bond angle distributions of neighbouring atomic sites, dihedral atom distributions, and radial distributions, among others. Chemical descriptors include atomic mass and mean charge distributions. The original work generated CFIDs for tens of thousands of DFT-calculated crystal structures 14 , and subsequent work adapted CFIDs for individual elements to be used in CBFVs for arbitrary compositions without known structures (i.e. fig. 1a) 5 .
• Magpie 12 : While the Materials-Agnostic Platform for Informatics and Exploration (MAGPIE) is the name of a library associated with Ward et al.'s work 12 , it has become synonymous with the feature set used in that paper, and as such we use Magpie to refer to the feature set. These features include 6 stoichiometric attributes, which are different normalisations (L^p norms) of the fractions of the elements present; these capture information about the ratios of the elements in a material without taking into account what the elements are. A further 115 elemental attributes are derived from the minimum, maximum, range, standard deviation, mode (the property of the most prevalent element), and weighted average of 23 elemental properties, including atomic number, Mendeleev number, and atomic weight, among others. The remaining features are derived from valence orbital occupation and ionic compound attributes (which are based on differences in electronegativity between constituent elements in a compound).
• RANDOM_200 5 : a random vector featurisation used by Murdock et al. to represent a lower bound for performance.
• fractional 5 : An implementation of a one-hot style encoding of composition which includes average, sum, range, and variance of each element.
• CompVec 11 : a one-hot style encoding of composition as used in ElemNet 11 (containing only the proportions of each element in a composition). Differences between this and fractional are discussed further in section 2.1.
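To make the distinction concrete, a CompVec-style encoding can be sketched as follows (a simplified sketch; the element list is truncated for illustration, whereas the real vector has one entry per element of the periodic table):

```python
# CompVec: a vector with one entry per element, holding that element's
# fractional proportion of the composition. No elemental properties or
# aggregation functions are involved, unlike the fractional CBFV.
ELEMENTS = ["H", "O", "Fe", "Cu", "Ba", "Y"]  # truncated for illustration

def compvec(composition):
    # composition: dict mapping element symbol -> stoichiometric amount,
    # e.g. {"Fe": 2, "O": 3} for Fe2O3.
    total = sum(composition.values())
    vec = [0.0] * len(ELEMENTS)
    for element, amount in composition.items():
        vec[ELEMENTS.index(element)] = amount / total
    return vec

fe2o3 = compvec({"Fe": 2, "O": 3})  # proportions: 0.4 Fe, 0.6 O
```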
We compare each of these representations to a random projection of equal size. This allows us to control for the size of a representation when investigating the advantage of the domain knowledge built into a CBFV. Several of the five case studies investigated contain multiple applications of ML within a single publication. The tasks which were recreated in this comparison (and their relevant case study references) are as follows:
• T c : Using a regressor to predict the superconducting critical temperature (T c ) of a material (12666 data points in training set) 15 .
• T c > 10K: Classifying if the T c of a material is greater than 10K (12666 data points in training set) 15 .
• T c |(T c > 10K): Regressing to find T c given that T c > 10K (4833 data points in training set) 15 .
• HH stability: Predicting the stability of half-Heuslers (8948 data points in training set) 16 .
• E gap (oxides): Predicting the band gap of oxides found in the Computational Materials Repository database (599 data points in training set) 18 .
• Glass Forming Ability (GFA): predicting the ability of a bulk metallic glass alloy (BMG) to exist in an amorphous state (5051 data points in training set) 17 .
• D max : Predicting the critical casting diameter of a BMG (4724 data points in training set) 17 .
• ∆T x : The supercooled liquid range of a BMG (495 data points in training set) 17 .
• E gap (DFT): Predicting the band gap of materials calculated using DFT (35653 data points in training set) 19 .
• E gap (exptl): Predicting the band gap of materials measured experimentally (1986 data points in training set) 19 .
• E gap (DFT) ∪ E gap (exptl): Predicting the band gap of a dataset consisting of both DFT calculated and experimentally measured band gaps (37639 data points in training set) 19 .
Performance in regression tasks is measured using the r 2 correlation, and performance in classification tasks is measured using accuracy. The percentage improvement over random projections can thus be written as

improvement (%) = 100 × (M(y, ŷ) − M(y, ŷ_p)) / M(y, ŷ_p),

where y is the target label for a prediction, ŷ is the label predicted by a model that uses a given representation, ŷ_p is the label predicted by a model that uses a random projection of equal size to the given representation, and M is accuracy for classification tasks and r 2 for regression tasks. Measurements found using other values of M can be found in the supplementary information.
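The percentage improvement described here can be computed as in the following sketch (the function name is ours):

```python
def percentage_improvement(m_rep, m_rand):
    # m_rep  = M(y, y_hat):   metric score for the given representation.
    # m_rand = M(y, y_hat_p): metric score for an equally sized random
    #                         projection (accuracy or r^2, as appropriate).
    return 100.0 * (m_rep - m_rand) / m_rand

improvement = percentage_improvement(0.84, 0.80)  # 5% over the baseline
```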
Overall, recreation of these tasks shows that, broadly, changes in CBFV made little difference to performance when compared to a random projection of the same size (fig. 4). Featurisation methods inspired by domain knowledge do show advantages in some datasets. These advantages appear to be task-specific rather than dependent on dataset size: band gap-based tasks benefit from knowledge-based features, while most other tasks see no noticeable improvement from this feature engineering (fig. 4). This could be because vast amounts of band gap data can be acquired through DFT calculations 35 , and as such band gap prediction is a widely available benchmark that researchers could use when testing a newly proposed CBFV 36 .
Intuition may suggest that introducing more dimensions that do not contain any additional information would result in worse algorithmic performance. However, despite having 68% more dimensions, RANDOM_200 performs within 5% of the fractional representation. On large enough datasets (n ≳ 3000) the random representation does not perform appreciably differently from the Magpie representation. Notably, on tasks outside of band gap prediction there is little advantage to domain-based representations over a random projection.
We encourage the use of random projection as an alternative to CBFV, and propose its use as a comparative measure against CBFV. If a feature set cannot appreciably outperform a random projection of the same size or smaller, then, while there may still be benefits to analysis of the feature importance of such a feature set, that feature set does not enrich the representation of a material when it comes to algorithmic performance.

Improving the linear separability of chemical data spaces for more applicable measurements of extrapolatory power
We investigated which of the representations of a compound outlined in section 3.1 lead K-means clustering to identify more evenly sized clusters in different datasets. Datasets investigated were those used in section 3.1, as well as the Inorganic Crystal Structure Database (ICSD) as a whole.
In classical computer science problems, non-linear kernels have been applied to datasets on which a linear discriminator (such as K-means or a support vector machine) exhibits poor performance. As described in section 2.3, applying a non-linear transformation (e.g., a kernel function) to every data point in a dataset can transform the data such that it is more amenable to linear discrimination (fig. 2). We applied the radial basis, additive χ 2 , and skewed χ 2 functions to the investigated representations to see if these non-linear transformations would reduce the cluster size unevenness found by K-means clustering. Reduced cluster size unevenness found with K-means would improve the applicability of LOCO-CV measurements, addressing one of the problems highlighted in section 2.2.
As the additive χ 2 and skewed χ 2 functions are only well defined for positive inputs, data was scaled between 0 and 1 using min-max normalisation before these methods were applied. As RBF (and K-means without kernels) can be affected by disparities of scale between axes, the effects of different normalisation methods were also investigated, with the normalisation that most often resulted in the lowest cluster size unevenness being used for the results below (no normalisation was used with RBF, and min-max scaling to between -1 and 1 was used when no kernel method was being applied). Further details can be found in section S1 of the supplementary material.
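These steps map naturally onto scikit-learn's kernel_approximation module; the sketch below uses random stand-in data and illustrative hyperparameters, not the study's exact configuration:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.kernel_approximation import AdditiveChi2Sampler, RBFSampler
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)
X = rng.random((200, 20)) - 0.5  # stand-in for a featurised dataset

# RBF: applied to the unnormalised data, per the normalisation choice
# described above, then clustered with ordinary K-means.
X_rbf = RBFSampler(gamma=1.0, random_state=0).fit_transform(X)
labels_rbf = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(X_rbf)

# The chi^2 samplers are only defined for non-negative inputs, so the
# data is min-max scaled to [0, 1] first.
X_chi2 = AdditiveChi2Sampler().fit_transform(MinMaxScaler().fit_transform(X))
labels_chi2 = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(X_chi2)
```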
All three kernel functions investigated resulted in more evenly sized clusters than applying no kernel function at all, with RBF, on average, resulting in the largest reduction in standard deviation between cluster sizes (fig. 5). Additionally, we note that application of any of these kernel methods generally resulted in a reduction in distance between points in a cluster and their centroids (spread of cluster), indicating more tightly packed clusters (fig. 6b). On average, application of skewed χ 2 saw the greatest reduction in spread of cluster. As this investigation looks to create more even cluster sizes for use with LOCO-CV, we focus on the impact of RBF, as, of the kernel methods tested, it produced the largest reduction in the standard deviation of cluster sizes.
Before application of a kernel function, we note that cluster sizes are more even in domain knowledge-based representations, as measured by the standard deviation in cluster sizes. The CompVec representation resulted in a larger standard deviation between cluster sizes (i.e., less evenly sized clusters) than all other representations investigated, likely due to the sparse nature of this representation, with the Magpie representation resulting in the most even cluster sizes (fig. 7a). The two one-hot based representations, fractional and CompVec, generally did not result in cluster sizes as even as those of other representations. CompVec performed substantially worse than fractional despite the two being very similar in nature, differing only in the use of aggregation functions (as discussed in section 2.1). RBF universally resulted in more even clusters.

Fig. 6 (b) Application of kernel methods reduces the spread in Euclidean space within a cluster. This effect is most pronounced with skewed χ 2 and RBF. (c) To visualise these results, PCA was used to generate the first three principal components of all compositions in the ICSD featurised using CompVec. Colours correspond to clusters found by K-means (k=5) clustering on this representation. Inspection of these clusters reveals highly anisotropic clusters with no meaningful boundaries in the data to unambiguously separate clusters. (d) The first three principal components found when examining an RBF transformation of the ICSD (featurised using CompVec); points are coloured according to clusters found by K-means (k=5) applied to the kernelised data. The application of an RBF (as defined in section 2.3) to every composition vector in the ICSD (before clustering) leads to clusters that are more isotropic, with more clearly resolved boundaries between clusters.
The smallest change (as a percentage of the standard deviation in cluster size before application of RBF) was seen in the fractional and CompVec representations (two of the representations with the worst performance in this metric) (fig. 6a). However, outside these two representations, the proportional impact of RBF on this measure did not correlate with the performance of a CBFV in this measure prior to application of RBF.
Without use of kernel functions, there is a clear correlation between the size of a representation and the spread of the clusters found using that representation, with the exception of CompVec, which saw the tightest clusters (fig. 7b). This trend is no longer seen after application of RBF. Application of RBF to a CBFV before K-means clustering reduced the spread of clusters found (figs. 6b and 7b). The relative size of the change seen after application of RBF correlated with the spread of clusters found when no kernel method was used: the higher the spread of clusters found using a CBFV without a kernel method, the larger the change seen when clustering using that CBFV and an RBF.
Use of kernel methods in featurisation results in more even cluster sizes when using that featurisation for K-means clustering. As the featurisation used for clustering in LOCO-CV is independent of that used for learning, incorporating these kernel methods into LOCO-CV is simple and applicable regardless of machine learning algorithm, chosen metric, and initial representation (fig. 3). Thus we recommend the use of kernel methods when using K-means clustering for LOCO-CV, to address the issue of uneven cluster sizes (as discussed in section 2.2). Addressing this issue results in models converging more reliably under LOCO-CV (fig. 8), and adds applicability to measurements taken with LOCO-CV, allowing for better measurement of the extrapolatory power of an algorithm, which is of particular importance in materials science.
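A kernelised LOCO-CV split generator along these lines can be sketched as follows (a simplified sketch; the function name, gamma, and k values are illustrative, and the clustering featurisation need not match the representation fed to the learner):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.kernel_approximation import RBFSampler

def kernelised_loco_splits(X_cluster, k=5, gamma=1.0, seed=0):
    # Cluster on an RBF-kernelised copy of the featurised data, then
    # hold out one cluster at a time as the test set. X_cluster need
    # not be the representation later used by the supervised learner.
    X_kernel = RBFSampler(gamma=gamma, random_state=seed).fit_transform(X_cluster)
    labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(X_kernel)
    for cluster in range(k):
        yield np.where(labels != cluster)[0], np.where(labels == cluster)[0]

rng = np.random.default_rng(1)
X = rng.random((100, 10))  # stand-in featurised dataset
splits = list(kernelised_loco_splits(X, k=4))  # four train/test splits
```

Each yielded pair of index arrays can then be used to train and evaluate any model, with any metric, exactly as in ordinary LOCO-CV.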

Clustering Random Projections with and without kernel methods
Having established that random projections perform similarly to engineered feature vectors in many tasks (section 3.1), and that kernel methods can be used to reduce cluster size variance in K-means clustering on materials datasets (section 3.2), experiments were carried out to measure the cluster size variance of random projections of compositions both with and without application of kernel methods. Without application of kernel functions, when each CBFV was compared to a random projection of equal size (fig. 9a), using random projections of composition vectors did, more often than not, result in more evenly sized clusters than CompVec, but less evenly sized clusters than all other CBFVs investigated. However, no representation (either random projection or CBFV) universally resulted in more even clusters. Comparing the best performing size of random projection (88 dimensions) with other CBFVs without any kernel methods did narrow the differences in cluster size unevenness (fig. S2b); however, other CBFVs still outperformed random projections in several datasets.
Radial basis, additive χ 2 , and skewed χ 2 functions were applied to these projections before clustering using K-means. The resulting clusters were compared to those found without any kernel methods, showing that RBF and skewed χ 2 did reduce cluster size unevenness (fig. 9b). However, these results still do not create a consistent pattern of either outperforming or underperforming the cluster size unevenness found by applying RBF to CBFVs (fig. S2a). As no representation universally results in more even clusters, a variety of CBFVs and random projections should be investigated when choosing the best representation for clustering a dataset. Application of kernel methods such as RBF is advantageous in this context regardless of representation.

Discussion
Recreation of the studies discussed in section 3.1 shows that, broadly speaking, the featurisation methods used in research are not necessarily advantageous over random projections, especially on tasks that are not related to band gaps. Machine-learning-led research in materials science often aims to highlight the success of a machine learning model, either in a materials discovery pipeline, as a proof of concept that a model can learn from a given dataset, or as a proof of concept that a property can be predicted. As such, the exact implementation of a CBFV and its effectiveness compared to other CBFVs are often not included in the main text of a paper. Comparison studies thus facilitate evaluation of the impact of CBFVs on ML performance.
With modern libraries such as matminer 37 , creating new featurisation methods and changing existing ones is straightforward. However, the engineered featurisation methods show no advantage over more widely used or simpler alternatives in the tasks considered here.
Both the findings here and those in previous work suggest that for sufficiently large and balanced datasets, domain knowledge in CBFVs yields only a small advantage 5 . Promising results in representation learning could further reduce these advantages 38 , so whether the small advantages of feature-engineered CBFVs justify the difficulty of comparing models that use them remains an open question.
Choice of representation for a supervised ML algorithm may be influenced by the extent to which the goal of the algorithm is to maximise predictive accuracy for a property (e.g., to screen potential candidates for synthesis), and the extent to which the goal is to gain insight into the causes of that property. Linked to this consideration is the question of whether domain knowledge features are being used as proxy for the composition, or whether the composition is a proxy for the properties of a material which are quantified by the domain knowledge features.
For example, a model trained to predict whether a superconductor has a T c greater than 30K could be trained on a CBFV and find that the number of d electrons is an important indicator for this property. A similar model could be trained using a CompVec representation and find that containing Cu is an important indicator for this property. Whether the number of d electrons serves as a proxy for the presence of Cu in a material, or the presence of Cu serves as a proxy for the number of d electrons, is a matter of perspective. Bearing this difference in perspective in mind may help guide the choice of representation best suited to the workflow in which a machine learning algorithm is being used. If we use ML to gain insight into the causes of properties and phenomena, then examining the importance of different domain knowledge areas in a CBFV for an algorithm will allow us to do that. The task then becomes a matter of finding the best set of properties for an element to adequately explain how it interacts with the chemistries of a compound, and experimenting with various combinations of elemental properties becomes appealing. However, to justify this approach, adequate analysis of which properties are important is needed. When choosing a representation to maximise predictive accuracy, domain knowledge seems to provide some advantage for some tasks examined here (particularly band gap prediction tasks). However, we do not think this evidence, nor that found in previous work 5 , is sufficient to reject featurisation methods without domain knowledge, such as fractional encoding of composition or random projections, for more complex or parameter-dependent algorithms. When using a CBFV, random projection offers a helpful baseline for performance, as it is simple to implement and works fairly well. Its single hyperparameter is the size of the projection, which allows one to draw conclusions as to the usefulness of a CBFV under investigation without introducing the size of a representation as a contributing factor for its performance.

Fig. 8 (a) r 2 performance of regression tasks measured with LOCO-CV clustered using different CBFVs, and measured using a traditional 80-20 split (labelled "Not LOCO-CV"). In all cases the model was trained using data in a CompVec representation, so we can better examine the effect of LOCO-CV on the measurement drawn from a given model. In many cases the same model which performs well under a traditional 80-20 split training regimen fails to converge in LOCO-CV measurements. (b) Application of RBF to CBFVs before K-means clustering for LOCO-CV results in far fewer models failing to converge than seen in (a).

Fig. 9 (a) Reduction in cluster size unevenness (standard deviation in cluster size) of different CBFVs when compared to equally sized random projections of composition vectors across different datasets, with no kernel applied. While random projection consistently outperformed CompVec, all other CBFVs form more even clusters than an equally sized random projection. (b) Average cluster size unevenness found using K-means clustering on datasets featurised using random projections of various sizes. Cluster size variances are normalised between 0 and 1 for each dataset (as different datasets would be expected to cluster with different amounts of ease), and then averaged for each size of random projection and each kernel. RBF and skewed χ 2 are seen to reduce cluster size unevenness, with projections of approximately 100 dimensions performing better than larger projections.
Extrapolatory power is particularly pertinent in the materials discovery field, and previous work thus presented LOCO-CV as a way to estimate the extrapolatory power of a supervised machine learning algorithm 6 . LOCO-CV (along with many other linear algorithms, such as principal component analysis) relies on linear separability in the data. We show that, regardless of the representation being used, kernels such as RBF are advantageous in reducing cluster size unevenness, and so should be strongly considered wherever such linear algorithms are applied. This reduction in cluster size unevenness tackles previously discussed caveats to LOCO-CV and results in more reliable model convergence (fig. 8).
We examine the use of random projections to featurise chemical compositions for use with kernelised LOCO-CV. As with the other CBFVs examined, random projections used in conjunction with kernel methods produce more even clusters than those used without kernel methods. However, no representation (either CBFV or random projection) consistently resulted in more even clusters than all other representations. While CBFVs (with the exception of CompVec) usually found more even clusters than random projections, these findings were not universal across the datasets tested. Kernel methods applied to random projections resulted in cluster sizes even enough to be usable in the LOCO-CV algorithm without negatively impacting conclusions drawn from measurements taken using this method.
Random projections and kernelised LOCO-CV can be used together to create a generalised workflow for evaluating the extrapolatory power of a supervised machine learning algorithm, which can be used regardless of input representation to the machine learning algorithm in question. This can be combined with using a random projection as input representation to the machine learning algorithm to see a baseline measure of extrapolatory power which prospective CBFVs can be compared against to measure their usefulness.

Conclusion
We demonstrate that random projections are a generic and powerful way to featurise compositions for material property prediction. This is motivated by fundamental principles discussed in the Johnson-Lindenstrauss lemma 26 : randomly projecting a composition vector moves such vectors into a space of different dimensionality while preserving relationships between points in a dataset (within some error). These random projections have only a single hyperparameter (the size of the projection), which allows us to isolate the relationship between the dimensionality of a representation and the predictive performance of algorithms trained using that representation. Random projections can be used as a baseline representation to examine what benefit is added by the domain knowledge imbued into CBFVs.
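This distance-preservation property can be checked directly (a minimal sketch with illustrative dimensions, not the projections used in the study):

```python
import numpy as np
from sklearn.random_projection import GaussianRandomProjection

rng = np.random.default_rng(0)
X = rng.random((50, 500))  # 50 toy "compositions" in a 500-dimensional space

# Project to 200 dimensions with a random Gaussian matrix.
X_proj = GaussianRandomProjection(n_components=200, random_state=0).fit_transform(X)

def pairwise_dists(A):
    # Full matrix of pairwise Euclidean distances.
    return np.linalg.norm(A[:, None, :] - A[None, :, :], axis=-1)

# Ratios of projected to original pairwise distances stay close to 1,
# as the Johnson-Lindenstrauss lemma predicts (within some error).
iu = np.triu_indices(50, k=1)
ratios = pairwise_dists(X_proj)[iu] / pairwise_dists(X)[iu]
```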
We investigate how common CBFVs could be used in ten property prediction tasks from the literature, in order to establish what advantage domain knowledge offers in constructing such vectors. With the notable exception of band gap prediction tasks, CBFVs engineered to incorporate domain knowledge do not substantially outperform an equally sized random projection for most prediction tasks investigated here. If the purpose of an ML model is to maximise predictive performance, the choice of one of many complex representations (e.g., CBFVs) should be justified by demonstrating an advantage over a random projection of the same size.
We present kernelised LOCO-CV to overcome issues with imbalanced cluster sizes that often occur when performing linear clustering on materials science datasets. The application of kernel methods, such as the RBF examined here, to data before K-means clustering leads to more even cluster sizes across many different datasets and input representations. Further, using these kernel-modified clusters in LOCO-CV led to more reliable model convergence in the models examined here. Applying kernels in LOCO-CV is independent of the representations used by a supervised machine learning algorithm, so we strongly suggest that researchers looking to deploy LOCO-CV use the kernelised version presented here. Both random projections and kernelised LOCO-CV can be implemented independently or together.
We trained over 70 random forest models across ten property prediction tasks found in the materials science literature to show that random projections are a reliable baseline to use when evaluating a CBFV. We also evaluated over 36,000 K-means clustering applications, on the datasets used in these tasks as well as on the ICSD, and have shown that applying kernel functions to these data before K-means clustering results in more evenly sized clusters, and in more reliable model convergence when these clusters are used in LOCO-CV. Our findings provide materials scientists with a basis for selecting and evaluating representations and for laying out evaluation workflows.

Methods
The above experiments were implemented in Python using the RF, K-means clustering, and kernel method implementations from the scikit-learn library 9 . Hyperparameters of all scikit-learn algorithms were left at their defaults as of version 0.24.1, with the exception of the value of k for K-means clustering, which was varied between 2 and 10 as needed for the LOCO-CV algorithm. While data standardisation was sometimes performed before application of K-means clustering (as detailed in the supplementary information section S2), it was not performed before use of RFs: by their nature, RFs consider dimensions independently, making such standardisation redundant.
Graphs were plotted with the Matplotlib library 39 , with the exception of fig. 8, which also uses the Seaborn library 40 . Featurisation was done using the utilities provided with the GitHub repository associated with Murdock et al. 5 , with the exception of CompVec, which was implemented from scratch, and the case-study-specific featurisations, which were obtained from the supplementary information of the relevant case study. All implementations are made available through the associated git repository, as are the data used in this study 34 .