Open Access Article
Mashrafee Aryan†a, Daniel Struble†a, Felix Campbellb, Saroj Upretia, S. M. Ashik Abedine, Aahil Khambawlac, Jeetain Mittalcd, Michael S. Dimitriyeve, Emily B. Pentzerde, Svetlana A. Sukhishvilie, Xiaodan Gua and Boran Ma*a
aSchool of Polymer Science and Engineering, University of Southern Mississippi, Hattiesburg, MS 39406, USA. E-mail: boran.ma@usm.edu
bDepartment of Chemistry, The University of the South, Sewanee, TN 37383, USA
cArtie McFerrin Department of Chemical Engineering, Texas A&M University, College Station, TX 77843, USA
dDepartment of Chemistry, Texas A&M University, College Station, TX 77843, USA
eDepartment of Materials Science and Engineering, Texas A&M University, College Station, TX 77843, USA
First published on 28th May 2026
Machine learning (ML) has heavily influenced the way scientific study is done with demonstrated successes in nearly every field. However, furthering performance and explainability of ML models in increasingly complex systems and with increasingly demanding outcomes requires a significant influx of high-quality data. To that end, this Review covers some of the techniques, instrumentation, and methodologies that have shown promise for significantly accelerating the discovery of polymer materials, optimization of their properties, and elucidation of property–application relationships through high throughput (HT) experimentation, characterization, and analysis. Attention is given to not only ML, but also to hardware advancements and their synergy with computational tools. Multiple studies are highlighted that demonstrate effective implementation of HT synthetic, data acquisition, or analytical methods, often with full integration of ML for the creation of fully autonomous workflows. We present our outlook and perspectives on the incorporation of HT techniques for the discovery and study of polymer materials within a broad range of applications, and include practical considerations for implementing HT methods at the laboratory scale.
While substantial progress has been made toward the rational design of polymer materials and deconvoluting the relationships between structure, processing, and properties, a number of significant roadblocks still exist. Of those, the lack of standardized, accessible, reusable data has been the most critical barrier to data-driven discovery.10 However, recent advancements in high throughput (HT) methods for polymer synthesis, characterization, and property modeling and prediction, enabled by improved access to robotics and automation tools along with coding and machine learning (ML) algorithms, have shown promise for accelerating the discovery and design of applied polymer materials.17,18
ML encompasses a variety of techniques whose tasks can be broadly grouped into classification, regression, and clustering.19 This diversity of modeling approaches, combined with relatively simple implementation enabled by advances in modern computation, makes ML an extremely useful and versatile tool for tackling the complex, high-dimensional data that are common in polymer science and engineering. In the last 10 years alone, publications using ML for polymer science have increased more than 100-fold (research papers in 2025 vs. 2015, Web of Science, keywords: polymer, ‘machine learning’), demonstrating the transformative nature of this tool for polymer scientists.
However, the Achilles’ heel of many ML models is their data-hungry nature, with many models requiring at minimum hundreds of data points to achieve reasonable performance. Relying solely on historic data or databases presents issues with accessibility, data provenance, and incomplete or missing metadata. In addition, data of unknown quality or non-reproducible data may be unknowingly introduced.20–22 Reliable polymer data presents additional complexities due to factors such as stochastic structure, dispersity, processing history dependence, and complex synthetic pathways that are often inadequately documented in available data.11,12 To that end, developing HT workflows to synthesize, characterize, and model polymers and polymeric materials is crucial for advancing and accelerating the study of applied polymers.
Traditional synthesis, characterization, and analysis methods are extremely slow, labor-intensive, and expensive by HT/ML standards.23 Many recent studies have increasingly focused on the development of HT experimentation platforms for accelerated workflows, often paired with ML modeling and ML-guided automation. HT platforms can synthesize and characterize large libraries of polymers in a short period of time with minimal human input. With ML models, researchers can accelerate or automate characterization tasks and create predictive models for polymer properties. By using large datasets generated from HT workflows integrated with ML prediction frameworks, it is now possible to build increasingly autonomous, closed-loop platforms for polymeric materials discovery and optimization.24–28
The goal of this Review is to highlight recent developments in HT methods that are of particular importance for polymer science, as well as their integration with ML to accelerate the study and innovation of polymer materials across a range of application spaces. Our focus on the synergy of HT/ML and its impact within a wide range of practical applications complements a number of other excellent reviews on other HT techniques and strategies,31–34 ML for polymer science,35,36 and on HT/ML for specific applications like polymer therapeutics,37 biomaterials,38 and organic solar cells.39
The remainder of this Review is organized and discussed in four sections (Fig. 1) related to (1) HT techniques and instrumentation for polymer synthesis, (2) HT approaches to characterization (including data acquisition and analysis), (3) the synergy of HT techniques with ML, and (4) showcase studies of applied polymers. Individual sections build upon one another, as HT synthesis necessitates HT measurement, which necessitates HT analysis and interpretation. Together, these approaches can generate large volumes of data ideal for ML modeling.
Fig. 1 Schematic overview of the experimental workflow: (1) synthesizing material, (2) taking measurements of properties of interest, (3) using ML to accelerate analysis, create models, and guide further experimentation, (4) use of new models or knowledge in application like (4A) rational design of materials with targeted modulus values or (4B) polymer solubility models. 4A adapted from ref. 29 with permission from the American Chemical Society, copyright 2024, licensed under CC-BY 4.0. 4B adapted from ref. 30 with permission from the Royal Society of Chemistry, copyright 2025, licensed under CC-BY-NC 3.0.
Ultimately, combining HT experimentation with ML can enable closed-loop workflows and increasingly autonomous laboratory systems to accelerate experimentation and discovery for applied polymers. It should be noted that, at this time, few systems exist without bottlenecks, and that this idealized workflow of HT synthesis, HT characterization, ML, repeat, poses many obstacles that are yet to be solved not only from a technical perspective, but also from financial, logistical, and human capital perspectives. Recently, fully closed-loop systems incorporating all HT and ML components have started to be realized,40,41 but broad application of fully HT/ML workflows remains aspirational for many labs and systems. This Review, therefore, focuses on many state-of-the-art techniques and instrumentation, often independent of a wholly HT or closed-loop system. At the end of this Review, we provide our perspectives and outlook for future directions of this field, including potential pathways to more widespread use and integration of the techniques and instrumentation highlighted herein.
One subfield that implements flow methods more commonly is the production of medical grade polymers. The consistency and fine-tuned control offered by flow synthesis is highly beneficial since batch-to-batch variability from traditional synthesis can lead to dramatically different outcomes for biological systems.62 This has been particularly relevant in recent years for drug delivery applications. Flow reactors have been used to fabricate polymers for drug delivery such as polymer-matrix nanomaterials,47,58,63 protein-loaded nanogels,64 and core cross-linked polymeric micelles.65 Studies that focused on optimizing consistency exhibited good control, demonstrating low particle polydispersity with reported values ranging from 0.1 to 0.001 (DLS-derived size distribution index),63,65 as well as controllable morphology through augmentation of flow conditions.58 One study reported a 12× throughput increase when converting from batch to flow.63 In another study, monomer conversion increased from 63% in batch to 97% in flow over an identical time frame due to faster flow reaction kinetics.47 It was posited that these kinetics could be attributed to higher mixing rates and shorter path lengths present in flow reactors.66 These studies on medical-grade polymers highlight key advantages of flow synthesis, including increased throughput and high purity, which could be extended to other areas of applied polymer research.
While flow polymerization offers significant advantages, there are limitations to be considered. Crosslinked, insoluble, or high molecular weight polymers can lead to clogging of flow reactors because of high viscosity or the formation of precipitates. Chanchaona et al. reported the flow synthesis of hypercrosslinked polymers and the difficulties faced due to the formation of insoluble products and solvent adsorption. However, by optimizing reactor design and conditions, these limitations were overcome and they reported 32- to 117-fold higher productivity than comparable batch reactions.67 Additionally, reactor materials could face incompatibility with reactants, solvents, or catalysts.47 One study demonstrated the fabrication of a reactor from perfluoropolyether with subsequent thermal treatment, providing chemical resistance high enough to tolerate the reactive conditions of RAFT polymerization.47 Another concern is that scaling up individual reactors can reduce the consistency of flow-based systems. For example, increasing reaction chamber diameters lowers heat transfer and can lead to different or varying mixing conditions that affect the formation of the desired product.68 This, again, is why numbering-up is a more common route to scale up flow systems.
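The heat-transfer penalty of widening a reactor channel can be illustrated with simple geometry. The sketch below assumes an idealized straight cylindrical channel, for which the wall area available for heat exchange per unit reaction volume is 4/d. The function name and the diameters are illustrative assumptions and are not taken from the cited studies.

```python
def surface_to_volume(diameter_m: float) -> float:
    """Wall area per unit internal volume (1/m) of a straight cylindrical channel.

    Lateral area = pi * d * L and volume = pi * d**2 / 4 * L, so the ratio is 4 / d.
    """
    if diameter_m <= 0:
        raise ValueError("diameter must be positive")
    return 4.0 / diameter_m


# Illustrative diameters only (assumed values, not from the cited work).
for d_mm in (0.5, 1.0, 5.0, 10.0):
    ratio = surface_to_volume(d_mm / 1000.0)
    print(f"d = {d_mm:4.1f} mm -> {ratio:7.0f} m^-1 of wall area per unit volume")
```

Under this idealization, a tenfold increase in diameter leaves one tenth of the wall area per unit volume, whereas numbering-up (running several identical channels in parallel) leaves the ratio unchanged.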
Among the most common forms of automation are modular platforms, robotic arms, and liquid-handling robots. “Modular platform” is a term that can be applied to almost any form of in-lab automation, but more specifically it can be considered as a platform with tailorable structure and a variety of modules to choose from for customizable experimentation.69 Modular platforms provide a basis for building up integrated synthesis, purification, processing, and/or characterization systems. These are found in polymer science for the production of polymer nanotubes,70 drug delivery polymers,71 and a variety of other applications.72,73 Modular platforms for parallel synthesis can be based on compatibility with 96-well plates commonly used in the biosciences. 96-well plates have seen frequent use in polymer science due to their low cost, easy maintenance or replacement, and simple implementation.10,44,74
Robotic arms provide the classical imagery of robotics integration into laboratories and have seen use in the development of application-driven polymers, such as integration into an automated polymer press, and the Polybot for electrical thin film processing.75–77 Robotic arms excel at repetitive, multi-step tasks, where their precision and dexterity allow them to perform operations at a level comparable to a human operator.69
Liquid-handling robots have been most widely used in life sciences.78 They are particularly useful for parallel polymerization to generate polymer libraries37,79,80 for the exploration of huge chemical spaces to elucidate structure–property relationships (Fig. 3A). Polymer libraries have been made across a wide range of application spaces, including medicine,81,82 electronics,24 and biodegradable plastics.83 For the pharmaceutical industry, library creation is considered a critical step for the discovery of new candidate molecules.84,85
Fig. 3 Automation of key steps in the synthetic workflow. (A) “Four steps” of RAFT polymerization library creation via Opentrons liquid handling robot. Reproduced from ref. 79 with permission from Wiley-VCH GmbH, copyright 2023, licensed under CC BY-NC-ND 4.0. (B) Automated dialysis system attached to liquid handling robot. Adapted from ref. 86 with permission from MDPI, copyright 2022, licensed under CC-BY 4.0. (C) Automated continuous dialysis system integrated with a synthesis robot. Reproduced from ref. 87 with permission from MDPI, copyright 2020, licensed under CC-BY 4.0.
There are many demonstrations of automated polymer library synthesis producing diverse polymers for characterization and use. Continuous flow with autonomous control of parameters has been demonstrated for the synthesis of a library of a hundred distinct block copolymers in just nine minutes.88 Liquid-handling robots have been demonstrated for polymer library creation using various mechanisms such as RAFT polymerization and solid-phase synthesis, including for applications such as cholesterol-lowering drugs.37,79,80,89
Common difficulties for robotic laboratory integration manifest in high upfront cost and limitations of current robotic hardware.10 Robotic integration for polymers specifically faces several unique challenges. Due to the high viscosity of polymers, clogging of micropipettes in liquid-handling robots and other similar systems can occur.90 In addition, many polymer starting materials are solids (monomers, catalysts, fillers) or require processing (powder mixing, molding, pressing) for which commercially available robotics systems are unsuited.91 Certain types of polymerization, such as living/controlled polymerization, are oxygen-sensitive and cannot be run in conventional open-air robotics systems.79 Adapting off-the-shelf robotics platforms to the needs of a functional polymer laboratory often requires further costs, expertise, and customized equipment.
The most explored method of automated purification is dialysis. Dialysis can be automated with liquid-handling robotics86 as well as flow systems92,93 (Fig. 3B and C). Flow systems may prove particularly attractive for pairing with dialysis, as higher dialysate flow rate has been demonstrated to improve clearance rate in other systems.94 In addition to faster purification, automated dialysis has also been demonstrated to use less solvent than comparable manual approaches.77,87 Ultrafiltration is a flow-compatible, underexplored purification method that handles incomplete conversion while maintaining a reasonable throughput.95 Additionally, gel permeation chromatography has been paired with a 96-well plate format, with 95% small-molecule impurity removal reported.96
In general, HT purification of polymers is an extremely underdeveloped field. In their user guide to high throughput workflows, Day et al. recommend automating only high-conversion reactions to bypass the need for purification. This is common practice, with researchers often opting for slower reactions that reach complete conversion rather than those requiring the setup and optimization of an additional purification step.97,98 However, purification in some approaches, like those involved in biopolymer and sequence-defined polymer synthesis, can be significantly simplified by the presence or absence of known functionalities that enable selective precipitation of the target polymers99 or efficient removal of impurities and low-molecular-weight oligomers,100 or by simple chromatographic fractionation when target molecular weights are discrete.42,101
Advancing traditional GPC, Murphy et al. have recently demonstrated automated systems capable of resolving dozens of narrowly dispersed polymer fractions that also include variations in block composition and molar mass, from a single parent polymer. This development has enabled the separation of large, well-defined libraries for downstream characterization.107–109
Depending on modality, microscopy can be broadly categorized into Optical Microscopy (OM), Confocal Fluorescence Microscopy (CFM), Electron Microscopy (EM), Atomic Force Microscopy (AFM), Super Resolution Microscopy (SRM), and others. CFM when integrated with HT methodology has found great use in analyzing the compositional heterogeneity of hundreds of olefin polymerization catalyst particles in both 2D and 3D.111 Similarly, HT OM has enabled rapid mapping of phase behavior in polymer blends and coacervates, which is otherwise quite time-consuming with conventional approaches.112,113 High-resolution EM with recent advances like 4D-STEM (scanning transmission electron microscopy) has allowed for precise mapping of crystalline domains in polymers, directly revealing chain arrangements and lattice deformations at submolecular resolution.114
Recently, AFM has seen increased imaging speed and throughput without major compromise in spatial resolution. This has been made possible due to the development of high-speed dynamic mode AFM, adaptive multiloop-mode imaging, and bimodal AFM.115–117 The automation of AFM workflows through robotic sample handling, automated tip exchange, batch-mode imaging, ML, and advanced data analytics is increasingly being used for combinatorial, complex, heterogeneous polymer systems (Fig. 4).60,118,119 Further, the development of polymer-based and 3D-printed AFM tips and cantilevers accompanied by multipurpose 3DTIPs and self-actuated cantilevers has helped reduce tip wear and improve imaging in air or water environments.120,121
Fig. 4 Combinatorial, automated supramolecular polymer blends facilitated by robotic synthesis, automated AFM, and ML analysis and modeling techniques. Adapted from ref. 60 with permission from ChemRxiv, copyright 2025, licensed under CC-BY-NC-ND 4.0.
Most lab-based XRS experiments are limited by inherently low-energy X-ray sources and typically require tens of minutes to hours to measure an individual sample.124 Fortunately, modern synchrotron sources are able to deliver extremely intense, focused, and monochromatic X-ray beams, allowing for collection of high-quality data in seconds to minutes.125 Using these sources, small/wide-angle X-ray scattering (SAXS/WAXS) experiments have been carried out over shorter time frames to investigate the structural evolution of different polymer sample types, such as conjugated polymers and block copolymers. HT scattering data can then be leveraged to establish structure–property relationships that can lead to tailored polymer systems with targeted properties.40,110
Polymer rheology has also seen advancements toward HT characterization, with researchers employing a range of methods including electromechanical,131 centrifugal132 and even optical microscopy,112,133 typically accompanied with computational analysis and often in conjunction with ML.134 These techniques all use relatively small amounts of material, a benefit for complex or difficult-to-synthesize polymers. Some of these HT rheological techniques can even obtain high-resolution time-dependent data in situ, allowing alternative HT options for evaluating related properties or behavior like polymerization kinetics.133
While regression predicts continuous outputs based on inputs, classification refers to predicting discrete groups based on inputs. An example of classification is a model that predicts “soluble” or “insoluble” as binary classes. Classification models can also be non-parametric and high-dimensional and are able to account for chemical, topological, and environmental descriptors to classify the behavior of even complex solutions.
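A binary "soluble"/"insoluble" classifier of the kind described above can be sketched with a small non-parametric model. The example below is a minimal illustration using a k-nearest-neighbors vote written in NumPy. The two descriptors, the labelling rule, and all data are synthetic assumptions made up for the sketch and do not come from any cited study.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical, synthetic descriptors scaled to [0, 1]:
# column 0 ~ polymer-solvent solubility-parameter mismatch, column 1 ~ molar mass.
n = 200
X = rng.uniform(0.0, 1.0, size=(n, 2))
# Assumed labelling rule for the synthetic data: 1 = soluble, 0 = insoluble.
y = (X[:, 0] + 0.3 * X[:, 1] < 0.6).astype(int)

# Sequester a holdout set before any fitting.
idx = rng.permutation(n)
train, test = idx[:160], idx[160:]


def knn_predict(X_train, y_train, X_query, k=5):
    """Majority vote among the k nearest training points (Euclidean distance)."""
    dist = np.linalg.norm(X_query[:, None, :] - X_train[None, :, :], axis=2)
    nearest = np.argsort(dist, axis=1)[:, :k]
    votes = y_train[nearest].mean(axis=1)  # fraction of neighbors labelled soluble
    return (votes > 0.5).astype(int)


y_pred = knn_predict(X[train], y[train], X[test])
accuracy = float((y_pred == y[test]).mean())
class_names = np.array(["insoluble", "soluble"])
print("first predictions:", class_names[y_pred[:5]])
print(f"holdout accuracy: {accuracy:.2f}")
```

In practice the descriptor columns would be replaced by chemical, topological, and environmental features of the real system, and the model choice would be benchmarked rather than assumed.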
Clustering is the third common ML task in polymer science and is useful for identifying common groupings in large datasets. It is most often used together with dimensionality reduction, which projects high-dimensional data into 2D or 3D while preserving either the global spacing between points or local neighborhood relationships. Further discussion and introduction to ML can be found in literature.10,136–140
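The pairing of dimensionality reduction with clustering can be sketched as follows. This minimal example assumes synthetic 8-dimensional data drawn around three made-up "polymer families", projects it to 2D with principal component analysis, and groups the projected points with a basic k-means loop. The data, the farthest-point initialization, and the function name are illustrative assumptions, not a method from the cited literature.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic stand-in for a featurized dataset: three families in 8 dimensions.
centers = np.zeros((3, 8))
centers[0, 0] = centers[1, 1] = centers[2, 2] = 10.0
true_family = np.repeat(np.arange(3), 50)
X = centers[true_family] + rng.normal(scale=1.0, size=(150, 8))

# Dimensionality reduction: PCA via SVD of the mean-centered data, keeping 2 components.
Xc = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
Z = Xc @ Vt[:2].T


def kmeans(points, k, n_iter=50):
    """Basic Lloyd's algorithm with a deterministic farthest-point initialization."""
    chosen = [points[0]]
    for _ in range(k - 1):
        d = np.linalg.norm(points[:, None, :] - np.array(chosen)[None, :, :], axis=2)
        chosen.append(points[np.argmax(d.min(axis=1))])
    C = np.array(chosen)
    for _ in range(n_iter):
        labels = np.linalg.norm(points[:, None, :] - C[None, :, :], axis=2).argmin(axis=1)
        new_C = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else C[j]
            for j in range(k)
        ])
        if np.allclose(new_C, C):
            break
        C = new_C
    return labels, C


labels, cluster_centers = kmeans(Z, k=3)
print("cluster sizes:", np.bincount(labels, minlength=3))
```

Real polymer datasets are rarely this cleanly separated, so the number of clusters and the choice of projection usually need to be checked against domain knowledge.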
As suggested herein, the most critical part of any ML model is the quality of its training data. ML models can be trained on experimental data, historic data curated from datasets, or a combination thereof. Database-derived datasets are typically larger than experimental datasets generated for a specific study, and often cover a larger range of inputs (e.g., more chemical diversity), but they frequently lack essential experimental details (e.g., reaction time, processing conditions, environmental humidity, etc.) or provenance required to evaluate data quality and reproducibility.
The process of training ML models varies from task to task but can generally be broken into the following steps (Fig. 5): (1) Dataset curation, either experimentally or from database(s).141 (2) Data preprocessing.142,143 This may include cleaning, scaling, or transforming the data, and determining appropriate tools for fingerprinting if needed, e.g., RDKit,144 Morgan fingerprints,145 etc. (3) Data splitting.146 If a test set (known as a holdout set) is being generated from the available data, it must be sequestered from the training data. It may be necessary to stratify the test set selection or otherwise ensure it is representative of the full dataset,147 particularly if the dataset is small. 5% and 10% of the total dataset are common holdout sizes. (4) Model training. Fitting the model to the training data is typically done using multiple instances of the same model architecture. This is usually accomplished by k-fold cross-validation, where k = 5 and k = 10 are common choices.148 Cross-validation results in a more robust estimate of the model's performance and generalizability to unseen data, making it a more reliable choice for hyperparameter selection than simply training the model one time on all available data. Hyperparameter optimization is also done during this time, where different options for model architectures and learning behaviors can be specified to see what best captures the underlying patterns in the data. Hyperparameter optimization is typically conducted using random search, grid search, or an optimization approach like Bayesian optimization.149 (5) Model evaluation. During training, the “best” model is evaluated using an appropriate loss function. For regression, the most common loss functions are mean squared error and mean absolute error. For classification, cross-entropy loss and hinge loss are the most common loss functions.
Once trained, overall model performance is typically evaluated using the coefficient of determination (R²), root mean squared error, and/or mean absolute error for regression, and accuracy, recall, and/or F1 score for classification.150,151 While cross-validation should improve the generalization of models, it is critical to bear in mind that models may still overfit, which is particularly problematic when predicting on data not derived from the original dataset.
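The five steps above can be sketched end to end for a regression task. This minimal example assumes a synthetic dataset, a ridge regression model, a 10% holdout, 5-fold cross-validation, and a small grid of regularization strengths. All names and values are illustrative assumptions rather than a recommended configuration.

```python
import numpy as np

rng = np.random.default_rng(42)

# (1) Curate: synthetic stand-in for a featurized polymer dataset.
n, p = 200, 5
X = rng.normal(size=(n, p))
y = X @ np.array([2.0, -1.0, 0.5, 0.0, 0.0]) + rng.normal(scale=0.3, size=n)


# (2) Preprocess: scaling statistics are computed from the training rows only.
def fit_ridge(X_tr, y_tr, alpha):
    mu, sd, y_mu = X_tr.mean(axis=0), X_tr.std(axis=0), y_tr.mean()
    Xs = (X_tr - mu) / sd
    w = np.linalg.solve(Xs.T @ Xs + alpha * np.eye(Xs.shape[1]), Xs.T @ (y_tr - y_mu))
    return mu, sd, w, y_mu


def predict(model, X_new):
    mu, sd, w, y_mu = model
    return ((X_new - mu) / sd) @ w + y_mu


# (3) Split: sequester a 10% holdout set that is never used for model selection.
idx = rng.permutation(n)
hold, dev = idx[: n // 10], idx[n // 10:]


# (4) Train: 5-fold cross-validation inside a grid search over alpha.
def cv_mse(alpha, k=5):
    folds = np.array_split(dev, k)
    errors = []
    for i in range(k):
        tr = np.concatenate([folds[j] for j in range(k) if j != i])
        model = fit_ridge(X[tr], y[tr], alpha)
        errors.append(np.mean((predict(model, X[folds[i]]) - y[folds[i]]) ** 2))
    return float(np.mean(errors))


alphas = [0.01, 0.1, 1.0, 10.0, 100.0]
best_alpha = min(alphas, key=cv_mse)

# (5) Evaluate: refit on all development data, then score once on the holdout.
final_model = fit_ridge(X[dev], y[dev], best_alpha)
resid = y[hold] - predict(final_model, X[hold])
rmse = float(np.sqrt(np.mean(resid ** 2)))
mae = float(np.mean(np.abs(resid)))
r2 = float(1.0 - np.sum(resid ** 2) / np.sum((y[hold] - y[hold].mean()) ** 2))
print(f"alpha={best_alpha}, R2={r2:.3f}, RMSE={rmse:.3f}, MAE={mae:.3f}")
```

Keeping the scaling step inside `fit_ridge` means each cross-validation fold is scaled with its own training statistics, which avoids leaking information from validation or holdout rows.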
This overview covered some basic concepts for understanding the rest of the section, but ML includes a range of techniques and methodologies that are beyond the scope of this Review and often with specific considerations on a per-dataset or per-task basis. Readers interested in polymer-specific ML best practices are directed to other resources related to ML models,19,152 ML methods and tutorials,139,152 polymer representations and selections for ML,153–156 model validation strategies,152,156 uncertainty and extrapolation limits of ML models,157,158 explainable ML,159–161 and more detailed discussions on the state of the ML field with respect to polymers, specifically.10,36,162–164
Given the complexity of fully elucidating characterization spectra, robust ML for comprehensive interpretation remains a challenge. However, in specific applications with limited need for generalization there has been some progress. For example, ML has identified sterile or contaminated cells,170 and has been used to predict particular features within UV-Vis spectra that correlate with phototoxicity.171 A somewhat easier task, predicting spectra from compounds has been successfully demonstrated and shows promise as a screening tool. ML methods have been used to predict UV-Vis spectra for drug discovery172 and photodetector design.173
Fig. 6 Example workflow for an automated microscopy pipeline with in-line ML. Reprinted from ref. 177 with permission from the American Chemical Society, copyright 2021. Original elements (STEM imaging) adapted from ref. 178 with permission from Springer Nature, copyright 2021, licensed under CC-BY 4.0.
Additionally, clever approaches to bypassing the difficulty of interpreting complex scattering data have been developed that leverage ML. The Jayaraman group has developed a Computational Reverse-Engineering Analysis for Scattering Experiments (CREASE) method that uses ML to both automate and accelerate the interpretation of complex scattering patterns (Fig. 7).191–194 ML models can also identify block copolymer phases from noisy data, including gyroid and σ phases, from scattering data without requiring any chemical information and in real time.180
Fig. 7 CREASE scattering elucidation. (A) Overview of the CREASE workflow. Reproduced from ref. 179 with permission from the American Chemical Society, copyright 2021. (B) CREASE applied to small-angle neutron scattering profiles where black curves are experimental and colored curves are from CREASE. Reproduced from ref. 191 with permission from the American Chemical Society, copyright 2023. Both figures licensed under CC-BY-NC-ND 4.0.
ML algorithms have been used to predict polymer-related properties that have eluded more traditional modeling techniques, such as polymer solubility,30,202–204 glass-transition temperatures,205,206 dielectric constants,207–209 and more. Several of these will be highlighted in the showcase studies section of this Review. However, the broad range of available model architectures necessitates careful decision-making and awareness of the underlying data. For example, Tg prediction is a common task for polymer ML models. Researchers have had success on this task using a wide variety of models and datasets,205,206,210 but a comprehensive benchmark study by Tao et al. demonstrated that for a given task and dataset, not all ML architectures and modeling decisions are created equal.153 As ML modeling becomes increasingly integrated into polymer science, we emphasize that making ML models is easy, but making good ML models is challenging.
By allowing AL models to suggest experiments that will further improve model performance, human intervention can be minimized in the DBTL loop. Letting integrated systems run autonomously can improve throughput59 and consistency. The addition of AL methods can also improve data efficiency by selecting statistically optimized experiments. The ability of AL to intelligently design and implement experiments is the final component to enabling truly closed-loop automated workflows that are suggested by HT experimental systems. Several recent works using SDLs for applied polymers have focused on multi-objective optimization of several polymer nanoparticle properties such as maximum monomer conversion and minimized dispersity,214 optimization of electronic thin film processing conditions (Fig. 8),24,28 and design of copolymers for enzyme stabilization.215 While AL methods can be built from the ground up, workflows have also been democratized to make using these tools more accessible than ever as in the case with the gpCAM library that is both user-friendly for new practitioners of AL and robust to more demanding needs for advanced users.216
Fig. 8 Demonstration of robotic arm integration in HT synthesis and fabrication of electronic thin films with ML-assisted workflow. Adapted from ref. 24 with permission from Springer Nature, copyright 2025, licensed under CC-BY 4.0.
Recent work by Xu et al. presented an SDL that optimizes the lower critical solution temperature (LCST) of poly(N-isopropylacrylamide) (PNIPAM).217 PNIPAM exhibits a sharp and reversible phase transition near physiological temperature which makes it an ideal candidate for stimuli-responsive biomaterials applicable as drug delivery devices, artificial tissue scaffolds, and biosensors. The authors built a low-cost, modular platform that integrates robotic liquid-handling for precise formulation with BO that uses a GPR surrogate model. The system prepared many solution compositions in parallel, then measured cloud point transitions with automated optics and fed the results to the optimizer. Within only a few cycles, the platform converged on user-defined LCST targets in both two-salt and three-salt solution systems while making efficient use of experiments. The study showed how closed-loop HT/ML experimentation can be used to control a thermal transition with direct bearing on biomedical function. It also provides an accessible blueprint for autonomous experimentation that can be adapted to other thermoresponsive polymers where a specific transition window is required for device performance.
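A closed loop of this kind can be sketched in a few lines. The example below is not the authors' implementation: it assumes a single normalized composition variable, a made-up cloud-point response standing in for the automated measurement, a Gaussian process with fixed hyperparameters, and a simple confidence-bound acquisition that favors compositions predicted to lie near the target while still uncertain. Every name and number is a hypothetical placeholder.

```python
import numpy as np

rng = np.random.default_rng(0)
TARGET_C = 27.0                          # user-defined cloud-point target (assumed)
SIGNAL_SD, LENGTH, NOISE_SD = 5.0, 0.3, 0.05
candidates = np.linspace(0.0, 1.0, 101)  # hypothetical normalized salt fraction


def measure_cloud_point(x):
    """Synthetic stand-in for the automated optical measurement (assumed response)."""
    return 32.0 - 14.0 * x + 4.0 * x**2 + rng.normal(scale=NOISE_SD)


def rbf(a, b):
    """Squared-exponential kernel between two 1D arrays of compositions."""
    return SIGNAL_SD**2 * np.exp(-0.5 * ((a[:, None] - b[None, :]) / LENGTH) ** 2)


def gp_posterior(x_obs, y_obs, x_new):
    """Posterior mean and standard deviation of a GP with a constant prior mean."""
    prior_mean = y_obs.mean()
    K = rbf(x_obs, x_obs) + (NOISE_SD**2 + 1e-8) * np.eye(len(x_obs))
    Ks = rbf(x_new, x_obs)
    mu = prior_mean + Ks @ np.linalg.solve(K, y_obs - prior_mean)
    var = SIGNAL_SD**2 - np.sum(Ks * np.linalg.solve(K, Ks.T).T, axis=1)
    return mu, np.sqrt(np.clip(var, 0.0, None))


# Seed the loop with three spread-out compositions.
measured = [0, 50, 100]
x_obs = candidates[measured]
y_obs = np.array([measure_cloud_point(x) for x in x_obs])

for _ in range(10):
    mu, sd = gp_posterior(x_obs, y_obs, candidates)
    acquisition = np.abs(mu - TARGET_C) - 1.0 * sd  # lower is more promising
    acquisition[measured] = np.inf                  # do not repeat a composition
    nxt = int(np.argmin(acquisition))
    measured.append(nxt)
    x_obs = np.append(x_obs, candidates[nxt])
    y_obs = np.append(y_obs, measure_cloud_point(candidates[nxt]))

best = int(np.argmin(np.abs(y_obs - TARGET_C)))
best_x, best_T = float(x_obs[best]), float(y_obs[best])
print(f"closest composition: x = {best_x:.2f}, cloud point = {best_T:.2f} C")
```

Extending the sketch to two- or three-salt systems would mean replacing the 1D candidate grid with a grid or sample over several composition variables. Hyperparameters would normally be fitted to the data rather than fixed.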
The Soft Materials Research and Technology (SMART) Lab at the University of Pennsylvania has used HT/ML to fabricate microfluidic double emulsion droplets. These polymer droplets are biomimetic with encapsulating layers allowing for protection of inner materials while external structure allows for integration into cells. These materials are commonly recognized to have potential for drug delivery applications; however, these cells require precision manufacturing, subject to a wide design space and highly sensitive fabrication and processing. In medical grade materials, even slight imperfections can lead to fatal consequences. O'Callaghan et al. utilized microscale flow chemistry to drastically limit batch-to-batch variability of polymer-based protocells, reporting the consistent formation of microcapsules as small as 14 microns – the smallest to date at that time.218 More recently, the SMART laboratory has utilized the ML-empowered Automated Double Emulsion Droplet Library generator (ADLib) to evaluate microdroplets in real-time with ML.219 They used ML in the form of automated control algorithms allowing for real-time response to feedback on droplet formation. This allowed the ML to modify flow rates and other factors in response to environmental changes to achieve uniform microdroplet fabrication at nearly 6 drops per second, a degree of control and throughput that would be unattainable without ML-based automation.
:
acceptor ratios to allow combinatorial design space exploration. This platform has demonstrated the ability to process over 600 devices across dozens of processing conditions while using less than 50 mg of material.221 Manually fabricating so many devices would be extremely time-consuming and potentially error-prone, but with their HT method they were able to rapidly traverse the design space. Notably, not all HT techniques require ML – more traditional statistical tests can also be effective. In this case, they used one-way ANOVA to determine that solvent choice was the dominant factor affecting device efficiency. They produced blade-coated OPVs with efficiency ranging from 0.08% to 6.43%, with the best candidate demonstrating efficiency up to 14% when optimally fabricated, approaching state-of-the-art at time of publishing.222 Finally, their large parameter space allowed the creation of performance maps in Hansen solubility space, which they identified as being an effective and computationally inexpensive way of identifying good solvents for similar systems.
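The kind of one-way ANOVA comparison mentioned above can be sketched as follows. The example assumes entirely synthetic device efficiencies in which a hypothetical solvent effect is large and a hypothetical annealing effect is small, and computes the F statistic directly from its definition. None of the numbers are from the cited study. In practice, `scipy.stats.f_oneway` returns the same statistic together with a p-value.

```python
import numpy as np

rng = np.random.default_rng(7)


def f_statistic(groups):
    """One-way ANOVA F statistic: between-group mean square over within-group mean square."""
    all_values = np.concatenate(groups)
    grand_mean = all_values.mean()
    ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
    ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)


# Synthetic design (assumed): 3 solvents x 3 annealing settings x 10 replicate devices.
solvent_effect = np.array([2.0, 4.0, 6.0])  # hypothetical mean efficiencies (%)
anneal_effect = np.array([0.0, 0.2, 0.4])   # hypothetical small offsets (%)
s_grid, a_grid = np.meshgrid(np.arange(3), np.arange(3), indexing="ij")
solvent_id = np.repeat(s_grid.ravel(), 10)
anneal_id = np.repeat(a_grid.ravel(), 10)
efficiency = (solvent_effect[solvent_id] + anneal_effect[anneal_id]
              + rng.normal(scale=0.5, size=solvent_id.size))

# Test each factor separately by grouping the same devices in two different ways.
F_solvent = f_statistic([efficiency[solvent_id == s] for s in range(3)])
F_anneal = f_statistic([efficiency[anneal_id == a] for a in range(3)])
print(f"F (solvent) = {F_solvent:.1f}, F (annealing) = {F_anneal:.2f}")
```

A much larger F statistic for one factor than for the others is what would identify it as the dominant influence on the measured response in this synthetic setting.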
In another gradient-based OPV study, roll-to-roll slot-die coating for even more continuous deposition and therefore more HT sampling was used to generate more than 2200 OPVs from in situ blend formulations over a large range of film thickness, acceptor:acceptor, and donor:acceptor ratios. The ML models trained on the resulting dataset identified high-performance compositions that yielded OPVs with 10.2% power conversion efficiency.223 Moreover, by integrating ML in-line with automation, SDL workflows with BO have demonstrated the ability to sample over 2000 different quaternary active layers in 7 days with significantly reduced material consumption (Fig. 9).220 These studies highlight the power of HT techniques for the discovery and optimization of polymer electronics.
Fig. 9 Self-driving lab for the fabrication of polymer thin films. Reproduced from ref. 220 with permission from Wiley-VCH GmbH, copyright 2020, licensed under CC-BY 4.0.
Similarly, rational design of biodegradable PLA derivatives has been accelerated through the use of ML-assisted characterization and design optimization methods,224 where BO can be used to drastically improve search efficiency. Biodegradation of polymers has also been studied in other contexts, with an ML model from Lin et al. reaching a predictive R² of 0.66 for the degradability of polymers in aqueous environments.71
000 epoxy candidates. Hu and colleagues then synthesized a top-ranked resin that showed superior modulus and elongation compared with existing materials. This integration of modeling and focused validation is important for structural thermosets used in coatings, aerospace composites, and electronics where brittle failure often limits adoption of the material. Their method demonstrated how ML can narrow the search space to a small set of high-value formulations that are likely to meet stringent mechanical specifications.
In another example, Jain et al. applied an AL framework to acrylate photopolymers for additive manufacturing.29 The study explored a ternary monomer space that combined rigid monomers, elastomeric monomers, and a crosslinker. The authors used GPR with a two-stage hierarchical model to predict Young's modulus, tensile strength, ultimate strain, and hardness. They then used noisy expected hypervolume improvement, a multi-objective extension of the expected improvement acquisition function commonly used in BO, to recommend the next experimental batch. Each cycle involved automated resin preparation, UV curing, and mechanical testing, forming a closed-loop. The system rapidly converged on modulus values within 10% of the targets. The selected formulations were validated by fabricating a multimaterial printed structure that combined stiff regions and soft regions, shown in Fig. 10. This work highlights the feasibility of HT/ML for tuning mechanical windows of photocurable systems that are important in aerospace fixtures, automotive parts, and custom medical devices.
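As a rough illustration of how an acquisition function ranks candidate experiments, the sketch below implements the standard single-objective expected improvement in closed form. It is not the noisy expected hypervolume improvement used by Jain et al.; it assumes a Gaussian posterior (as a GPR model would provide), a maximization target, and hypothetical candidate names and values.

```python
import math


def normal_pdf(z: float) -> float:
    """Standard normal probability density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def normal_cdf(z: float) -> float:
    """Standard normal cumulative distribution."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def expected_improvement(mean: float, std: float, best: float) -> float:
    """Expected improvement over `best` for a maximization problem,
    given a Gaussian posterior with the stated mean and standard deviation."""
    if std <= 0.0:
        # No uncertainty: improvement is deterministic
        return max(mean - best, 0.0)
    z = (mean - best) / std
    return (mean - best) * normal_cdf(z) + std * normal_pdf(z)


# Hypothetical posterior (mean, std) for three candidate formulations
candidates = {"A": (1.00, 0.05), "B": (0.90, 0.40), "C": (1.10, 0.01)}
best_observed = 1.05  # hypothetical best measured value so far

scores = {
    name: expected_improvement(mean, std, best_observed)
    for name, (mean, std) in candidates.items()
}
next_candidate = max(scores, key=scores.get)
```

In this toy case the candidate with the lowest predicted mean but the largest uncertainty scores highest, which shows how expected improvement balances exploitation of good predictions against exploration of poorly characterized regions.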
Fig. 10 AL can quickly converge on target properties like Young's modulus values of photopolymers in this multi-material print. Reproduced from ref. 29 with permission from the American Chemical Society, copyright 2024, licensed under CC-BY 4.0.
In one example, Amrihesari et al. used the Crystal16 parallel crystallizer, a “medium-throughput” automated system, to measure polymer solubility.203 The authors generated a high-quality dataset using turbidity measurements of 30 polymers and 45 solvents across multiple concentrations and temperatures. ML models, including random forest, neural networks, and extreme gradient boosting (XGBoost) regressors, were trained on this dataset to predict transmission-temperature profiles, focusing on the transmission values during the cooling phase, which are indicative of solubility behavior. Among the tested approaches, XGBoost showed the best overall performance, achieving an R² of 0.98 and RMSE of 6% for predicting percent transmission. Notably, the study extended solubility prediction beyond binary classifications (i.e., soluble/insoluble) by incorporating a “partially soluble” category, thereby improving the practical relevance of predictions for industrial applications such as paint and coating formulation, membrane production, and pharmaceutical development. Although the partial solubility category proved more challenging due to data scarcity, targeted data augmentation significantly improved predictive reliability. This work showed how carefully curated experimental data, when coupled with ML, can accelerate solubility mapping, reduce trial-and-error experimentation, and provide actionable guidance for formulation and process design.
A study by Kern et al. addressed a long-standing bottleneck in polymer processing, namely choosing a solvent that dissolves a given polymer at room temperature, by replacing limited Hildebrand/Hansen heuristics with data-driven prediction.202 The authors assembled a curated dataset (3373 polymers, 51 solvents; 11 913 soluble and 8843 insoluble pairs) plus an external set (2909 polymers, 7 solvents) and incorporated structural fingerprints for both polymers and solvents in an attempt to get models to learn physicochemical relationships that might improve generalization beyond the fixed list of solvents used for model training. The work benchmarked a deep neural network (SolNet2) and a random forest classifier using F1 score under three cross-validation regimes: random, polymer-split (unseen polymers), and solvent-split (unseen solvents). Random forests consistently outperformed the neural net, especially on solvent-split. For their dataset, the incorporation of structural fingerprints did not significantly improve model generalization to unseen solvents. Using Uniform Manifold Approximation and Projection (UMAP) for dimensionality reduction and leave-one-out analysis, they attributed the modest generalization to limited solvent diversity. On the hold-out set, class imbalance and novel chemistries further stressed the models, reinforcing the need for broader, more diverse solvent data and for recording experimental factors (molecular weight, concentration, temperature, partial solubility). Overall, the study established a transparent baseline, offered generalizable solvent fingerprints, and demonstrated a robust ML approach that can be progressively enhanced through the expansion of community-contributed datasets. Further expanding and diversifying polymer–solvent datasets and integrating HT techniques with advanced, interpretable ML models can accelerate polymer solubility prediction and the development of advanced polymeric systems.
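The polymer-split and solvent-split regimes described above differ from a random split in that every record sharing a polymer (or solvent) is kept in the same fold, so the test fold contains only unseen groups. A minimal sketch of such a grouped split follows. It is not the authors' implementation, and the function names and toy records are hypothetical.

```python
from collections import defaultdict


def grouped_folds(records, key_index, n_folds):
    """Assign record indices to folds so that all records sharing the same
    group key (e.g. polymer ID or solvent ID) land in exactly one fold."""
    groups = defaultdict(list)
    for i, record in enumerate(records):
        groups[record[key_index]].append(i)

    folds = [[] for _ in range(n_folds)]
    # Place the largest groups first, each into the currently smallest fold
    for _, indices in sorted(groups.items(), key=lambda kv: (-len(kv[1]), kv[0])):
        smallest = min(range(n_folds), key=lambda f: len(folds[f]))
        folds[smallest].extend(indices)
    return folds


def train_test_indices(folds, test_fold):
    """Return (train, test) index lists with one fold held out for testing."""
    test = sorted(folds[test_fold])
    train = sorted(i for f, fold in enumerate(folds) if f != test_fold for i in fold)
    return train, test


# Hypothetical (polymer, solvent, soluble?) records
records = [
    ("P1", "S1", 1), ("P1", "S2", 0),
    ("P2", "S1", 1), ("P2", "S3", 0),
    ("P3", "S2", 1), ("P3", "S3", 1),
]
polymer_folds = grouped_folds(records, key_index=0, n_folds=3)  # unseen polymers
solvent_folds = grouped_folds(records, key_index=1, n_folds=3)  # unseen solvents
```

Scoring a model on each held-out fold in turn then estimates performance on polymers or solvents it has never seen, which is a stricter test than a random split.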
The most common HT equipment in these studies is the liquid handler. Given the complexities of accurately size-reducing, measuring, and conveying solid reagents, automated equipment for the preparation of solutions with solid reagents remains out of reach for most laboratories. Therefore, liquid-handling robots, pumps, and automated dispensers including spin coaters and 3D printers remain the most common means of accessing HT data acquisition in a typical laboratory setting.
Further, only a couple of studies employ HT techniques throughout the entirety of the workflow. While we have presented many examples of HT synthesis, purification, and characterization techniques, the relatively nascent stage of this research area means many techniques are created or employed in isolation. For example, all OPV papers referenced can perform HT sample creation, but none demonstrated HT substrate cleaning and preparation. At this point in time, it is not uncommon for only one component of synthesis or characterization to be fully HT.
Data is, of course, a motivator for HT methods, and is a driver of improved performance for ML models. With only one exception, the highlighted papers utilize datasets with data points on the order of 10³ or greater, the largest dataset highlighted in these studies having data points on the order of 10⁵. It is common to supplement experimental data with published datasets when the data is available, a practice demonstrated in about half of these studies. Interestingly, most of these papers do not expressly state throughput, even in papers where HT is the selling point. Quantification of throughput, when available, tends to be vague and abstracted, e.g. “almost 2100 [samples]…within 7 days”.220 Otherwise, the judgement of “high throughput” tends to rely on implied domain knowledge (a reasonable rate using non-HT methods) or on sheer magnitude: 100 000 data points necessitate some type of accelerated data acquisition in any system. Several additional examples from literature are shown in Table 1 to demonstrate how reporting can vary within literature even for comparable systems. Without a standardized way to report throughput, it is extremely difficult to compare systems and define benchmarks for considering processes as HT. Therefore, we propose that throughput should be reported in samples per hour. By adopting a standard reporting method, researchers can easily compare throughput across systems.
| No. | System | Throughput as reported |
|---|---|---|
| 1 | PET-RAFT polymerization in multiwell plate | 96 reactions run in parallel in well plates at a reaction volume of 200 μL.235 |
| | | 384 polymerization reactions run in parallel in 384-well plates at a reaction volume of 40 μL.236 |
| 2 | Automated continuous flow platform with inline NMR & online SEC/GPC | Traditional offline GPC systems take 20–40 minutes for analysis; this is reduced to about 12 minutes per sample.237 |
| | | A benchtop NMR spectrometer coupled to the outlet of the flow reactor measures spectra every 15 s.238 |
| | | Automated continuous-flow photopolymerization platform with inline NMR monitoring can collect 120 datapoints in less than 2 hours.92 |
| 3 | Automated polymer synthesis | Systems allow parallelized synthesis of a wide range of polymers with up to 192 reactions.74 |
| | | 392 polymers synthesized and analyzed in under 40 hours using a 96-well plate workflow.98 |
| 4 | Liquid handling high throughput system | Liquid handling system reduces the time required for reagent dispensing by >80% and allows for a fully automated platform for combinatorial polymer chemistry.89 |
| | | 672 unique volume transfers usually require 3 hours of intensive effort; a Hamilton MLSTARlet can do that in 30 min.74 |
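To show what the proposed samples-per-hour convention would look like in practice, the short sketch below converts several of the throughput statements quoted in this section to a common unit. The counts and durations are taken from the text as reported; because the sources use wording such as "almost", "under", and "about", the resulting rates are approximate, and the dictionary labels are our own shorthand.

```python
def samples_per_hour(n_samples: float, duration_hours: float) -> float:
    """Convert a reported sample count and duration into samples per hour."""
    if duration_hours <= 0:
        raise ValueError("duration must be positive")
    return n_samples / duration_hours


# (sample count, duration in hours) as reported in the cited studies
reported = {
    "SDL thin films, almost 2100 samples within 7 days": (2100, 7 * 24),
    "96-well plate workflow, 392 polymers in under 40 h": (392, 40),
    "Inline NMR flow platform, 120 datapoints in under 2 h": (120, 2),
    "Online GPC, about 12 min per sample": (1, 12 / 60),
}

rates = {label: samples_per_hour(n, hours) for label, (n, hours) in reported.items()}

for label, rate in sorted(rates.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{rate:6.1f} samples/h  {label}")
```

Expressed this way, the four examples span roughly an order of magnitude (about 5 to 60 samples per hour), a comparison that is hard to make from the original phrasing. Note that a "sample" means something different in each system, so a shared unit is a starting point rather than a complete benchmark.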
From a modeling standpoint, by far the most common models are the random forest, Gaussian process, and convolutional neural network architectures, with one or more of these or one of their variants appearing in nearly every study highlighted here. Cross-validation and its variants are also common across many of these studies, though error metrics for both training and evaluation remain varied based on the system under exploration.
HT/ML can be daunting to adopt initially, and the ML space is evolving at an extremely rapid pace. However, trends across these studies suggest that incorporation of even some HT methods for repeatable tasks, combined with databases and utilizing even a small number of relatively simple, well-documented models is an extremely viable way of accelerating scientific discovery, and that the barrier to entry may not be as high as it appears at first glance.
Readers interested in practical considerations for setting up and operating SDLs are directed to several recent papers. Day et al. discuss high-level practical considerations for HT workflows, including the fundamentals of experimental design and design space evaluation for HT experiments, as well as synthetic approaches and characterization and analysis pipelines for HT generation of polymer libraries.97 Maroulis et al. provide detailed instructions and software for the build and operation of two different liquid handlers, each costing around $1000. They also cover some useful fundamentals for active learning approaches, and provide validation of their lab-made automated systems.239
Data quality and database design are crucial for successful ML model training, since ML models are only as credible as the data they learn from. This is an outstanding issue for polymer informatics, since ML studies based solely on pre-existing datasets suffer from issues of dataset size, quality, and labeling, furthering the need for workflows that can generate high volumes of data with consistent quality, controlled error, and consistent labels for data and metadata.17,18,20–22 Further, consideration should be given to the storage, labeling, and sharing of negative data as well, since the lack of publications within “unsuccessful” experimental space can lead to re-treading the same ground or biasing of downstream ML models.240 At minimum, every data point should contain: (1) a sample ID with an unambiguous polymer representation, (2) a complete process record, from synthesis through purification, together with processing history, and (3) measurements with raw data files, equipment calibration metadata, replicate count, and uncertainty level. Standardization of data recording is necessary for elevating data quality from “best practice” to infrastructure. Well-defined vocabularies and machine-readable schemas should define everything required for data entry. We created a polymer sample record checklist with essential information for data recording critical to ML studies, shown in Fig. 11. We also provide an example of a polymer sample record using our checklist in the SI. However, the investment required for full standardization is unrealistic for individual groups; therefore, the field needs centralized efforts from organizations such as the International Union of Pure and Applied Chemistry (IUPAC) and chemical societies in developing community-maintained ontologies, validators, and ingest pipelines and frameworks. Other disciplines have shown the payoff of such efforts. For example, the Crystallographic Information File (CIF),241 developed by the International Union of Crystallography, is a machine-readable text format for representing crystallographic information. CIF enables interoperability across software and archives, and tools like checkCIF support automated validation.242
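The three minimum elements listed above lend themselves to a machine-readable record with an automated completeness check, loosely analogous in spirit to checkCIF. The sketch below is one possible encoding, not the checklist from Fig. 11 or the SI; all class names, field names, and example values are hypothetical placeholders.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Measurement:
    property_name: str
    value: float
    unit: str
    raw_data_file: str            # path or identifier of the raw data file
    calibration_metadata: str     # e.g. calibration standard and date
    replicate_count: int
    uncertainty: Optional[float]  # same unit as value; None if not recorded


@dataclass
class PolymerSampleRecord:
    sample_id: str
    polymer_representation: str   # unambiguous encoding, e.g. a SMILES-type string
    process_record: List[str] = field(default_factory=list)  # synthesis to processing
    measurements: List[Measurement] = field(default_factory=list)


def missing_items(record: PolymerSampleRecord) -> List[str]:
    """List the minimum required items that are absent from a record."""
    problems = []
    if not record.sample_id.strip():
        problems.append("sample ID")
    if not record.polymer_representation.strip():
        problems.append("polymer representation")
    if not record.process_record:
        problems.append("process record")
    if not record.measurements:
        problems.append("measurements")
    for m in record.measurements:
        if not m.raw_data_file.strip():
            problems.append(f"{m.property_name}: raw data file")
        if not m.calibration_metadata.strip():
            problems.append(f"{m.property_name}: calibration metadata")
        if m.replicate_count < 1:
            problems.append(f"{m.property_name}: replicate count")
        if m.uncertainty is None:
            problems.append(f"{m.property_name}: uncertainty")
    return problems


# Hypothetical example record with placeholder values
complete = PolymerSampleRecord(
    sample_id="PS-0001",
    polymer_representation="*CC(*)c1ccccc1",
    process_record=["synthesis: placeholder", "purification: placeholder"],
    measurements=[
        Measurement("Tg", 100.0, "degC", "raw/PS-0001_dsc.csv",
                    "placeholder calibration note", 3, 1.5)
    ],
)
incomplete = PolymerSampleRecord(sample_id="PS-0002", polymer_representation="")
```

Running such a check at data entry, rather than at analysis time, is one way to make missing metadata visible before it propagates into a training set.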
Fig. 11 Abbreviated polymer sample record checklist: some data that should be recorded for every polymer sample (see SI for complete list and example).
For HT hardware, the ecosystem spans fully integrated robotic platforms to do-it-yourself (DIY) systems, with corresponding trade-offs in capability, flexibility, and cost. At the high end, installations that combine automated liquid-handling, sample storage, and in-line characterization typically reach the million-dollar scale, offering reliability, vendor support, and safety features but often locking users into proprietary software and constrained workflows. Mid-range, modular systems allow users to customize liquid handlers, imaging stations, and basic analytical tools à la carte, balancing cost against configurability. Entry-level commercial liquid handlers capable of basic dispensing and simple workflows are available on the order of ∼$15 000 at time of publication, lowering the barrier for adoption in individual labs. At the other extreme, open-source designs and 3D-printed parts enable DIY construction of custom liquid-handling and peripheral modules (e.g., camera, sample grabber, etc.) at a fraction of commercial costs, with the added benefit of full control over software and code. These DIY routes, however, transfer the burden from capital expenditure to personnel time and technical expertise in mechanical assembly, electronics, and software integration, and they introduce variability in reliability, safety, and documentation. Across all price-points, lab-scale automation still has major problems with solid-handling and working with heterogeneous mixtures, in addition to lacking the dexterity required for complex transfers, sample preparation, and other common experimental routines.240
Given this landscape, field-level coordination led by IUPAC and chemical societies such as the Royal Society of Chemistry, American Chemical Society, and Chinese Chemical Society is essential. Repositories developed and maintained by the research community for hardware designs, materials checklists, calibration procedures, and control software are needed. Analogous to open libraries for 3D-print files, such repositories would accelerate reproducibility and reduce duplicated engineering effort. Community-level demonstration237 and documentation239 efforts can further lower the barrier. Because not all individual research groups can afford high-end platforms, shared user facilities, national laboratories, and so-called “cloud labs”243,244 are well-positioned to host complex HT experimental instrumentation with trained staff. These facilities have numerous demonstrations of HT/ML workflows,245–248 and are actively developing tools and workflows to make HT/ML experimentation even more accessible249 in addition to developing open-source HT tools for use by the community.250,251 Improving access to these facilities will enable researchers to run tests on local, lower-cost systems and execute demanding tasks on centralized infrastructure. This hybrid model, where a local modular system is used for prototyping work, with community standards for interoperability and centralized facilities for complex projects, offers a pragmatic path toward broadening access and reducing the time from concept to results.
Experimental designs should prioritize sampling strategies and allocate for replicates to ensure reproducibility and efficiency within database and hardware constraints. HT experiments can generate data points rapidly, but each data point is associated with a real cost. In our anecdotal experience, a typical polymer sample, from synthesis through processing to characterization, often costs tens of dollars. Scaling to 1000 samples, the total cost rapidly reaches tens of thousands of dollars. Consideration should therefore be given to the cataloging and sharing of both positive and negative data240 to reduce net costs while accelerating output. Additionally, when selecting experimental designs or ML datasets, researchers should remain cognizant that ML models inherently struggle with out-of-distribution predictions (i.e. extrapolation).158,252 In parallel, AL approaches should be leveraged to assist in decision-making and selecting the next experiments. Deliberate planning of experiments, combined with AL, will shorten the polymer design cycle, leading to more cost-efficient information gain and faster convergence to targeted materials properties and performance.
In sum, the promise of HT/ML for applied polymers will be strengthened by a research community that builds together. Shared standards, open-source infrastructure, and collaborative user facilities can turn isolated knowledge gain into an accessible and interoperable knowledge base. As datasets and code become truly reusable, the field can explore mechanisms and principles that unify across chemistries and scales. With an aligned purpose for sharing, HT/ML can mature into an integrative, powerful engine for polymer design and innovation, benefiting society at large, from precision healthcare and environmental remediation to sustainable, circular materials economies.
Acquisition function: a function used in active learning to determine which data point or experiment should be selected for evaluation next.
Active learning: a machine learning strategy where the model selects the most promising new data points to acquire next. This can be based on learning the experimental space as efficiently as possible, optimizing toward a target, or balancing those objectives simultaneously.
Architecture: the structural design of a model, defining how components are arranged and connected. For example, the architecture of a neural network is described in terms of how many layers and how many nodes per layer it has, while the architecture of a random forest is described in terms of how many trees with how many branches and of what depth it has.
Automated: part or parts of the process are performed automatically by robotics or computer-driven systems. This can include autosamplers, automated GPC, or other systems that can perform a repetitive, predetermined task.
Autonomous: refers to systems that perform an automated task in response to some input. For example, an instrument with in-line monitoring that can adjust its settings to maintain a target output or property.
Bayesian optimization: an active learning strategy for iterative modeling that uses past results to efficiently select promising new combinations. BO can be used for hyperparameter optimization (as opposed to random or grid search), or for performing active learning over experimental variables.
Classification: a model that assigns inputs to discrete, predefined categories or classes. For example, soluble vs. insoluble (binary classification), block copolymer morphology (multi-class classification).
Closed-loop workflow: a workflow where modeling outputs, especially from an active learning model, are fed back as inputs to guide subsequent actions and experiments.
Clustering: a model that groups data points based on similarity without predefined labels, often used for identifying higher-order relationships between data points, and frequently paired with dimensionality-reduction methods for visualization.
Coefficient of determination (R²): a regression performance metric that quantifies the fraction of the variation in the observed data that is explained by the model predictions.
Convolutional neural network: a type of neural network designed to learn spatial or local patterns in structured data, commonly used for images.
Cross-validation: a general framework for estimating model performance by repeatedly training and evaluating a model on different data splits.
Design space: the range of input variables or experimental conditions over which a system is explored.
Data cleaning: the process of identifying and correcting errors, inconsistencies, or missing values in a dataset.
Data scaling: rescaling numerical features to comparable ranges (e.g., normalization or standardization) to improve model performance. Some model architectures like neural networks are sensitive to magnitude and features like molecular weight will dominate model predictions over features like concentration unless scaling is applied.
Data transforming: applying mathematical operations to data (e.g., logarithms, encodings) to improve interpretability or modeling behavior.
Data splitting: dividing data into subsets (e.g. training, validation, test) for model training and evaluation.
Descriptor: a numerical feature that encodes some property of a material, molecule, or system for use as an input. ML models do not take inputs like text describing monomer composition, so the numerical representation of that text would be the monomer's descriptor. Sometimes used synonymously with “feature”.
Expected improvement: an acquisition function for active learning that selects new data points for evaluation based on the expected gain over the current best observed result.
Fingerprint: a fixed-length numerical vector that encodes structural or compositional information, commonly used for molecules or polymers.
F1 score: a classification performance metric that is the harmonic mean of precision (true positives/[true positives + false positives]) and recall (true positives/[true positives + false negatives]).
Generalization: a model's ability to make accurate predictions on previously unseen data, i.e., how useful a model trained on a specific set of data remains on related but different data.
Gaussian process regression: a non-parametric regression method that models predictions as probability distributions. Used especially in active learning due to inherently providing both predictions and uncertainty estimates.
Graph neural network: a neural network architecture that learns spatial or local patterns in graph-structured data by propagating information between connected nodes, commonly used for chemical structures.
Grid search: a hyperparameter optimization strategy that exhaustively evaluates combinations of predefined parameter values. More systematic than random search but may miss optimal combinations that were not predefined.
High-dimensional: describes data with a large number of features relative to the number of observations, often complicating analysis and modeling since ML models can easily overfit to noise or redundant signals. Modeling dimensions can be thought of as synonymous with inputs, e.g., a model that predicts elastic modulus based on molecular weight, polymer identity, density, and temperature is 4-dimensional.
Holdout set: a portion of data reserved only for final model evaluation; see Test set.
Hyperparameter: a model setting chosen before training (e.g., learning rate, kernel type, number of nodes and layers) that controls model behavior but is not learned from the data.
k-Fold cross-validation: a cross-validation method where data are split into k subsets (2 ≤ k ≤ N with N being the total number of data points); each subset is used once as validation while the others are used for training. See also cross-validation.
Leave-one-out analysis: a form of k-fold cross-validation where k is equal to the number of data points.
Loss function: a quantitative measure of model error used for training. Common examples include mean absolute error and root mean squared error (for regression), and log loss (classification).
Machine learning: a class of computational methods that “learn” patterns from data to make predictions or decisions without being explicitly programmed with rules.
Mean absolute error: a regression loss function that is the arithmetic mean of the absolute differences between observed values and predictions. MAE excels at simplicity, interpretability, and explainability.
Model: a mathematical object that maps input data (features/variables) to outputs (predictions) based on learned parameters.
Modular platform: an experimental system composed of interchangeable, automated/automatable components that can be independently modified or reconfigured based on need.
Neural network: a machine learning model composed of interconnected layers of computational nodes that learn relationships between inputs and outputs.
Non-parametric model: models that do not assume a fixed functional form (e.g., linear or exponential) and whose complexity can grow with the amount of available data. These models are often treated as “black boxes” because their internal behavior is not easily interpretable.
One-way ANOVA (analysis of variance): a statistical test used to determine whether the means of three or more groups differ significantly based on a single input variable.
Orthogonality: the degree to which variables vary independently of one another, such that the effect of each variable can be distinguished.
Polymer representation: a numerical or symbolic encoding of polymer chemistry and structure (e.g., SMILES string) used as an input to computational models. Descriptors and fingerprints are derived from polymer representations.
Random forest: a machine learning method that combines predictions from many decision trees trained on different subsets of the data. The ensemble of trees over a single, complex decision tree tends to improve accuracy and robustness.
Random search: a hyperparameter optimization strategy that samples parameter values randomly from defined ranges. Less structured than grid search but stochasticity may allow it to find unintuitive combinations missed by grid search.
Regression: a model that predicts a continuous numerical value. For example, glass transition temperature, modulus, percent conversion.
Roll-to-roll: a continuous manufacturing process in which material is processed as it moves from one roll to another, enabling high throughput and scalable production.
Root mean squared error: a regression loss function that is the quadratic mean of the differences between observed values and predictions. High sensitivity to outliers makes RMSE useful when punishing outliers is a higher priority.
Self-driving: refers to systems that have automated and autonomous elements, but that also engage in decision-making. That is, autonomous and closed-loop systems.
Stratification: a data-splitting strategy that preserves the distribution of key variables (e.g., class labels) across subsets. For example, a solubility classification model trained on 90% insoluble examples is unlikely to perform well on a test set that contains 90% soluble examples.
Test set: a subset of data not used during model training or tuning, used solely to assess final model performance.
Training set: the subset of data used to fit model parameters, what the model “learns” patterns from.
Uniform manifold approximation and projection (UMAP): a dimensionality reduction technique that projects high-dimensional data into a lower-dimensional space while attempting to preserve local structure from the high-dimensional space.
XGBoost (extreme gradient boosting): a machine learning method that uses gradient boosting to build an ensemble of decision trees sequentially, potentially improving prediction performance.
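The performance metrics defined in this glossary (mean absolute error, root mean squared error, R², and F1 score) can be written out in a few lines each. The sketch below uses plain Python and hypothetical toy values chosen so that a single large error makes the difference between MAE and RMSE visible; in practice, an established library implementation would normally be used.

```python
import math


def mean_absolute_error(y_true, y_pred):
    """Arithmetic mean of the absolute differences."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def root_mean_squared_error(y_true, y_pred):
    """Quadratic mean of the differences; penalizes large errors more heavily."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def r_squared(y_true, y_pred):
    """Fraction of the variance in y_true explained by y_pred."""
    mean_true = sum(y_true) / len(y_true)
    ss_residual = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_total = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_residual / ss_total


def f1_score(labels_true, labels_pred, positive=1):
    """Harmonic mean of precision and recall for the positive class."""
    pairs = list(zip(labels_true, labels_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical regression example: three exact predictions and one large error
observed = [1.0, 2.0, 3.0, 4.0]
predicted = [1.0, 2.0, 3.0, 6.0]
mae = mean_absolute_error(observed, predicted)
rmse = root_mean_squared_error(observed, predicted)
r2 = r_squared(observed, predicted)

# Hypothetical classification example (1 = soluble, 0 = insoluble)
f1 = f1_score([1, 1, 1, 0, 0], [1, 1, 0, 1, 0])
```

With one outlying prediction, RMSE comes out larger than MAE for the same data, which reflects the outlier sensitivity noted in the RMSE entry.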
Supplementary information (SI) is available. Polymer sample record checklist and example. See DOI: https://doi.org/10.1039/d5lp00380f.
Footnote
† These authors contributed equally.
This journal is © The Royal Society of Chemistry 2026