Jason L. Wu ab, David M. Friday ab, Changhyun Hwang bc, Seungjoo Yi bd, Tiara C. Torres-Flores bc, Martin D. Burke abefg, Ying Diao abc, Charles M. Schroeder abcd and Nicholas E. Jackson *ab
aDepartment of Chemistry, University of Illinois at Urbana-Champaign, Urbana, Illinois 61801, USA. E-mail: jacksonn@illinois.edu
bMolecule Maker Lab, Beckman Institute for Advanced Science and Technology, University of Illinois at Urbana-Champaign, Urbana, Illinois, USA
cDepartment of Chemical and Biomolecular Engineering, University of Illinois at Urbana-Champaign, Urbana, Illinois 61801, USA
dDepartment of Materials Science and Engineering, University of Illinois at Urbana-Champaign, Urbana, Illinois 61801, USA
eCarle R. Woese Institute for Genomic Biology, University of Illinois at Urbana-Champaign, Urbana, Illinois, USA
fDepartment of Biochemistry, University of Illinois at Urbana-Champaign, Urbana, Illinois, USA
gCarle Illinois College of Medicine, University of Illinois at Urbana-Champaign, Urbana, Illinois, USA
First published on 19th November 2025
Machine learning (ML) is increasingly central to chemical discovery, yet most efforts remain confined to distributed and isolated research groups, limiting external validation and community engagement. Here, we introduce a generalizable mode of scientific outreach that couples a published study to a community-engaged test set, enabling post-publication evaluation by the broader ML community. This approach is demonstrated using a prior study on AI-guided discovery of photostable light-harvesting small molecules. After publishing an experimental dataset and in-house ML models, we leveraged automated block chemistry to synthesize nine additional light-harvesting molecules to serve as a blinded community test set. We then hosted an open Kaggle competition where we challenged the world community to outperform our best in-house predictive photostability model. In only one month, this competition received >700 submissions, including several innovative strategies that improved upon our previously published results. Given the success of this competition, we propose community-engaged test sets as a blueprint for post-publication benchmarking that democratizes access to high-quality experimental data, encourages innovative scientific engagement, and strengthens cross-disciplinary collaboration in the chemical sciences.
Despite the technological momentum of ML methods, there is a growing awareness that the scientific community must develop more inclusive and engaging ways of connecting with the public. ML presents a rare opportunity: its widespread accessibility, intuitive appeal, and applicability across disciplines make it an ideal entry point for engaging students, hobbyists, and educators alike. Public enthusiasm for ML is high, yet structured avenues for meaningful participation in the chemical sciences are limited. Simultaneously, the need for effective scientific communication has never been more urgent in a global landscape shaped by rapid technological change and distrust of expertise; the ability to convey the significance and impact of scientific discoveries is critical. Traditional modes of scientific dissemination, such as peer-reviewed publications and conference presentations, often fall short in reaching broad audiences. Taken together, these considerations motivate the need for innovative frameworks that not only explain research outcomes, but actively invite participation, deepen trust, and demonstrate the relevance of scientific work to societal challenges.
In this work, we introduce a new model of scientific community engagement by directly interfacing experimental chemistry with community-engaged test sets for ML. Building on a prior study of small-molecule photostability, we leveraged automated block chemistry to construct a blinded test set consisting of newly synthesized compounds and hosted a public Kaggle competition in which participants predicted degradation properties of the new molecules using open-source tools and training data from the prior study (Fig. 1). This approach, referred to as community-engaged test sets, invites broad participation in post-publication model validation while fostering dialogue between experimental chemistry and the global community. Here, we describe the design, execution, and outcomes of our approach, and propose this model as a scalable framework for democratizing access to data, amplifying scientific visibility, and accelerating innovation in chemistry.
Fig. 1 Schematic of the community-engaged test set paradigm. During our previous study of light-harvesting small molecules, we performed in-house ML to predict photostability of our dataset.9 After publishing our results, we hosted a global hackathon for photostability prediction by synthesizing and characterizing an additional community test set. Participants were provided with our published dataset (42 molecules) to train their ML models, which were then evaluated on the unseen community test set photostability values (7 molecules).
Hackathons, events in which computer scientists collaboratively build projects over a short, intense period, have become a low-cost, high-reward outreach strategy.18 Recent competitions such as Nomad2018 Predicting Transparent Conductors,19 Novozymes Enzyme Stability Prediction,20 Predicting Molecular Properties,21 and Bristol-Myers Squibb – Molecular Translation22 underscore the growing intersection of ML and chemistry. A further success is the biennial Critical Assessment of Protein Structure Prediction (CASP) competition,23 in which hundreds of research groups attempt to predict the three-dimensional structures of a variety of proteins from their amino acid sequences alone. Notably, the winners of CASP in 2018 and 2020 were AlphaFold and AlphaFold2, respectively, demonstrating the impact of hosting worldwide ML competitions.24 Given the success of the global hackathon strategy in engaging participants and catalyzing scientific breakthroughs, we envisioned hosting our own hackathon based on our relatively small experimental photostability dataset. Unlike previous competitions, ours would be the first to probe structure–function relationships of small organic molecules, a fundamental challenge in both academia and industry. The absence of such competitions in the organic chemistry domain is largely attributable to the lack of modular synthetic methods and automated characterization methods. The emergence of automatable, AI-friendly block chemistry12,16,25–29 and advances in automated characterization have opened the door to launching such competitions.
A potential challenge in hosting a hackathon using only the photostability dataset from the prior study is the lack of hidden data on which to evaluate public ML models; any models trained on the published data would be prone to overfitting. We therefore created a new test set specifically intended for evaluating community ML models, enabling a global ML competition for molecular photostability. To this end, we leveraged automated block chemistry to prepare nine additional light-harvesting small molecules bearing aryl and heteroaryl moieties and measured their photostability exactly as reported in the previous campaign (Fig. 3, see SI for synthesis details). The nine new molecules exhibited a broad range of T80 values (Fig. 3b). The ester-bearing molecule (H) and the extended bipyridine (I) were added to the training set to balance chemical diversity across the train/test split, leaving the remaining seven molecules to comprise the community-engaged test set. Note that four light-harvesting molecules from the initial campaign were omitted from the dataset because their T80 values were too low to characterize.
We ran the competition between March 24, 2025 and April 26, 2025. At its conclusion, we had received a total of 729 submissions from 522 entrants. It is important to note that our competition offered only $150 in total prizes (compared to other competitions that award up to $50,000), yet we garnered hundreds of participants in only one month. These outcomes suggest strong community interest in participating in the scientific discovery process, even when engaging a small but practically important chemistry dataset.
Among the many excellent submissions, we highlight a few of the highest-performing entries. The top model, trained by a user who wished to remain anonymous, predicted the photostability of the test set with an MSLE of 1.026 (Fig. 5a). This user first generated every available RDKit descriptor to supplement the 144 features we provided. They then used the SelectFromModel class of scikit-learn to select the top 35 features when training on log(T80) rather than T80. Finally, they found that an SVR with a linear kernel predicted T80 with the lowest MSLE. Unlike our four-feature model, their model included the feature "fr_pyridine", the number of pyridine rings in the molecule. As shown in Fig. 3, all the bipyridines exhibited low T80 values, so their model correctly identified that T80 decreases as the number of pyridine rings increases. This insight exemplifies the utility of a blinded test set that extends into a chemical space beyond that reported in the original published work.
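For readers who wish to reproduce this style of pipeline, a minimal sketch is given below. The `train`/`test` DataFrames with a "SMILES" column and a "T80" target are hypothetical stand-ins for the competition files, and the estimator used inside SelectFromModel was not disclosed, so a Lasso is used purely for illustration.

```python
# Minimal sketch of the winning strategy: all RDKit descriptors,
# top-35 feature selection on log(T80), then a linear-kernel SVR.
import numpy as np
import pandas as pd
from rdkit import Chem
from rdkit.Chem import Descriptors
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso
from sklearn.svm import SVR

def rdkit_descriptor_table(smiles_series):
    """Compute every available RDKit 2D descriptor per molecule."""
    rows = []
    for smi in smiles_series:
        mol = Chem.MolFromSmiles(smi)
        rows.append({name: fn(mol) for name, fn in Descriptors.descList})
    return pd.DataFrame(rows).fillna(0.0)

X_train = rdkit_descriptor_table(train["SMILES"])
X_test = rdkit_descriptor_table(test["SMILES"])
y_log = np.log(train["T80"])  # regress on log(T80) rather than T80

# Keep the 35 most informative descriptors; the selection estimator
# was not reported, so a Lasso stands in here.
selector = SelectFromModel(Lasso(alpha=0.01, max_iter=50_000),
                           max_features=35, threshold=-np.inf)
selector.fit(X_train, y_log)

svr = SVR(kernel="linear").fit(selector.transform(X_train), y_log)
pred_T80 = np.exp(svr.predict(selector.transform(X_test)))
# The competition metric (MSLE) corresponds to
# sklearn.metrics.mean_squared_log_error(y_true, pred_T80).
```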
Fig. 5 (a) Results of the top performing models from the Kaggle competition. (b) Distribution of scores for all submissions from the Kaggle competition.
The second-best model was trained by Valterri Valo (29 years old, Finland, data science/ML consultant), who, like the top-performing user, supplemented the original dataset with over 100 RDKit features as well as 100 Morgan fingerprints. Interestingly, Valterri augmented the data by adding methyl groups to, or replacing halogens on, the original molecules while retaining the same T80 values, generating 74 new molecules. He ultimately chose 13 features and trained an XGBoost model to produce an MSLE of 1.208 (Fig. 5a).
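The exact editing rules were not published; the sketch below illustrates one label-preserving augmentation of this kind (a halogen swap), reusing the same hypothetical `train` DataFrame as above.

```python
# Illustrative label-preserving augmentation: exchange one halogen for
# another and reuse the measured T80 for the edited molecule.
from rdkit import Chem
from rdkit.Chem import AllChem

HALOGEN_SWAPS = [("Br", "Cl"), ("Cl", "F")]  # illustrative choices only

def halogen_variants(smiles):
    """Yield SMILES strings with one halogen type exchanged for another."""
    mol = Chem.MolFromSmiles(smiles)
    for old, new in HALOGEN_SWAPS:
        query = Chem.MolFromSmarts(f"[{old}]")
        if mol.HasSubstructMatch(query):
            edited = AllChem.ReplaceSubstructs(
                mol, query, Chem.MolFromSmiles(new), replaceAll=True)[0]
            Chem.SanitizeMol(edited)
            yield Chem.MolToSmiles(edited)

# Each edited molecule inherits its parent's T80 value unchanged.
augmented = [(variant, t80)
             for smi, t80 in zip(train["SMILES"], train["T80"])
             for variant in halogen_variants(smi)]
```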
The best pretrained model was implemented by Nikita Sharma (22 years old, India, computer science undergraduate), who used the seyonec/ChemBERTa-zinc-base-v1 (ref. 32) model to embed each molecule and trained a Ridge regressor on the embeddings, producing an MSLE of 1.760 (Fig. 5a). Interestingly, the pretrained model did not lead to the best results, supporting the observation that problems arise when model complexity outstrips the small dataset sizes common throughout the chemical sciences.
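A minimal sketch of this embedding-plus-Ridge approach follows; only the model name and regressor were reported, so the mean pooling over token embeddings and the log-transformed target are our assumptions.

```python
# Sketch: ChemBERTa token embeddings, mean-pooled per molecule, feeding
# a Ridge regressor (pooling and log-target choices are assumptions).
import numpy as np
import torch
from sklearn.linear_model import Ridge
from transformers import AutoModel, AutoTokenizer

NAME = "seyonec/ChemBERTa-zinc-base-v1"
tok = AutoTokenizer.from_pretrained(NAME)
bert = AutoModel.from_pretrained(NAME)

@torch.no_grad()
def embed(smiles_list):
    """Return one mean-pooled embedding vector per SMILES string."""
    batch = tok(list(smiles_list), padding=True, return_tensors="pt")
    hidden = bert(**batch).last_hidden_state       # (n, n_tokens, 768)
    mask = batch["attention_mask"].unsqueeze(-1)   # ignore padding tokens
    return ((hidden * mask).sum(1) / mask.sum(1)).numpy()

ridge = Ridge(alpha=1.0).fit(embed(train["SMILES"]), np.log(train["T80"]))
pred_T80 = np.exp(ridge.predict(embed(test["SMILES"])))
```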
Overall, our Kaggle competition succeeded in engaging community scientists to improve upon our best previously published model. We were delighted to see participants use a variety of strategies, such as log-transforming the T80 measurements or augmenting the data with chemically modified molecules. In addition, participants uncovered new scientific trends in our dataset, such as the negative correlation between the number of pyridine rings and T80.
We envision the community-engaged test set paradigm as a new direction for community engagement and scientific outreach in the natural sciences. Instead of publishing the entire dataset produced from a synthetic campaign, research groups could (for example) withhold a small test set (∼10% of the data) for a public Kaggle competition. A small amount of work to clean the dataset into Kaggle's ML-accessible format, and to clearly explain the features and targets, paves the way for broad engagement and ML-driven discovery in chemistry. Alternatively, because automatable block chemistry has made the synthesis, and thus the testing, of new small molecules with a wide range of useful functions more accessible, the bar is lowered for researchers to generate post-publication datasets for community testing.
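As a concrete, hypothetical illustration, preparing such a release can be as simple as splitting off the held-out rows and writing the files Kaggle expects; all file and column names below (including the "id" column) are placeholders.

```python
# Hypothetical release layout: participants see train.csv (features +
# target) and test.csv (features only); the solution file stays private.
import pandas as pd

df = pd.read_csv("photostability_full.csv")   # placeholder file name
test = df.sample(frac=0.10, random_state=0)   # withhold ~10% of the data
train = df.drop(test.index)

train.to_csv("train.csv", index=False)
test.drop(columns=["T80"]).to_csv("test.csv", index=False)
test[["id", "T80"]].to_csv("solution.csv", index=False)  # host-only file

# A baseline sample_submission.csv keeps the required format unambiguous.
sample = test[["id"]].assign(T80=train["T80"].median())
sample.to_csv("sample_submission.csv", index=False)
```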
We provide a playbook for hosting your first Kaggle competition in the SI. From a broad perspective, hosting a Kaggle competition draws attention to the initial publication and garners interest in future advances on the topic. In this way, the community-engaged test set paradigm serves to democratize scientific discovery and align the objectives of experimental science and ML.
Given the success of this Kaggle competition, it is interesting to consider future adaptations of our approach to further engage community interest. One fruitful direction for future competitions is to integrate experimental design and validation more cohesively into the competition objective. For example, rather than asking participants to regress over the community-engaged test set, we could task competitors with training models and directly suggesting the next best experiments to run (i.e., the most informative molecule to synthesize on the grounds of exploration and exploitation of the design space), as sketched below. Our automated synthesis robots could then synthesize the suggested molecules and validate the hypotheses generated by the Kaggle competition participants. This paradigm would allow participants to contribute directly to the research while strengthening the chemical interpretation of the ML models. Such improved outreach strategies will aim to further democratize ML and chemistry within the broader community.
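One simple version of such an objective is sketched below under stated assumptions: a feature matrix `X_train` for the measured molecules, features `X_candidates` for proposable molecules, and a `candidates` table, all hypothetical. It scores candidates with an upper-confidence-bound rule that trades predicted T80 against model uncertainty.

```python
# Illustrative exploration/exploitation acquisition: rank candidate
# molecules by predicted log(T80) plus an uncertainty bonus taken from
# the spread of a random forest's individual trees.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

forest = RandomForestRegressor(n_estimators=500, random_state=0)
forest.fit(X_train, np.log(train["T80"]))

per_tree = np.stack([tree.predict(X_candidates)
                     for tree in forest.estimators_])
mean, std = per_tree.mean(axis=0), per_tree.std(axis=0)

kappa = 1.0                  # exploration weight; a free design choice
ucb = mean + kappa * std     # upper confidence bound per candidate
next_molecule = candidates.iloc[int(np.argmax(ucb))]  # propose for synthesis
```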
Supplementary information (SI) is available. See DOI: https://doi.org/10.1039/d5dd00424a.
This journal is © The Royal Society of Chemistry 2026