ProcedureT5: Adaptive Experimental Procedure Prediction with Data-Augmented Pre-Training and Multi-Source Data Integration
Abstract
Computer-aided synthesis planning (CASP) has shown strong potential to accelerate chemical research. However, a key challenge remains: the lack of effective automated techniques to translate computer-generated synthesis routes into executable experimental procedures, which still require extensive planning and evaluation by chemists. To address this gap, we introduce ProcedureT5, an approach that integrates chemistry-oriented pre-trained models with augmented multi-source datasets to enhance the prediction of experimental procedures across broader scenarios. Our method achieves state-of-the-art performance on the Pistachio dataset, a collection of reaction procedures derived from US patent literature, showing a 4-point increase in BLEU score and a 1.22% improvement in exact-match accuracy compared to existing methods. Additionally, we curate a small expert-annotated dataset, Orgsyn, consisting of verified organic synthesis procedures, to assess the model’s performance in more diverse applications. Fine-tuning ProcedureT5 on the Orgsyn dataset demonstrates its adaptability, yielding a BLEU score of 40.34 and an average similarity of 49.72%. This work underscores the crucial role of ProcedureT5 in bridging the gap between computational synthesis planning and practical laboratory implementation.
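To make the reported metrics more concrete, the following is a minimal sketch of how exact-match accuracy and an average similarity score could be computed over predicted and reference procedures. The abstract does not define its similarity measure, so the token-level `difflib.SequenceMatcher` ratio used here is an assumption, as are the function names and the toy action strings; this is not the authors' evaluation code.

```python
from difflib import SequenceMatcher


def _normalize(text):
    # Collapse runs of whitespace so formatting differences do not count as errors.
    return " ".join(text.split())


def exact_match_accuracy(predictions, references):
    """Percentage of predictions identical to their reference after normalization."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    hits = sum(_normalize(p) == _normalize(r) for p, r in zip(predictions, references))
    return 100.0 * hits / len(references)


def average_similarity(predictions, references):
    """Mean token-level similarity in percent (assumed metric, not the paper's definition)."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    ratios = [
        SequenceMatcher(None, p.split(), r.split()).ratio()
        for p, r in zip(predictions, references)
    ]
    return 100.0 * sum(ratios) / len(ratios)


# Hypothetical procedure strings, for illustration only.
preds = ["ADD water ; STIR 2 h ; FILTER", "ADD ethanol ; CONCENTRATE"]
refs = ["ADD water ; STIR 2 h ; FILTER", "ADD methanol ; CONCENTRATE"]

print(exact_match_accuracy(preds, refs))
print(average_similarity(preds, refs))
```

Exact match gives credit only for a fully correct procedure, while a similarity score also rewards predictions that differ from the reference in a single step, which is one reason the two numbers are usually reported together.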