Open Access Article
Ronak Tali,a Ankush Kumar Mishra,a Devesh Lohia,a Jacob Paul Mauthe,b Justin Scott Neu,c Sung-Joo Kwon,d Yusuf Olanrewaju,b Aditya Balu,a Goce Trajcevski,a Franky So,b Wei You,c Aram Amassian*b and Baskar Ganapathysubramanian*a
aIowa State University, Ames, IA 50010, USA. E-mail: rtali@iastate.edu; akmishra@iastate.edu; devesh@iastate.edu; baditya@iastate.edu; gocet25@iastate.edu; baskarg@iastate.edu
bNorth Carolina State University, Raleigh, NC 27695, USA. E-mail: jpmauthe@ncsu.edu; yaolanre@ncsu.edu; fso@ncsu.edu; aamassi@ncsu.edu
cUniversity of North Carolina, Chapel Hill, NC 27599, USA. E-mail: neuj@email.unc.edu; wyou@email.unc.edu
dUniversity of Washington, Seattle, WA 98195, USA. E-mail: sjkwon12@uw.edu
First published on 7th October 2025
The Shared Experiment Aggregation and Retrieval System (SEARS) is an open-source, lightweight, cloud-native platform that captures, versions, and exposes materials-experiment data and metadata via FAIR, programmatic interfaces. Designed for distributed, multi-lab workflows, SEARS provides configurable, ontology-driven data-entry screens backed by a public definitions registry (terms, units, provenance, versioning); automatic measurement capture and immutable audit trails; storage of arbitrary file types with JSON sidecars; real-time visualization for tabular data; and a documented REST API and Python SDK for closed-loop analysis (e.g., adaptive design of experiments) and model building (e.g., QSPR). We illustrate SEARS on doping studies of the high mobility conjugated polymer, pBTTT, with the dopant, F4TCNQ, where experimental and data-science teams iterated across sites using the API to propose and execute new processing conditions, enabling efficient exploration of ternary co-solvent composition and annealing temperature effects on sheet resistance of doped pBTTT films. SEARS does not claim novelty in these scientific methods; rather, it operationalizes them with rigorous provenance and interoperability, reducing handoff friction and improving reproducibility. Source code (MIT license), installation scripts, and a demonstration instance are provided. By making data findable, accessible, interoperable, and reusable across teams, SEARS lowers the barrier to collaborative materials research and accelerates the path from experiment to insight.
Advanced multiscale characterization techniques now yield rich multi-modal datasets2 – for example, structural, thermal, and electrical properties of a material might be measured across different scales and saved in diverse file formats (e.g., spectroscopy files, microscopy images). Organizing, storing, and retrieving such heterogeneous data streams pose significant challenges. Repeating measurements on the same specimen (common for reliability) further adds layers of data that must be tracked and versioned. To address these challenges and ensure that data can drive decision-making, the community is embracing FAIR data principles (Findable, Accessible, Interoperable, Reusable).3 Adhering to FAIR standards means recording detailed metadata (instrument settings, sample preparation details, etc.) using well-defined ontologies4 and making datasets available in standardized formats. Such practices not only improve reproducibility for independent researchers but also unlock new opportunities for integration with third-party tools. For instance, well-structured data can feed directly into ML algorithms or be queried by emerging large language model (LLM) assistants.5 Interoperability is particularly crucial as standardized data formats enable different experimental workflows and database systems to connect and exchange information seamlessly. This allows the formation of a federated ecosystem of knowledge rather than isolated data silos.
Another key aspect of this evolving landscape is collaborative, cloud-enabled research. Given the increasing complexity of modern materials problems, no single laboratory can house all necessary expertise or instrumentation, and geographically distributed teams are becoming the norm.6 To accelerate discovery, researchers need to collaborate across institutional boundaries, often in real time. This necessitates robust, cloud-based data infrastructure and electronic lab notebooks that multiple labs can access and contribute to concurrently. Essential features include customizability (to accommodate each lab's protocols and data types), reliability and version control (to track contributions and changes), and fine-grained access control (so that each team member can work on a shared experiment securely).7 A cloud-centric approach ensures that data and analysis tools are available on-demand to all collaborators, eliminating traditional barriers of local data storage and allowing experiments to be monitored or even steered remotely. In this context, rather than classical trial-and-error approaches, researchers can harness ML as a tool to predict outcomes and suggest promising experimental conditions.8 Once an ML model is well trained on existing data, it can rapidly screen hypothetical scenarios and recommend the most likely candidates for success, potentially narrowing down the experimental search space. This capability transforms the role of the experimenter – guiding experiments by computational insight – and accelerates the cycle of hypothesis to validation. Moreover, sharing such predictive models (in addition to raw data) between laboratories compounds these benefits, allowing other researchers to reuse trained models and build on them through transfer learning.
We present SEARS (Shared Experiment Aggregation and Retrieval System), an open-source, cloud-native platform that operationalizes these trends by capturing, versioning, and exposing materials-experiment data and metadata through interoperable, programmatic interfaces. SEARS provides configurable, ontology-driven data-entry screens; a scalable document store with raw-file storage and JSON sidecars; search, tagging, and built-in version control; and a documented REST API with a Python SDK for analysis and closed-loop experimentation. Multiple laboratories can contribute to a shared record in real time, with provenance (owner, lab, timestamps) recorded consistently, and FAIR-compliant exports available for publication or downstream tools. By design, SEARS can be self-hosted and extended (MIT license), allowing teams to integrate existing workflows while preserving reproducibility and auditability.
We illustrate SEARS with several case studies (Sections 6.1 and 6.2) including adaptive design of experiments (ADoE) and quantitative structure–property relationship (QSPR) modeling. Our claim is not novelty in ADoE or QSPR, but that SEARS provides the enabling infrastructure (provenance-rich capture, multi-lab coordination, and API/SDK access) that makes such established methods practical for distributed collaborative teams. We demonstrate this on distributed studies of pBTTT:F4TCNQ, where experiment and data-science teams iterated across sites to refine ternary co-solvent composition and annealing temperature and to train reproducible QSPR models using consistently annotated data.
Recognizing these issues, the community has argued for flexible and FAIR-centered data infrastructure in chemistry and materials. For instance, adopting FAIR principles in database design has been emphasized as crucial for enabling not only findability but also reusability of data across projects.13 A key recommendation is to record all experimental outcomes – including failed or null results – alongside successful data. Logging negative results provides context for interpreting model predictions and helps quantify experimental error, ultimately improving the robustness of any data-driven analysis. Some initiatives have attempted to modernize legacy chemistry databases with more flexible data models: the Royal Society of Chemistry's ChemSpider platform was partially rebuilt on a NoSQL backend to accommodate new data types.14 This change improved extensibility but resulted in a complex hybrid architecture (traditional relational databases coupled with NoSQL stores) that proved hard to maintain and scale. Other proposed solutions have approached data sharing by distributing the data directly to end-users. For example, the Materials Provenance Store (MPS)15 concept involves researchers downloading an entire copy of a database to their local machine and using custom scripts or SQL queries to mine the data. While this ensures full access to the dataset, it creates formidable challenges: every user must overcome steep technical setup and update the data copy regularly, and querying large datasets locally becomes inefficient.
These examples underscore the difficulty of designing data systems that are both powerful and user-friendly. More recent frameworks have capitalized on web technologies to broaden data accessibility. The NOMAD16 repository is one prominent web-based platform that allows materials scientists to upload their results along with rich metadata, making those data publicly searchable and viewable through an online interface. Notably, NOMAD incorporates advanced tools like AI-driven analysis modules and interactive visualizers to help users derive insights from shared data. However, platforms like NOMAD remain after-the-fact repositories – they are geared toward publishing completed results rather than facilitating real-time collaboration during the experimentation process. Once data are uploaded, contributing laboratories typically relinquish some control over how those data are managed and updated. This approach can yield valuable datasets but leaves little opportunity for new laboratories to directly participate or continuously contribute fresh data in an open forum. Finally, there have been case studies demonstrating integrated data workflows within individual labs. One recent example from an electrochemistry laboratory showcased a custom pipeline that spanned from automated data acquisition to analysis and visualization, using a suite of open-source tools orchestrated for that lab's needs.17 This “digital lab” case study highlighted the efficiency gains from linking instruments to data processing in a feedback loop. While this solution was highly specific to that lab's setup and was not released as a generalized tool for others, such pipelines illustrate how tightly integrated workflows can accelerate discovery. However, such tools often do not address multi-lab collaboration, focusing only on internal data management.
Within this broader push toward interoperability and reuse, materials-science ontologies typically organize concepts into four classes: substance, process, property, and environment.18 To make these classes actionable across collaborators and tools, it is useful to bind them to a single, public definitions registry that assigns each term a unique identifier and version; a human-readable label; a formal definition; units and, where applicable, permissible values; provenance (originating lab/person); creation and last-updated timestamps; and cross-references. Such a registry provides an unambiguous, searchable reference for every piece of information, improving consistency, reuse, and auditability across studies.
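As a concrete, purely illustrative sketch, a registry entry of this kind might be serialized as follows; the field names and the `sears:` identifier scheme are assumptions for illustration, not the actual SEARS schema:

```python
import json

# Hypothetical registry entry for one ontology term (all field names and
# values are illustrative, not the exact SEARS definitions registry).
term = {
    "id": "sears:sheet_resistance",        # unique identifier
    "version": "1.2.0",                    # version of this definition
    "label": "Sheet resistance",           # human-readable label
    "definition": "Resistance of a thin film, normalized per square.",
    "unit": "ohm/sq",
    "class": "property",                   # substance | process | property | environment
    "provenance": {"lab": "Example Lab", "person": "J. Doe"},
    "created": "2025-01-15T09:30:00Z",
    "updated": "2025-06-02T14:05:00Z",
    "cross_references": ["qudt:Resistance"],
}

# A registry keyed by identifier supports unambiguous lookups; round-tripping
# through JSON keeps the record interoperable across tools.
registry = {term["id"]: term}
payload = json.dumps(registry, indent=2)
entry = json.loads(payload)["sears:sheet_resistance"]
print(entry["label"], entry["unit"])
```

Because each term carries its own version, provenance, and timestamps, downstream tools can cite exactly which definition a measurement was recorded under.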
In this broad context, SEARS is designed to fill the existing gap by offering a unifying framework that builds on prior lessons – enabling broad participation, enforcing interoperability and FAIR standards, and empowering researchers to seamlessly share, retrieve, and act on data across laboratory boundaries. While SEARS is designed to be material agnostic, we illustrate the capabilities and utility of SEARS through several popular use cases and a detailed discussion of a multi-university case study to understand and increase the mobility of conjugated polymers through doping.
In the case studies reported here, we exercised these same capabilities end-to-end: data were captured under SEARS ontology terms, exported as FAIR packages, and programmatically consumed to support ADoE iterations and QSPR model training. To illustrate this workflow for readers, we provide resources comprising (i) a de-identified exemplar FAIR dataset with a corresponding CSV representation and (ii) a notebook demonstrating the analysis applied to one case study. The case-study summaries included here are on independent publication tracks; full datasets will be released with their respective publications.
Unlike SQL databases, which rely on tables, columns, and rows, NoSQL databases organize objects within flexible constructs known as documents. Importantly, these documents do not require uniform structures, allowing the storage of objects with varying attributes within the same collection. Additionally, NoSQL databases do not rely on traditional concepts like joins and normalization, prioritizing schema flexibility over strict relational consistency. However, this flexibility comes at a trade-off: NoSQL databases do not inherently provide ACID22 guarantees, which ensure transactional reliability in SQL databases. This is not a limitation in the context of SEARS, as ACID compliance is typically crucial for online transaction processing (OLTP),23 which is not the primary focus of our system.
Fig. 1 illustrates key features of our database schema. First, each experiment, along with its metadata and associated measurements, is stored as a self-contained, nested object, mirroring the native object structure of JavaScript.† Second, SEARS does not impose constraints such as predefined column names or data types, eliminating the need for rigid headers. Third, each experiment is treated as an independent unit, allowing documents with different structures to coexist within the same database collection. Fourth, this inherent flexibility enables modifications to experiment schemas dynamically, without affecting previously stored data. This adaptability extends to other system components, such as the application front end, allowing iterative improvements without disrupting historical records. Finally, for the document-at-a-time access pattern SEARS uses, NoSQL databases offer fast read and write operations, making them well-suited for handling large-scale experimental datasets efficiently.
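To make this schema flexibility concrete, the following minimal sketch shows two experiment documents with different structures coexisting in one collection; all field names and values are illustrative, not the actual SEARS schema:

```python
# Two experiment documents with different structures in one collection.
# No shared header is enforced: each document carries whatever keys it needs.
exp_a = {
    "name": "pBTTT-F4TCNQ-001",
    "metadata": {"lab": "Lab A", "owner": "user1", "created": "2025-03-01"},
    "measurements": {
        "sheet_resistance_ohm_per_sq": [412.5, 398.2, 405.1],
        "anneal_temp_C": 120,
    },
}
exp_b = {
    "name": "pBTTT-F4TCNQ-002",
    "metadata": {"lab": "Lab B", "owner": "user2", "created": "2025-03-04"},
    # A different schema: this document references a raw spectra file and
    # omits the annealing temperature entirely.
    "measurements": {
        "uv_vis_file": "uvvis_002.csv",
        "sheet_resistance_ohm_per_sq": [377.9],
    },
}
collection = [exp_a, exp_b]

# Schema-free iteration: read whatever keys each document happens to have.
names = [doc["name"] for doc in collection]
print(names)
```

Adding a new measurement type later means adding a key to new documents only; previously stored documents are untouched.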
We ensured our components were loosely coupled and interacted with each other purely via REST APIs.21 This means that as long as the API messaging interfaces remain consistent, we can deploy an updated version of any single component with only momentary downtime for that component and no impact on the others. It also means that newer components can be plugged in without affecting running components and with no downtime. Second, we adopted a cloud-based deployment model, stitching together services from different cloud providers. This heterogeneity allows us to utilize best-in-class services from different providers and, importantly, avoids a single point of failure. Finally, we focused on building a joint ownership model of the experiment data. We do this by providing a complete set of features for research laboratories to seamlessly work together on a single experiment, with full visibility into every new experimental data point or file entered into SEARS. We also make it easy for all members to track the entire history of an experiment.
• Central dashboard: the central dashboard serves as the primary interface for users upon logging into SEARS. It provides a comprehensive control panel where users can view, create, update, upload, download, search, and share experimental data. Each experiment entry displays essential identifiers such as experiment name, creation date, associated laboratory, and owner. To facilitate rapid access, each experiment is assigned a unique Quick Response (QR) code, which can be scanned to quickly retrieve experiment details. Additionally, the dashboard features a search bar for users to efficiently locate experiments by name. The dashboard includes filters for Owner and Lab to scope views and downloads to specific contributors or groups.
• New experiment creation: this feature enables users to enter all relevant metadata associated with an experiment on a single screen. The interface organizes metadata into clearly defined sections for ease of use. A key characteristic of this functionality is that once an experiment is created, its metadata remains immutable, ensuring consistency and integrity.
• Viewing and editing an existing experiment: SEARS provides a structured interface for managing experimental data, organized into multiple tabs representing different measurement categories, such as sample thickness. Within each tab, users can input measurement values for different batches as needed. If a category requires the upload of raw data files (e.g., current–voltage characteristics), users can drag and drop the relevant CSV file into the designated area. Once uploaded, these files are available for download and visualization. By default, the first two columns of a CSV file are automatically assigned as the x and y values for data plotting, streamlining the visualization process. Note that while real-time plotting is supported only for uploaded CSV files, files of any type can be uploaded.
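The default plotting convention just described (first CSV column as x, second as y) can be sketched in a few lines; the file contents and column names here are hypothetical:

```python
import csv
import io

# Hypothetical uploaded CSV: current-voltage data with an extra column.
csv_text = (
    "voltage_V,current_A,note\n"
    "0.0,0.00,dark\n"
    "0.5,0.01,dark\n"
    "1.0,0.03,light\n"
)

rows = list(csv.reader(io.StringIO(csv_text)))
header, data = rows[0], rows[1:]

# Convention: first column -> x axis, second column -> y axis.
x = [float(r[0]) for r in data]
y = [float(r[1]) for r in data]
print(header[0], x, header[1], y)
```

Any further columns (here, `note`) are simply ignored by the default plot but remain available in the stored file.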
NoSQL databases are designed with the principle that data accessed together should be stored together. This contrasts with SQL databases, which rely on schema normalization to minimize redundancy. NoSQL databases follow the document object model, where documents (analogous to rows in SQL databases) are grouped into collections (similar to tables in SQL). A key advantage of this approach is its inherent flexibility: unlike SQL databases that enforce rigid column structures, NoSQL databases allow dynamic and adaptive data storage. Documents store information as key-value pairs, provided the values are serializable and can be transmitted over a TCP network. Additionally, key-value pairs can be encapsulated within arrays, allowing for hierarchical and structured data storage.
An inspection of Fig. 1 reveals that these objects closely resemble JSON structures, which are widely used as the de facto standard for internet data exchange. SEARS' front-end application converts all incoming metadata and measurements into a single JSON payload, which is then transmitted to the MongoDB database. MongoDB processes this payload and stores it as a BSON document, a format that closely mirrors JSON but with additional optimizations for storage efficiency. Once stored, MongoDB assigns a unique experiment ID, which allows the front-end application to retrieve and organize experiment data efficiently. Furthermore, all stored data undergo automatic backups to ensure protection against potential data loss, maintaining the integrity and reliability of the system.
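The round trip from form data to a stored document can be sketched as follows. The `insert_one`/`find_one` calls mirror pymongo's Collection interface, but an in-memory stand-in replaces MongoDB here so the sketch runs without a live database; all names and values are illustrative:

```python
import json
import uuid

def build_payload(metadata, measurements):
    """Merge metadata and measurements into a single JSON payload,
    as the SEARS front end does before transmission."""
    return json.dumps({"metadata": metadata, "measurements": measurements})

class FakeCollection:
    """Minimal in-memory stand-in for a MongoDB collection."""
    def __init__(self):
        self.docs = {}

    def insert_one(self, doc):
        doc_id = uuid.uuid4().hex  # MongoDB would assign an ObjectId here
        self.docs[doc_id] = doc
        return type("Result", (), {"inserted_id": doc_id})()

    def find_one(self, query):
        return self.docs.get(query["_id"])

col = FakeCollection()
payload = build_payload({"lab": "ISU"}, {"thickness_nm": 42.0})
result = col.insert_one(json.loads(payload))

# The returned ID lets the front end retrieve the experiment later.
fetched = col.find_one({"_id": result.inserted_id})
print(fetched["measurements"]["thickness_nm"])
```

Against a real deployment, `FakeCollection` would be replaced by a pymongo `Collection`, and MongoDB would store the payload as BSON, as described above.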
We would like to make a special note of the material-agnostic nature of SEARS. This means that the way data is stored and ultimately viewed with SEARS can be completely customized at the time of deployment. This includes both the metadata and the measurements. While we have pre-configured SEARS for illustrative purposes, we provide a detailed SEARS customization guide with examples in our software repo. Section 5.7 provides more details on customization.
To enhance data durability, a replica of each file is simultaneously stored in an in-house long-term storage system. To further safeguard against data loss, each storage location employs triple replication, meaning that every file is backed up three times per location. As a result, each file in the SEARS framework is maintained in six independent copies, distributed across both cloud and local storage infrastructures. This comprehensive replication strategy ensures that all high-value experimental data remain secure and resilient against potential system failures or accidental deletions.
Fig. 3 illustrates the SEARS API ecosystem, which is encapsulated within a secure authentication layer. Within this framework, we implement/utilize several specialized API categories:
• Database APIs: these APIs support CRUD (Create, Read, Update, and Delete) operations for managing experiment data in MongoDB. The front-end application interacts with these APIs to execute all database-related transactions. To optimize cost and performance, SEARS leverages MongoDB Atlas cloud functions for executing database operations efficiently.
• Storage APIs: these APIs handle file storage and retrieval by interacting directly with both cloud-based and in-house storage solutions. To maintain consistency, all write operations are executed in parallel across both storage locations, ensuring redundancy and data integrity. The storage APIs are implemented using the FastAPI framework, selected for its performance and robust security features.
• Internal APIs: in addition to core database and storage functions, SEARS includes internal APIs dedicated to logging and system health monitoring. These APIs are restricted to framework administrators and are not accessible to standard users.
• Native MongoDB APIs: SEARS also provides direct access to MongoDB's native APIs, allowing users to retrieve structured experimental data for advanced analysis, machine learning workflows, and data-driven decision-making.
This API-driven architecture ensures that SEARS remains modular, scalable, and efficient, enabling simple integration with external tools and emerging data science applications.
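As an illustration of how a thin client might target such an API, the sketch below only constructs request URLs and headers (no network access is performed); the endpoint paths and bearer-token header are assumptions for illustration, not the documented SEARS routes:

```python
from urllib.parse import urljoin, urlencode

class SearsClient:
    """Hypothetical thin client for a SEARS-style REST API.
    Routes and auth scheme are illustrative; consult the repository's
    API documentation for the actual interface."""

    def __init__(self, base_url, token):
        self.base_url = base_url
        self.headers = {"Authorization": f"Bearer {token}"}

    def _url(self, path, **params):
        url = urljoin(self.base_url, path)
        return f"{url}?{urlencode(params)}" if params else url

    # CRUD-style helpers mapping onto the database APIs.
    def get_experiment_url(self, experiment_id):
        return self._url(f"/experiments/{experiment_id}")

    def search_url(self, name):
        return self._url("/experiments", name=name)

client = SearsClient("https://sears.example.org", token="demo-token")
print(client.search_url("pBTTT-F4TCNQ-001"))
```

In practice the constructed URLs would be passed to an HTTP library such as `requests`, with the stored `headers` supplying authentication.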
We close this section with a note that the implementation of SEARS is publicly available on GitHub. The repository contains complete instructions for downloading, setting up, and customizing SEARS to your needs. In addition, we provide a demonstration video of SEARS here. We also want to make a special note of the role AI-based code editors can play in quickly customizing SEARS. Fully customizing SEARS involves edits to at most one or two files; see the detailed instructions in our repo. Please refer in particular to our customization guide, which provides illustrative examples of how to customize SEARS using AI prompts. We anticipate that AI editors will easily recognize the repeated patterns in our logic and quickly make edits to suit the end user's needs.
We anticipate SEARS to evolve over time as we incorporate an increasing number of features in future iterations. Accordingly, our change management policy will entail the following: first, our data schema will remain incremental, that is, we will never delete data elements, only make additions as needed; second, we will provide migration scripts that allow bulk transfer of documents from an existing schema to a new schema; and third, new versions of SEARS will be distributed as new code repositories without touching those of older versions. These safety features allow SEARS users to remain confident that, in case of any issue during version migration, they can always switch back to an older version.
While ADoE is a well-established approach in materials science25,26 for optimizing experimental conditions,27 the use of SEARS enables a collaborative, distributed workflow that would be impractical with traditional methods. In particular, SEARS allows experimental data generated at one or more laboratories to be uploaded, accessed, and analyzed in real time by geographically distributed research teams. This streamlines the iterative process of model updating and experimental planning: multiple research groups can simultaneously contribute new experimental results, which are immediately available for updating the regression models that drive ADoE optimization. This shared, cloud-based infrastructure removes barriers to collaboration, reduces data silos, and accelerates convergence towards optimal processing conditions. Thus, SEARS facilitates a more efficient, collaborative, and reproducible ADoE workflow, particularly when dealing with large, heterogeneous datasets from multiple contributors.
In our use case, the experimental team uploaded their data to SEARS, where the data science group, at a different geographical location, accessed it via a Python script. Using this data, an ADoE framework was applied to propose the next set of processing conditions, refining ternary co-solvent concentrations and polymer annealing temperatures based on prior results. Over three iterative campaigns, SEARS facilitated seamless data exchange, enabling the identification of regions associated with both high and low sheet resistance. The initial processing conditions were selected via Latin hypercube sampling, while subsequent adjustments were determined through ADoE optimizations, as shown in Fig. 4. The SEARS platform functioned as a central intermediary between experimentalists and data scientists, ensuring smooth integration. The ADoE samples were further used to understand important features that influence the sheet resistance. A detailed analysis of the features is part of a separate publication.28
Fig. 4 ADoE batches to determine regions of higher (red points) and lower (blue points) sheet resistance, using the SEARS portal as an intermediary between the experimentalists and data scientists.
To establish a robust predictive mapping, we trained several machine learning models using the aggregated data; the Random Forest algorithm yielded the best predictive performance. The dataset was split into an 80:20 train–test ratio, with model results summarized in Table 1. Once trained, the QSPR model enabled rapid prediction of sheet resistance throughout the design space, as illustrated in Fig. 5.
| Model | Input | Output | R2 |
|---|---|---|---|
| Random Forest | % CB, % DCB, % Tol, anneal temp (°C) | Sheet resistance (Ω/□) | 0.41 |
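A minimal sketch of such a Random Forest QSPR fit with an 80:20 split is shown below; synthetic data stand in for the experimental measurements (which accompany a separate publication), so the resulting score does not reproduce Table 1:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 200
# Synthetic features: % CB, % DCB, % toluene, anneal temperature (degC).
X = np.column_stack([
    rng.uniform(0, 1, n),
    rng.uniform(0, 1, n),
    rng.uniform(0, 1, n),
    rng.uniform(80, 200, n),
])
# Synthetic target standing in for sheet resistance (ohm/sq).
y = 500 - 200 * X[:, 0] + 0.5 * X[:, 3] + rng.normal(0, 20, n)

# 80:20 train-test split, mirroring the protocol described above.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
model = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_tr, y_tr)
print(f"test R^2 = {r2_score(y_te, model.predict(X_te)):.2f}")
```

In the deployed workflow, `X` and `y` would instead be assembled from records pulled through the SEARS API, so retraining after each new upload is a single script run.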
SEARS played a crucial role by providing a unified repository where experimental data from various sources could be uploaded, curated, and accessed in real time. This centralized infrastructure supports continuous and collaborative model development: as new experimental results are contributed by any participating lab, the QSPR models can be instantly retrained and updated, enabling near real-time feedback on experimental progress and facilitating data-driven decision-making. The platform's architecture thus streamlines the otherwise labor-intensive processes of data collation and version control, ensuring data integrity and accessibility for all collaborators.
In addition to processing parameters, we extended our QSPR modeling efforts to incorporate spectral data uploaded to SEARS. This enabled a systematic, data-driven identification of spectral features most relevant for explaining variations in sheet resistance and conductivity, and allowed us to directly compare these features with those identified by expert judgment. A detailed analysis of featurization techniques, QSPR models, and feature importance ranking is presented in a separate publication.29
While machine learning approaches to QSPR modeling are well-established,30 the integration with SEARS unlocks several practical advantages. Most notably, it enables interactive and adaptive experimentation: researchers can monitor QSPR model performance in real time, guiding decisions on when sufficient data has been collected or when further experiments are warranted. Furthermore, by aggregating data from multiple laboratories, SEARS supports the construction of more generalizable and robust predictive QSPR models, accelerating both discovery and validation processes in materials science.
Consistent with FAIR principles,32 we ensure that each JSON key is fully defined and documented via our FAIR metadata server, accessible here. The use of JSON format ensures platform agnosticism and serialization, making the data easily accessible and interoperable across most programming environments. By aligning SEARS with FAIR standards, we enhance data reusability, interoperability, and integration with external computational tools, further strengthening its utility in scientific research.
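One practical consequence of documenting every JSON key is that exported records can be validated mechanically; the sketch below checks a record against a hypothetical set of registered keys (both the keys and the record are illustrative):

```python
import json

# Hypothetical set of keys defined on the FAIR metadata server.
registered_keys = {"name", "lab", "sheet_resistance_ohm_per_sq"}

# A record as it might appear in a FAIR export.
record = json.loads(
    '{"name": "exp-1", "lab": "Lab A", "sheet_resistance_ohm_per_sq": 410.2}'
)

# Every key in the record should have a published definition.
undefined = [k for k in record if k not in registered_keys]
assert not undefined, f"keys missing definitions: {undefined}"
print("all keys defined")
```

Such a check can run automatically on export, catching undocumented fields before data leave the platform.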
Additionally, such a database fosters transparency, minimizes redundancy in data collection, and promotes interdisciplinary collaboration. In this case study, we combined complementary four-point probe current–voltage measurements with UV-visible absorption spectra and X-ray scattering data from instruments at NC State University and Brookhaven National Lab (NSLS-II). SEARS facilitated data hosting and visualization, enabling direct comparison of duplicate samples analyzed at both facilities, as illustrated in Fig. 6.
As discussed in Section 2, there have been efforts to create systems that improve collaboration among researchers; however, given the generality of SEARS, we anticipate broader adoption of SEARS (and SEARS-like tools) across the research community, expanding its role as a scalable and adaptable solution for experimental data management. As part of future work, we intend to extend SEARS to enable seamless inclusion of novel experimental templates/use cases and to remain adaptable with respect to evolving FAIR regulations. We also intend to include an optional notification trigger in future versions of SEARS, allowing lab members to be notified whenever there is new activity in SEARS; this could be beneficial in scenarios where experiments are run infrequently.
Footnote
† This aligns naturally with how JavaScript defines objects.
This journal is © The Royal Society of Chemistry 2025