Nico C. Röttcher,*ab Gun D. Akkoc,ab Selina Finger,ab Birk Fritsch,a Jonas Möller,a Karl J. J. Mayrhofer ab and Dominik Dworschak*a
aForschungszentrum Jülich GmbH, Helmholtz Institute for Renewable Energy Erlangen-Nürnberg, Cauerstr. 1, 91058 Erlangen, Germany. E-mail: n.roettcher@fz-juelich.de; d.dworschak@fz-juelich.de
bDepartment of Chemical and Biological Engineering, Friedrich-Alexander-Universität Erlangen-Nürnberg, Cauerstr. 1, 91058 Erlangen, Germany
First published on 11th January 2024
The pressing need for improved energy materials calls for an acceleration of research to expedite their commercial application for the energy transition. To explore the vast number of material candidates, developing high-throughput setups as well as enhancing knowledge production by synthesis of published data is essential. Therefore, managing data in a clearly defined structure in compliance with the FAIR principles is imperative. However, current data workflows from laboratory to publication often result in poorly structured data, limiting acceleration in materials research. Here, we demonstrate a comprehensive data management tool that structures data throughout its life-cycle from acquisition, analysis, and visualization to publication by means of an SQL database. This guarantees a persistent relation between experimental data, corresponding metadata, and analysis parameters. The manual interaction required to handle data is reduced to a minimum by integrating data acquisition devices (LabVIEW), defining script-based data analysis and curation routines, and procedurally uploading data upon publication (Python). By keeping this link, published data can be traced back to the underlying experimental raw data and metadata. While we highlight our developments for operando coupled electrochemical experiments, the approach is generally applicable to other fields given its inherent customizability. The design of such automated data workflows is essential to develop efficient high-throughput setups. Further, they pave the way for self-driving labs and facilitate the applicability of machine learning methods.
However, navigating the haystack of previously published data suffers from the limitations of text-based search. Despite advances in scraping and tabulating data from the literature by using Natural Language Processing (NLP) techniques14–17 in tandem with data extraction techniques from figures, the quality of the obtained data is still limited due to the incompleteness and lack of structure of published data.18 Moreover, due to societal bias, the publication of unsuccessful attempts is often neglected, despite being essential not only for modeling but also to prevent other researchers from conducting redundant experiments.19–21 By structuring research data, the application of ML techniques can benefit twofold: from simpler inclusion of unsuccessful experiments in a tabulated format as well as from more reliable extraction of input data.20 Thus, it is clear that publishing all data in a tabulated and structured format is one of the most pressing improvements in the publishing of scientific data.22
To overcome the lack of data quality, the FAIR principles were defined – a guideline for researchers on how to manage and share data to render it Findable, Accessible, Interoperable, as well as Reusable and in this way improve the quality of big data approaches.22 Specifically, they highlight the importance of linking experimental data with enriched metadata and any data treatment, and of openly sharing original data alongside publications. To fulfill these requirements, structured databases have been developed in many scientific fields, such as ISA in the field of life science,23 Open Phacts in pharmaceutical research,24 and NOMAD for (computational) materials research.25 For experimental materials research in general, and electrochemistry in particular, the design of a common database is challenging given the vast diversity of experimental approaches. Along this way, there are advances in defining a standardized ontology for battery research which sets the ground for sharing research data across different groups.26,27 On a more specialized scale, echemdb provides a database solution for cyclic voltammetry (CV) data.28 Beyond uploading data into databases, numerous data repository services have been established. For electrochemical research, however, there are only a few experimental publications taking steps toward publishing raw data and detailed descriptions of processing procedures.29–31 FAIR-compliant publishing of electrochemical data is far from being standard despite being essential to compare and reproduce experimental results.32,33 Insufficient time has been identified as a key barrier for researchers to share their data, partly because data preparation and structuring are often left until shortly before publication.34 In contrast, managing research data in a structured way from the initial acquisition up to its publication, as outlined in the FAIR principles, is key to enable open data sharing in a time-efficient manner. However, a comprehensive data workflow is still missing for materials science in general and electrochemistry in particular.35
Instead, conventional data management often relies on dynamically growing folder structures with customized file naming schemes for data and metadata files as well as handwritten notes. To keep files organized and links between metadata and data comprehensible, strategies for the proper naming of files and folders have been suggested.36,37 However, as 'relevant' parameters differ between researchers even within the same research group and are likely to change over the course of a single researcher's study, diverging naming conventions will evolve. On top of that, the processing of data is often performed using click-based graphical user interfaces (GUIs), lacking scalability and tracking of data lineage.
While this approach is flexible and easy to use at the beginning, the effort to keep information organized in multiple files without inherently linking experimental data, metadata, and processed data grows steadily over the course of an experimental study. Furthermore, considering the reuse of data by other (future) researchers within the same group and of any published data by the whole community, it is clear that understanding and tracking data lineage within a file-folder structure is a time-consuming task. If not undertaken, however, information that could be extracted from the data, or by correlating it with other data, is lost – an issue that has to be prevented, especially regarding the application of advanced data science methods.
Thus, a comprehensive data management strategy is key to ensure the quality of data and avoid time-consuming data (re)structuring at a later stage of an experimental study. To support researchers on the way toward comprehensively structured data, a wide range of electronic lab notebooks (ELNs) have been developed.25,38,39 ELNs assist in structuring data by creating templates for experiments with metadata being filled in pre-defined fields and data files being attached to the entry. A database in the backend manages the links between experimental data, metadata, and entries of physical items. Yet, these approaches do not a priori provide a data processing platform, require substantial user interaction, and are limited in customizability. Alternatively, the concept of laboratory-scale individualized databases is suggested.40–42
Based on this, we showcase here a comprehensive workflow integrating data acquisition devices, analysis, visualization, and FAIR-compliant publishing of data, already applied successfully in recent work.43 This tool is built on off-the-shelf software packages for data storage (SQL database) and data processing (Python44 and its diverse data analysis libraries45–49). Thus, this system offers high customizability and simplicity in development given the maturity of the used software and the support of active communities. While the results are widely applicable for research data management, we highlight examples from the field of electrochemistry related to energy materials.
Fig. 1 Illustration of a comprehensive research data workflow integrating acquisition, analysis, visualization and publication.
Interpretation and visualization of this data produce new information and, therefore, new data. Subsequently, new experiments can be planned by the researchers. Additionally, with structured data at hand, ML methods can be readily utilized to explore and exploit the experimental space. Furthermore, structured data is seamlessly added when publishing the experimental findings in compliance with the FAIR principles. This increases the trustworthiness of the interpretation and enables reusability of the data. Finally, structuring data from the very beginning of an experimental study is key to enhance scientific knowledge production.
To achieve this abstraction, different database management systems are available. The most mature and most widely applied solutions are relational databases, with developments dating back to the 1970s.50,51 Relational databases organize data in a schema – a rigid network of tables with specified relations between their columns. Beyond that, there are more modern post-relational databases such as key-value, column, document-oriented, and graph databases, which are more flexible in handling less structured data, allow more versatile connections between data, and are easier to maintain when data structures change.52 However, in the scope of designing data workflows for experimental setups with specified parameters for metadata and result data, a relational database is an appropriate choice. Considering that researchers are familiar with data in tabulated form, the design process of a relational database schema is intuitive and supports a clear definition of a complete metadata set for the experimental setups. Furthermore, in light of the widespread application of relational databases, there is a large community further supporting the design process through the vast availability of solutions for common problems. Nevertheless, as experimental setups and workflows may develop with time, revision of the database schema will be required. This underscores the importance of considering possible developments already in the initial design process.
Besides that, considering the wide spectrum of programming knowledge in materials research-related degree programs, the interaction with a relational database – selecting, inserting, and updating entries – integrates intuitively into a researcher's workflow given its simple but powerful structured query language (SQL). Furthermore, this interaction can be constrained, i.e., the modification of raw data can be restricted, ensuring the integrity of the original measurement data. Finally, active learning techniques with common surrogate functions such as Gaussian Process Regression and Random Forest require tabulated data, thus giving an intrinsic compatibility with relational databases.
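As a minimal sketch of this interaction, the snippet below inserts and selects entries with parameterized SQL statements. It uses Python's built-in sqlite3 module purely for illustration (in a MySQL-based setup the same statements apply analogously); the table and column names are hypothetical.

```python
# Minimal sketch of SQL interaction from Python (sqlite3 for illustration;
# table and column names are hypothetical).
import sqlite3

con = sqlite3.connect("example_lab.db")
cur = con.cursor()
cur.execute("CREATE TABLE IF NOT EXISTS samples (id_sample INTEGER PRIMARY KEY, material TEXT)")

# Insert a new inventory entry (parameterized statement)
cur.execute("INSERT INTO samples (material) VALUES (?)", ("Pt polycrystalline",))
con.commit()

# Select entries by a metadata condition
cur.execute("SELECT id_sample, material FROM samples WHERE material LIKE ?", ("%Pt%",))
print(cur.fetchall())
con.close()

# In a multi-user MySQL setup, write access to raw-data tables can additionally be
# constrained, e.g.:  GRANT SELECT, INSERT ON lab_db.* TO 'user'@'%';
```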
For the simple case of an electrochemical experiment, an exemplary database schema is outlined in Fig. 2. This structure is divided into three categories: (i) the inventory of physical objects used in the experiment, (ii) the experimental (meta)data itself, and (iii) processing data created during the analysis of the raw experimental data. A complete database schema as used within our group is added as an ESI file† and further described in Section 1 of the ESI.†
For each experiment, a new entry in the ec metadata-table is created and linked via an identifier to the samples-table and, thus, to its properties. The relation is defined as a one-to-many relation, meaning an experiment can only include one sample, while many experiments can be conducted on one sample. Similarly, the user, the used electrolyte, and other metadata parameters are linked.
Often in electrochemistry, multiple techniques are run sequentially in a batch. Recording of metadata can thus be defined in a per-batch or per-technique manner. To enable tracking of changes of metadata parameters between techniques of the same batch, such as a change of the gas purging the electrolyte or of the rotation rate in a rotating disk electrode (RDE) experiment, metadata is recorded on the level of a single technique. To identify techniques of the same batch, a batch identifier is added to the ec metadata-table.
Metadata parameters specific to the electrochemical technique, such as the scan rate of a CV experiment or the frequency range in an electrochemical impedance spectroscopy (EIS) experiment, are stored in separate tables linked to the main metadata table. This schema is extended in the same way to cover a complete metadata set so that all parameters required to reproduce the experiment can be stored in one place. This also includes environmental parameters such as room temperature or relative humidity. As a result, the metadata is distributed over multiple connected tables. By defining a database view, all metadata can be joined into a single table for each kind of experiment, which decreases complexity when selecting experiments while keeping the quality constraints imposed by the definition of multiple tables.
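The following sketch illustrates such a metadata schema as SQL statements executed from Python. The table and column names are simplified, hypothetical stand-ins for the schema of Fig. 2 and the ESI; sqlite3 syntax is used for illustration.

```python
# Sketch of a reduced metadata schema (hypothetical names, sqlite3 syntax).
import sqlite3

schema = """
CREATE TABLE users   (name_user TEXT PRIMARY KEY);
CREATE TABLE samples (id_sample INTEGER PRIMARY KEY, material TEXT, geometric_area_cm2 REAL);

-- one experiment references exactly one sample and user (one-to-many relation)
CREATE TABLE exp_ec (
    id_exp_ec  INTEGER PRIMARY KEY,
    name_user  TEXT    REFERENCES users(name_user),
    id_sample  INTEGER REFERENCES samples(id_sample),
    id_batch   INTEGER,             -- groups techniques run in the same batch
    technique  TEXT,                -- 'CV', 'EIS', ...
    t_start    TEXT
);

-- technique-specific parameters in a separate, linked table
CREATE TABLE exp_ec_cv (
    id_exp_ec      INTEGER PRIMARY KEY REFERENCES exp_ec(id_exp_ec),
    scan_rate_mV_s REAL,
    E_lower_V      REAL,
    E_upper_V      REAL
);

-- a view joins the distributed metadata back into one table per experiment type
CREATE VIEW exp_ec_cv_expanded AS
    SELECT e.*, s.material, s.geometric_area_cm2,
           c.scan_rate_mV_s, c.E_lower_V, c.E_upper_V
    FROM exp_ec e
    JOIN samples s   ON s.id_sample = e.id_sample
    JOIN exp_ec_cv c ON c.id_exp_ec = e.id_exp_ec;
"""

con = sqlite3.connect(":memory:")
con.executescript(schema)
```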
For the acquired data, further tables are defined. Data of the same kind is stored in a single table with an experiment identifier linked to its metadata, in contrast to a file-based system where each experiment produces a new file and the link is created by its name and folder path. In the case of electrochemical data, due to its different domains (time and frequency), one table for direct current experiments (CV, CA, etc.) and another for alternating current experiments (EIS) are defined.
Besides the experimental data, tables for processed data can also be defined. In these, the results of an analysis, such as the electrochemically active surface area (ECSA) extracted from a CV experiment, are stored. Additionally, any parameters introduced during the analysis, such as the integration limits, are included. Linking processed data tables to the originating experiment ensures data lineage tracking.
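Continuing the schema sketch above, hypothetical tables for raw time- and frequency-domain data as well as for processed ECSA results could look as follows; each row links back to its experiment via the experiment identifier.

```python
# Sketch of data and processed-data tables (hypothetical names, sqlite3 syntax).
import sqlite3

schema = """
-- time-domain (direct current) data: one row per data point
CREATE TABLE data_ec (
    id_exp_ec INTEGER,              -- references exp_ec(id_exp_ec)
    t_s       REAL,
    E_WE_V    REAL,
    I_A       REAL
);

-- frequency-domain (EIS) data
CREATE TABLE data_eis (
    id_exp_ec  INTEGER,
    f_Hz       REAL,
    Z_real_ohm REAL,
    Z_imag_ohm REAL
);

-- processed data: analysis result stored together with its input parameters
CREATE TABLE ana_ecsa (
    id_exp_ec       INTEGER,        -- originating CV experiment (lineage)
    E_lower_limit_V REAL,           -- integration limits chosen for H_upd
    E_upper_limit_V REAL,
    Q_Hupd_C        REAL,
    ECSA_cm2        REAL,
    name_script     TEXT,           -- applied analysis script and its version
    version_script  TEXT
);
"""
sqlite3.connect(":memory:").executescript(schema)
```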
Thus, in contrast to a file-based approach, a direct link among metadata, experimental data, and processed data is ensured by design. Thereby, structuring data in a relational database fulfills the FAIR principles more easily than a folder-file structure. In addition, organizing the data in a relational database enables simple finding and selecting of specific experiments by their metadata using SQL queries.
In conventional workflows, proprietary acquisition software as delivered by the manufacturer is used to control the experiment. Experimental data and inherent metadata such as the electrochemical protocol are exported into files (see Fig. 3a). To establish the link between the experiment and additional metadata like the used electrodes or electrolyte, a second user interaction is required during the insertion of the file into the database. Similarly, in the case of ELNs, the measurement files are linked to a separately created entry. In addition, for experiments with multiple devices, usually multiple software packages have to be handled, as an integration of other devices is in most cases not possible. Consequently, multiple files with (meta)data have to be inserted, increasing the complexity of this workflow.
In contrast, as depicted in Fig. 3b, customizing a single acquisition software to communicate with multiple devices and directly with the database offers multiple advantages. (i) (Meta)data acquired by all devices are inherently linked, which simplifies the correlation of their results. (ii) A second step to link additional metadata to existing database inventory tables is not necessary, as it can be set directly in the acquisition software before performing the experiment (see Fig. S1†). (iii) No user interaction is required to name, link, and transfer data, enabling remote operation of the experimental setup and, thus, fulfilling one of the main prerequisites of self-driving laboratories. As these customizations are not available within commercial potentiostat software, this approach relies on using a potentiostat with a programmable interface controlled by custom-built acquisition software, based, for instance, on LabVIEW or Python.53,54
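The following is an illustrative sketch only of such a database-integrated acquisition loop. The Potentiostat class is a hypothetical placeholder for any device with a programmable interface (e.g., driven via LabVIEW or a vendor SDK), not a real library, and the tables follow the simplified schema sketched earlier.

```python
# Illustrative sketch: acquisition loop writing metadata and data directly to the database.
import sqlite3

class Potentiostat:                      # hypothetical device wrapper, not a real library
    def start_cv(self, scan_rate_mV_s): ...
    def read_point(self):                # would return (t_s, E_V, I_A), or None when done
        return None

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE IF NOT EXISTS exp_ec  (id_exp_ec INTEGER PRIMARY KEY, name_user TEXT, id_sample INTEGER, technique TEXT);
CREATE TABLE IF NOT EXISTS data_ec (id_exp_ec INTEGER, t_s REAL, E_WE_V REAL, I_A REAL);
""")
pstat = Potentiostat()

# 1) create the metadata entry and obtain its identifier before measuring
cur = con.execute(
    "INSERT INTO exp_ec (name_user, id_sample, technique) VALUES (?, ?, ?)",
    ("jdoe", 1, "CV"),
)
id_exp_ec = cur.lastrowid

# 2) stream data points into the database while the experiment runs
pstat.start_cv(scan_rate_mV_s=50)
while (point := pstat.read_point()) is not None:
    con.execute(
        "INSERT INTO data_ec (id_exp_ec, t_s, E_WE_V, I_A) VALUES (?, ?, ?, ?)",
        (id_exp_ec, *point),
    )
con.commit()
```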
By using a relational database, the selection of data can be constrained by the conditions an experiment has to fulfill to be suitable for an analysis routine. For instance, for the derivation of the ECSA, the selection of experiments can be restricted to CV-type experiments on Pt-containing working electrodes. The clearly defined data structure simplifies sharing and standardization of analysis routines across users.
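Such a conditional selection could, for instance, be expressed as a query on the metadata view from the schema sketch above (names remain hypothetical):

```python
# Sketch: select all experiments that qualify for the ECSA routine
# (CV technique on Pt-containing working electrodes).
import sqlite3
import pandas as pd

con = sqlite3.connect("example_lab.db")   # assumes the sketched schema and view exist here
query = """
    SELECT id_exp_ec, material, scan_rate_mV_s
    FROM exp_ec_cv_expanded
    WHERE technique = 'CV'
      AND material LIKE '%Pt%'
"""
df_exp = pd.read_sql(query, con)          # one row per matching experiment
print(df_exp)
```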
In performing the analysis, some research groups still rely on GUI-based data analysis software such as Origin or Excel. This is problematic in several regards:
• Such software stores a redundant copy of the raw data, and thus the link to the original file can be lost.
• Analysis steps are not recorded, for instance, the ambiguous adjustment of analysis parameters such as the potential boundaries for integrating the current to derive the ECSA-specific charge, or the manual exclusion of data points when fitting a Nyquist plot with an equivalent circuit.
• The required human interaction scales linearly with the amount of data to be analyzed, because the same steps are unnecessarily repeated over and over.
In contrast, when performing script-based data analysis, the link between analysis parameters, analyzed data, and raw data can be consistently established by integrating the linking into the scripted routine. Thus, data lineage and the reproducibility of the analysis procedure are guaranteed. By sharing the processing scripts among researchers, the time to develop these scripts is reduced and discussion on the quality of the data analysis procedure is fostered.
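As a deliberately simplified Python sketch of such a scripted routine, the function below derives the ECSA from the Hupd charge of a CV. It uses crude rectangular integration, omits the double-layer correction, and assumes the hypothetical column names used throughout these sketches; the integration limits are explicit parameters so that they can later be stored with the result.

```python
# Simplified sketch of an ECSA extraction from CV data (not the full routine).
import pandas as pd

def derive_ecsa(df_cv: pd.DataFrame,
                E_lower_V: float = 0.05,
                E_upper_V: float = 0.40,
                q_spec_uC_cm2: float = 210.0) -> dict:
    """Crude H_upd charge integration between explicit potential limits."""
    sel = df_cv[df_cv["E_WE_V"].between(E_lower_V, E_upper_V) & (df_cv["I_A"] < 0)]
    dt = sel["t_s"].diff().fillna(0.0)               # rectangular integration of I dt
    Q_Hupd_C = float((sel["I_A"] * dt).abs().sum())
    ecsa_cm2 = Q_Hupd_C / (q_spec_uC_cm2 * 1e-6)     # ~210 uC cm-2 for polycrystalline Pt
    return {"E_lower_limit_V": E_lower_V, "E_upper_limit_V": E_upper_V,
            "Q_Hupd_C": Q_Hupd_C, "ECSA_cm2": ecsa_cm2}
```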
For such data analysis procedures, the open-source programming language Python is well suited due to its active community46 and its range of libraries based on efficient array programming,47 which facilitate ready-to-use scientific data analysis procedures.46 Although most cutting-edge ML libraries such as XGBoost, TensorFlow, and Torch are coded in low-level, hence fast, programming languages (i.e., C++, Fortran), their Python wrappers are also actively maintained, usually by the official developers.55–58 While providing the extensive possibilities of a fully-featured programming language, Python is a good choice also for researchers with little programming experience thanks to its ease of use and community-backed comprehensive documentation and guides.
Nonetheless, a strict definition of an analysis procedure can fail for certain experiments depending on their complexity, for example through the unintended inclusion of outlying data points in a fitting procedure or the incorrect choice of an equivalent circuit model for impedance spectroscopy. User blindness can propagate such errors, leading to wrong conclusions in subsequent interpretations. Therefore, a quality control step is essential for every analysis procedure. For small amounts of data, this can be a manual step, like visually verifying the suitability of a model to fit the raw data. For larger amounts of data and simple processing steps, specific quality parameters can be defined, and experiments discarded from further analysis if these are not met. For instance, the goodness of fit might be evaluated by the adjusted R2 (R2adj) value. However, this value cannot differentiate between a statistical error (noisy signal) and a systematic error (wrong model); thus, care must be taken in defining these quality indicators. Alternatively, data quality can also be evaluated by ML models comparing results with existing experiments in the local relational database or from data repositories.59,60
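A minimal sketch of such an automatable quality gate based on the adjusted R2 value is given below; the threshold is an arbitrary example and would have to be chosen per analysis.

```python
# Sketch of a quality gate: discard experiments whose fit falls below an adjusted-R2 threshold.
import numpy as np

def r2_adjusted(y: np.ndarray, y_fit: np.ndarray, n_params: int) -> float:
    """R2_adj = 1 - (1 - R2) * (n - 1) / (n - p - 1)."""
    n = len(y)
    ss_res = np.sum((y - y_fit) ** 2)
    ss_tot = np.sum((y - np.mean(y)) ** 2)
    r2 = 1.0 - ss_res / ss_tot
    return 1.0 - (1.0 - r2) * (n - 1) / (n - n_params - 1)

def passes_quality_gate(y, y_fit, n_params, threshold=0.99) -> bool:
    return r2_adjusted(np.asarray(y), np.asarray(y_fit), n_params) >= threshold
```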
Result data of the analysis as well as its input parameters are stored in pre-defined tables in the database. To ensure data lineage tracking, these tables must (i) contain columns for every input parameter of the analysis, (ii) be linked to the original experiment, and (iii) be linked to the performed analysis script.61 In particular, if different versions of the analysis routine are developed, a column for the version of the applied script should be added.
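Building on the earlier sketches, storing a result together with its input parameters and the applied script version could look as follows (table, column, and script names are hypothetical):

```python
# Sketch: write analysis result, input parameters, and script version into the processed-data table.
import sqlite3

ANALYSIS_SCRIPT = ("derive_ecsa.py", "v0.3")   # hypothetical script identifier and version

def store_ecsa_result(con: sqlite3.Connection, id_exp_ec: int, result: dict) -> None:
    con.execute(
        """INSERT INTO ana_ecsa
           (id_exp_ec, E_lower_limit_V, E_upper_limit_V, Q_Hupd_C, ECSA_cm2,
            name_script, version_script)
           VALUES (?, ?, ?, ?, ?, ?, ?)""",
        (id_exp_ec, result["E_lower_limit_V"], result["E_upper_limit_V"],
         result["Q_Hupd_C"], result["ECSA_cm2"], *ANALYSIS_SCRIPT),
    )
    con.commit()
```

In this way, the lineage raw data → analysis parameters → result remains queryable alongside the experimental metadata.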
To illustrate the advantages of script-based data analysis procedures, examples are outlined in Fig. 4. For this purpose, we classify data analysis procedures into normalization, extraction, and correlation of data.
Fig. 4 Classification of data analysis procedures with increasing benefits from normalization to extraction and correlation of data when implemented in a database-integrated workflow. This classification is illustrated by electrochemical experiments on polycrystalline Pt. (a) Normalization of current response of a CV by the geometric surface area of the working electrode as well as referencing electrode potential to the reversible hydrogen electrode (RHE) system. (b) Derivation of the ECSA from CV via the Hupd charge. (c) Correlation of time-resolved electrochemical potential and dissolution of Pt as determined by ICP-MS.62 Experimental details can be found in the ESI in Section 3.1.†
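A possible sketch of the normalization step of Fig. 4(a) is shown below; the geometric area and the reference-electrode offset are taken from metadata, and the column names follow the earlier hypothetical schema.

```python
# Sketch: normalize current by the geometric area and shift the potential to the RHE scale.
import pandas as pd

def normalize_cv(df_cv: pd.DataFrame,
                 geometric_area_cm2: float,
                 E_ref_vs_RHE_V: float) -> pd.DataFrame:
    out = df_cv.copy()
    out["j_mA_cm2"] = out["I_A"] * 1e3 / geometric_area_cm2   # geometric current density
    out["E_RHE_V"] = out["E_WE_V"] + E_ref_vs_RHE_V           # potential vs. RHE
    return out
```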
In our labs, for instance, an electrochemical scanning flow cell (SFC) can be coupled to an ICP-MS. Electrochemical and mass spectrometric experiments are correlated by their timestamps and corrected for the delay time required to transport dissolved species to the downstream mass spectrometer. As dissolution depends on the absolute surface area, the mass flow rate of dissolved species is normalized by the ECSA extracted from the CV (see Fig. 4b). By this, dissolution peak onset, maximum, and shape can be correlated to the electrode potential. For instance, the electrochemical potential of peak dissolution during CV can be derived (see Fig. 4b and S3†). Thus, the electrochemical stability of energy-related materials can be examined in depth within minutes. Once the data workflow is established, the analysis of the data is likewise performed within minutes and can be reused for any experimental study. This workflow was successfully applied to stability studies of bipolar plate materials for proton-exchange membrane water electrolyzers.43
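A simplified sketch of this correlation step using pandas is given below; the delay time, ECSA value, and column names are illustrative, and the alignment matches the nearest timestamps after shifting the ICP-MS trace.

```python
# Sketch: correlate electrochemical and ICP-MS time series and normalize dissolution by the ECSA.
import pandas as pd

def correlate_ec_icpms(df_ec: pd.DataFrame, df_icpms: pd.DataFrame,
                       delay_s: float, ecsa_cm2: float) -> pd.DataFrame:
    icpms = df_icpms.copy()
    icpms["t_s"] = icpms["t_s"] - delay_s                 # correct for transport delay
    merged = pd.merge_asof(                               # match nearest timestamps
        df_ec.sort_values("t_s"), icpms.sort_values("t_s"),
        on="t_s", direction="nearest",
    )
    merged["dissolution_ng_s_cm2"] = merged["dissolution_ng_s"] / ecsa_cm2
    return merged
```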
Having all metadata available, any style elements of the displayed graph, such as legend labels, colors, line- or marker styles, can be defined by metadata values. Thus, creating a consistent color code for multiple graphs throughout a publication, based on, e.g., the material, is simplified.
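As a sketch of such metadata-driven styling with matplotlib, the snippet below assigns a fixed color per material and builds legend labels from metadata values (the mapping itself is hypothetical):

```python
# Sketch: derive plot style elements (color, label) from experiment metadata.
import matplotlib.pyplot as plt

COLOR_BY_MATERIAL = {"Pt": "tab:blue", "Ir": "tab:orange", "Ti": "tab:green"}

def plot_cvs(experiments):
    """experiments: iterable of (metadata_dict, dataframe) pairs."""
    fig, ax = plt.subplots()
    for meta, df in experiments:
        ax.plot(df["E_RHE_V"], df["j_mA_cm2"],
                color=COLOR_BY_MATERIAL.get(meta["material"], "grey"),
                label=f'{meta["material"]}, {meta["scan_rate_mV_s"]} mV s$^{{-1}}$')
    ax.set_xlabel("$E$ vs. RHE / V")
    ax.set_ylabel("$j$ / mA cm$^{-2}$")
    ax.legend()
    return fig
```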
Furthermore, having the metadata of all experiments in a tabulated form, experiments can be thoroughly compared and, in that way, any (unintended) differences in the experimental procedure can be retraced. For electrochemical experiments, for instance, the electrochemical history of the sample can be compared. With such a comparison, the accordance of experimental parameters between experiments is quickly verified. Thus, the reliability of the data and their interpretation can be improved.
To overcome the low comprehensibility of customized data structures uploaded to data repositories, there are services such as Binder, which open up the possibility to interactively explore figures, underlying data, and analysis routines of a published work.78 This service enables the execution of Python scripts by hosting Jupyter Notebooks uploaded to a Zenodo data repository. Therefore, it can be seamlessly combined with the data workflow presented here. To showcase its applicability, the data visualized in Fig. 4 is made available online, including raw data, experimental metadata, analyzed data and any analysis tools, routines to visualize the data, as well as the underlying Python scripts.79,80 By this, the data lineage from acquisition up to publication can be interactively explored without the need for additional software installation. While this approach is convenient to showcase the development of the data management tool presented here, it is advantageous for enabling FAIR access to research data in general.
This enables cross-platform access via a browser, outsourcing of data-intensive calculations to the server, and automatic backup of data and scripts. Technical limitations such as storage capacity and processing speed are considered. Additionally, maintenance of Python libraries is handled centrally, hence avoiding compatibility issues. This enables sharing and standardization of data analysis routines and visualization templates, which in our experience drastically reduces barriers for researchers with little or no expertise in programming. Building our tool on MySQL and Python/Jupyter relies on the software being open-source, free to use, and having a large community with extensive support opportunities, also via modern large language models. Thus, research groups with limited computer science capacities are also able to implement such a system for their experiments. For further guidance, an overview of our server infrastructure is illustrated in Fig. S6.†
While we have highlighted the applicability in the field of electrochemistry, especially for correlating other techniques with electrochemical experiments, this approach is generally applicable to other fields. Especially in light of the limited expertise on IT infrastructure at research institutes, the presented workflow offers, in our experience, a system that is simple to implement. At the same time, it remains flexible enough to be customized for specific needs. Once implemented, even users with little programming expertise can easily adapt their data management to profit from a comprehensive workflow.
Finally, such a data management tool is, on the one hand, a key element to enable automation in materials research laboratories and the building of high-throughput experimental setups, making ML methods applicable on a laboratory scale. On the other hand, increasing the amount and quality of openly published data will enhance big data analysis on an inter-laboratory scale.
Footnote
† Electronic supplementary information (ESI) available. See DOI: https://doi.org/10.1039/d3ta06247c
This journal is © The Royal Society of Chemistry 2024 |