How big is big data?
Abstract
Big data has ushered in a new wave of predictive power using machine-learning models. In this work, we assess what big means in the context of typical materials-science machine-learning problems. This concerns not only data volume, but also data quality and veracity, as well as infrastructure issues. With selected examples, we ask (i) how models generalize to similar datasets, (ii) how high-quality datasets can be gathered from heterogeneous sources, (iii) how the feature set and complexity of a model affect its expressivity, and (iv) what infrastructure requirements are needed to create larger datasets and to train models on them. In sum, we find that big data presents unique challenges along very different dimensions that should serve to motivate further work.
This article is part of the themed collection: Data-driven discovery in the chemical sciences.