Reconstructing Materials Tetrahedron: Challenges in Materials Information Extraction

The discovery of new materials has propelled human progress for centuries. The behaviour of a material is a function of its composition, structure, and properties, which further depend on its processing and testing conditions. Recent developments in deep learning and natural language processing have enabled information extraction at scale from published literature such as peer-reviewed publications, books, and patents. However, this information is spread across multiple formats, such as tables, text, and images, with little or no uniformity in reporting style, which gives rise to several machine learning challenges. Here, we discuss, quantify, and document these challenges in automated information extraction (IE) from materials science literature towards the creation of a large materials science knowledge base. Specifically, we focus on IE from text and tables and outline several challenges with examples. We hope the present work inspires researchers to address the challenges in a coherent fashion, providing a fillip to IE towards developing a materials knowledge base.


Introduction
Understanding a material's behavior requires knowledge about its composition, properties, processing and testing protocols, and microstructure, represented as the materials science (MatSci) tetrahedron (see Fig. 1). These different aspects of a material are reported by researchers in peer-reviewed publications, patents, and other scientific documents. Recently, there have been several attempts to exploit the advances in machine learning (ML) and artificial intelligence (AI) towards automated information extraction (IE) from literature [1][2][3][4]. These include the development of materials-specific language models [5][6][7][8], rule-based systems [9][10][11][12][13], IE from tables [8,14,15], and IE from images [16][17][18][19]. The widely varying styles in which information is expressed in research papers make automated MatSci IE a challenging task. Most of the works have focused on IE in a specific domain; hence, the transferability to different materials is not explored. Moreover, no consolidated work exists that explores the specific challenges associated with IE in MatSci and the gain associated with solving them, which would give researchers a clear direction regarding the areas that require increased attention.
We thoroughly review MatSci articles to identify IE challenges towards completing the materials tetrahedron (see Fig. 1). We also highlight some of the major challenges toward the development of a "universal" MatSci knowledge base linking the extracted information from multiple sources and forms of data: structured, semi-structured, and unstructured. Indeed, millions of scientific documents exist reporting information about various materials known to humans. Thus, the development of automated MatSci IE will lead to a rich knowledge base on materials. The outline of the paper is as follows. First, we explain the methodology for collecting papers for review and the annotation process. Then, in the results and discussion sections, we investigate the proportion of each of the entities, such as composition, structure, properties, processing, and testing conditions, reported in tables or text of the articles, followed by the challenges faced in their extraction. We quantify how frequently each challenge occurs so that researchers can gauge the amount of information that will be obtained after solving the respective challenge. We further identify the challenges in extracting and connecting the information from text and tables and among different tables belonging to the same MatSci research paper. Note that the challenges reported for extracting compositions from tables are verified against present IE models, and only those that are unaddressed or solved unsatisfactorily are reported in the main text, whereas some of the existing challenges that have been resolved satisfactorily are documented in the appendix. In our study, DiSCoMaT [8], recognized as the most effective IE model for this purpose [4], was employed for extracting compositions from tables. Concurrently, GPT-4 was utilized to extract compositions from textual content. For extracting properties from MatSci tables, we could not find any domain-specific IE model, but we believe that the challenges reported are valid for any IE model. We have also provided reasons and examples to elaborate on the same. Regarding IE from text to complete the materials tetrahedron, we have highlighted examples where existing IE models also make mistakes. Finally, we provide some guidelines for presenting machine- and human-friendly tables that enable automated MatSci IE from research papers.

Methodology
To elucidate the challenges, we referred to a dataset of 2536 peer-reviewed publications on MatSci. This dataset is taken from recent work on IE from tables [8], where the authors used distant supervision to annotate tables from research papers based on the respective compositions present in INTERGLAD [20]. The tables in the val and test data were annotated manually by indicating the relevant rows and columns that should be used to extract material compositions. Fig. 1 shows the sections of a paper where these different components are mostly reported. The statistics of each challenge were computed by randomly taking 50 or 100 tables from the manually annotated val and test datasets. In cases where this was not applicable, we further performed manual annotation on an additional 50 papers or 100 relevant tables selected randomly from the corpus. For instance, we randomly selected 100 composition tables from the manual annotation in the existing dataset for composition extraction. However, no such manual annotation was available for properties. For this problem, we selected 100 random property tables from the corpus and manually annotated the frequency of the challenges in property extraction. Note that all the challenges and their reported frequencies are based on manual annotation, which is more reliable than any ML-based technique, such as distant supervision. Further, we manually analyzed tables or text for the occurrence of each of the entities, such as composition, structure, and property. All the results and data associated with the annotation process are shared in the following link.

Results and Discussions
Figure 2 shows the percentage of papers reporting raw materials (precursors), compositions, properties, processing, and testing methods in text and tables. Note that the same information could be reported in both text and tables; hence, the percentages may add to more than 100. Although 78% and 74% of papers had compositions in text and tables, respectively, an in-depth analysis revealed that only 33.21% of the total compositions were reported in the text, whereas 85.92% of compositions were present in tables. The overlap exists due to the same composition being mentioned in both text and tables. 82% of articles report properties in tables (see Fig. 2). Processing and testing conditions are mostly reported in the text, while in 80% of articles, precursors are mentioned in the text. In the following sections, we discuss these aspects in detail.
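The size of the overlap between text and tables follows from inclusion-exclusion. The lines below are a minimal sketch, assuming that every counted composition appears in the text, in a table, or in both; the variable names are illustrative and not part of the original analysis.

```python
# Shares of the total compositions, as reported above.
text_share = 33.21   # % of compositions found in text
table_share = 85.92  # % of compositions found in tables

# Assumption: text and tables together cover 100% of the compositions,
# so the shared part is whatever exceeds 100%.
overlap = round(text_share + table_share - 100.0, 2)
```

Under this assumption, `overlap` evaluates to 19.13, that is, roughly one in five compositions is reported in both places.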

Composition extraction
Since the majority of the material compositions are reported in tables, we first discuss the challenges in extracting compositions from tables. This is followed by the discussion on IE from text.

Extracting compositions from tables
Here, we summarize the major challenges in composition extraction from tables. To this end, we investigated 100 randomly selected composition tables from the manually annotated data to report the frequency of occurrence of each challenge.
a. Variation in table structure and information content: An analysis of 100 random MatSci composition tables revealed that these tables do not follow any standard structure. Accordingly, following the schema proposed earlier by Gupta et al. [8], composition tables can be categorized into two broad categories: multi-cell composition (MCC) and single-cell composition (SCC). These are further subdivided into tables containing complete information (CI) and partial information (PI). When the entire composition is written inside a single cell, the table is classified as an SCC table, whereas when the composition is written across multiple cells by reporting the value of each constituent (compound or element) in a separate cell, it is an MCC table. If the table contains all the information regarding the constituents of the material, it is classified as a CI table. Alternatively, if only some of the constituents are mentioned in the table, it is a PI table. In the latter case, we need to extend the analysis to the text of the article to extract the full composition. Fig. 3 illustrates all four types of tables [21][22][23][24]. The most prevalent composition table types are MCC-CI (36%), followed by SCC-CI (30%). PI tables are less common, with 24% being MCC-PI and the remaining 10% being SCC-PI. Note that this distribution may vary significantly depending on the material type. For instance, it is common practice in alloys to skip the major element while describing the composition in a table. In previous work by Gupta et al. [8], while F1 scores of 78.21% and 65.41% were achieved for extraction from SCC-CI and MCC-CI tables, respectively, an F1 score of only 51.66% was achieved for extraction from MCC-PI tables. Although the researchers did not explicitly focus on SCC-PI, we used their best model for SCC-PI tables and obtained an F1 score of 47.19%. Hence, there is significant scope for improvement in extracting compositions from PI tables.
b. Presence of nominal and experimental compositions: While the nominal composition is the amount of chemicals taken initially to prepare the material, the analyzed/experimental composition refers to the actual composition obtained after analyzing the manufactured material (see Fig. 4(a)) [25,26]. Our analysis revealed that in 3% of the tables, both nominal and analyzed/experimental compositions are reported. These values are not reported in any fixed pattern, making it difficult to correctly separate the nominal and analyzed compositions after extraction.
c. Compositions and related info inferred from other documents: In some tables, the details of the glasses studied are not explicitly mentioned; rather, references to previous research publications that use the same material are provided in the tables or their captions (see Fig. 4(b)). Thus, the composition or other associated information that is missing in the current publication must be extracted from the cited work and then combined with the relevant information present in the current work. We found references about different entities of the material in 11 tables [27,28]. Four of the 11 tables do not explicitly mention compositions, due to which the IE model [8] was unsuccessful in obtaining the desired compositions.
d. Composition inferred from material ID: We observed that 10% of the total composition tables contain IDs with essential material composition information. In 60% of these tables, DiSCoMaT [8] failed to extract the compositions correctly. Most of these tables did not mention the materials' composition separately, thereby making the extraction challenging. For example, some of the materials have their compositions indicated within the IDs in an abbreviated form [29] and are not mentioned explicitly (see Fig. 5(a)). We also found tables where the composition of the materials is not specified; instead, their standard names are used as IDs. Such examples include Wollastonite or Diopside [30], which have a fixed chemical composition that can be obtained from standard sources/databases. In some cases, the composition was specified separately, but the IE model failed to extract the composition correctly due to its dependency on material IDs, as shown in Fig. 5(b). Here, the variable 'M' needs to be substituted by elements like 'W', 'Nb', or 'Pb', which must be inferred from the material IDs mentioned in the first column of the illustrated table [31].
Figure 5: (a) Table with composition mentioned as acronyms in ID (first column). (b) The value of variable 'M' needs to be inferred from the material IDs.
e. Variables used to represent compounds: When a composition is expressed with variables such as (70 − x)TeO2 + 15B2O3 + 15P2O5 + xLi2O, where x = 5, 10, 15, 20, 25 and 30 mol% [32], the variable mostly denotes the varying amount of different compounds. However, in some articles, variables have been used to represent compound names instead of their values. One such example is RE36Y20Al24Co20, where RE = Ce, Pr, Nd, Sm, Gd, Tb, Er, Sc [33]. This scenario is observed in 1% of the tables, where DiSCoMaT [8] fails to extract the material compositions. Note that this particular case can be solved using GPT-4. However, since DiSCoMaT performs better than GPT-4 in composition extraction from tables [4], and a pipeline of GPT-4 and DiSCoMaT is not feasible, this remains an open challenge.
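The two uses of a variable in item e can be contrasted with a short sketch. It is only an illustration of the distinction, not the procedure used by DiSCoMaT or GPT-4; the function and variable names are hypothetical.

```python
from typing import Dict, List


def expand_tellurite_series(x_values: List[float]) -> List[Dict[str, float]]:
    """Expand (70 - x)TeO2 + 15B2O3 + 15P2O5 + xLi2O for each x (mol%)."""
    compositions = []
    for x in x_values:
        comp = {"TeO2": 70 - x, "B2O3": 15, "P2O5": 15, "Li2O": x}
        # A complete composition expressed in mol% must sum to 100.
        assert abs(sum(comp.values()) - 100) < 1e-9
        compositions.append(comp)
    return compositions


# Case 1: the variable stands for an amount.
series = expand_tellurite_series([5, 10, 15, 20, 25, 30])

# Case 2: the variable stands for an element name (RE36Y20Al24Co20).
rare_earths = ["Ce", "Pr", "Nd", "Sm", "Gd", "Tb", "Er", "Sc"]
alloys = [{element: 36, "Y": 20, "Al": 24, "Co": 20} for element in rare_earths]
```

In the first case the keys of every composition are fixed and only the values change; in the second case the values are fixed and a key changes, which is why an extractor that only substitutes numbers cannot resolve it.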

Extracting compositions from text
Now, we discuss the challenges in extracting the compositions reported in the text of MatSci research papers. We report our statistical findings based on the frequency of each challenge. We also use GPT-4 to extract the compositions from text. The prompts given to GPT-4 for composition extraction are provided in Table 2. Specifically, we used the gpt4-1106 model through the OpenAI Python library. The temperature was set to 0.0 for reproducibility.

a. Different formats of compositions:
The compositions in materials literature do not adhere to a predetermined pattern and encompass several variations. This is in stark contrast to notations in chemistry, where IUPAC nomenclature is used. Some notable examples are as follows.
3. "The samples having chemical composition of 2(Ca,Sr,Ba)O-TiO2-2SiO2 were examined. CaO, SrO, and BaO contents in the samples were varied as shown in Table 1. RO% shows the molar percentage of CaO, SrO or BaO in total RO of CaO+SrO+BaO" [36].
Although GPT-4 understands the doping element, since the entire information is not present in the same sentence and the exact values of doping content are not specified, it does not extract the composition successfully.
Here, the x values representing the compositions and the respective variable names are present only in the text. Appendix 9.2.2(c) shows a few instances of other composition formats with variables. However, it may be noted that if full information is present in the sentences, GPT-4 is able to extract information correctly for the cases where the compositions are given in the form of variables.
c. Low recall in extracting compositions expressed with variables: 28% of the articles have compositions written with variables, of which 28.57% do not provide any values for the variables in the text. Among the 71.43% where values are present, 40% do not mention the step size for the range of values taken by the variable. For example, consider the following text from a manuscript representing a set of compositions: x(0.75AgI:0.25AgCl):(1-x)(Ag2O:WO3), where 0.1≤x≤1 in molar weight fraction [38]. The step size of 0.1 is mentioned nowhere in the text but can be inferred from the composition table in the paper. Therefore, extracting only from the text in such cases leads to more errors, which can be resolved by connecting the variables to the composition table containing the variable. GPT-4 substitutes only the endpoints into the composition. Lacking the values between these extremes, it does not extract the complete set of compositions.
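The effect of a missing step size on recall can be made concrete. The following is a minimal sketch with hypothetical names: it assumes the step of 0.1 is taken from the composition table, as described above, and uses exact fractions to avoid floating-point drift while stepping.

```python
from fractions import Fraction
from typing import List, Optional


def variable_values(lo: str, hi: str, step: Optional[str] = None) -> List[float]:
    """Enumerate values of a composition variable in [lo, hi].

    Without a step, only the two endpoints are recoverable.
    """
    lo_f, hi_f = Fraction(lo), Fraction(hi)
    if step is None:
        return [float(lo_f), float(hi_f)]
    step_f = Fraction(step)
    values = []
    current = lo_f
    while current <= hi_f:
        values.append(float(current))
        current += step_f
    return values


text_only = variable_values("0.1", "1")          # step unknown from the text
with_table = variable_values("0.1", "1", "0.1")  # step inferred from the table
recall = len(text_only) / len(with_table)
```

With the endpoints alone, two of the ten compositions in the series 0.1≤x≤1 are recovered, so `recall` is 0.2 for this example.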
d. Recognition of full forms and abbreviations: Instead of precise composition values, some articles give only the full chemical names of the materials. Consider the following example.
"Lithium disilicate glass was prepared in 30 g quantity by heating stoichiometric homogeneous mixtures of lithium carbonate (99.0%), Synth, and silica (99.9999%), Santa Rosa, for 4 h at 1500 °C in a platinum crucible." [39]. This text indirectly mentions the glass's composition as lithium disilicate without clearly mentioning the percentages or numbers. GPT-4 is able to infer the chemical formulas from chemical names but cannot infer the exact composition and its percentages from the sentence.
e. Unstable and irrelevant composition extraction: Unstable reagents and other irrelevant compositions that do not refer to the material are also identified as compositions due to a lack of robust parsers. AlO4 is an unstable entity referring to the aluminum tetrahedral structure, while SiO2 can be a composition. These undesired extractions can lead to a large drop in the precision of the IE model, and separating them from the material composition is not easy. Only a domain expert, with the help of the source article, can confirm whether an extraction is relevant or not. GPT-4 fails to differentiate compositions from unstable compounds.
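For the lithium disilicate example in item d, the composition that the sentence leaves implicit can be written out. Lithium disilicate is Li2Si2O5, that is, Li2O·2SiO2. The sketch below assumes a small hand-written lookup table (`NAMED_GLASSES` is hypothetical and would in practice come from a curated source).

```python
from typing import Dict

# Hypothetical lookup: material name -> molar ratios of its oxide constituents.
NAMED_GLASSES: Dict[str, Dict[str, int]] = {
    "lithium disilicate": {"Li2O": 1, "SiO2": 2},  # Li2Si2O5 = Li2O.2SiO2
}


def name_to_mol_percent(name: str) -> Dict[str, float]:
    """Convert a named stoichiometric material to mol% of its oxides."""
    ratios = NAMED_GLASSES[name.lower()]
    total = sum(ratios.values())
    return {oxide: round(100 * n / total, 2) for oxide, n in ratios.items()}


composition = name_to_mol_percent("Lithium disilicate")
```

Here `composition` is `{"Li2O": 33.33, "SiO2": 66.67}`. The mapping itself is trivial once the name is resolved; the difficulty noted above lies in recognizing that the name encodes a composition at all.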
It is worth noting that although GPT-4 can address some of these challenges, especially extraction from text, its closed nature makes it challenging to use at scale and for custom applications. Some of the reasons are:
1. Often, the research documents could be highly sensitive, preventing their sharing with commercial models such as GPT-4.
2. The inability of GPT-4 to be combined with smaller predictive models like DiSCoMaT prevents exploiting excellent domain-specific models that extract information very accurately.
3. The commercial nature of such models can make their use prohibitively expensive, given the large number of sentences to be analyzed in research papers and any additional prompt engineering involved.
Therefore, GPT-4 may not be an ideal baseline for IE at large scale from research publications.

Extracting compositions from table and text jointly
Extracting information from PI tables is more challenging than extracting from CI tables, as the incomplete information in the table regarding the composition must be inferred from the text. A detailed analysis of 50 PI tables revealed that 36% of the tables have unique challenges and are not "regular". To clarify this point further, we discuss some of these challenges below while also defining a "regular" MCC-PI table in Fig. 14. We have cross-checked all the reported challenges in this section using DiSCoMaT [8], the best IE model for composition extraction from MatSci tables [4], which also handles PI tables, and found that the model was unsuccessful in extracting compositions from tables having these characteristics.
a. Unusual variables used: Other than the common variables like x, X, y, z, and Z, we also encounter variables like R, A, Y and S in 4% of the manuscripts.Distinguishing some of them, such as S or Y, is difficult as they are valid symbols for chemical elements as well [40].
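The ambiguity between variable names and element symbols can be flagged with a simple membership check. This is a minimal sketch: `ELEMENT_SYMBOLS` is deliberately a small illustrative subset, not the full periodic table, and the function name is hypothetical.

```python
from typing import List, Set

# Illustrative subset of element symbols; a real check would use all of them.
ELEMENT_SYMBOLS: Set[str] = {
    "S", "Y", "W", "Nb", "Pb", "Li", "Si", "O", "Te", "B", "P",
}


def ambiguous_variables(declared_variables: List[str]) -> List[str]:
    """Return declared variables that are also valid element symbols."""
    return [v for v in declared_variables if v in ELEMENT_SYMBOLS]


flagged = ambiguous_variables(["x", "R", "A", "Y", "S"])
```

For the variables named above, `flagged` is `["Y", "S"]`. A check like this only detects the collision; deciding whether "Y" means yttrium or a variable still needs the surrounding context.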

b. Composition present across multiple columns:
The composition of the material is spread across multiple columns/rows (instance depicted in Fig. 6(a) [41]), or the table does not follow any fixed orientation.This is observed among 4% of the PI tables.

c. Composition partly in the table and partly in text:
Although PI tables contain the composition partly, it is expected that the complete information is available in the text.But in rare occurrences, as depicted by Fig. 6(b), we observe that only the remaining part of the composition, which is not mentioned in the table, is present in the text.This makes linking the parts of compositions in the text and tables challenging.Thus, extracting the whole composition is extremely difficult, a case seen in less than 1% of the PI tables [42].

d. Presence of multiple variables:
We found 6% of the PI tables having more than one variable, all of which need to be taken into account to extract the composition correctly.As discussed previously, variables can be of various forms, making extracting multiple variables a challenging task [43,44].

Extracting properties from tables
Until now, we focused on the extraction of compositions from tables and text. In this section, we discuss the challenges with property extraction. To this end, we analyzed 100 arbitrarily selected property tables. The observations based on this analysis are as follows.
a. Semantically similar row/column headers: 19% of the tables have similar abbreviations or headers with similar descriptions for different properties. For example, in Fig. 7(a), the headings of the columns are Tg, Tx1, Tx2, Tx3, ∆Tx, Tm [45]. Identifying the desired property can be difficult in this case for a predictor model or for someone without domain knowledge.
b. The same property measured under different conditions: The same property can be measured with different techniques or under different conditions. Therefore, it is important to extract the correct contextual information related to the reported property. Some recurrent scenarios include tables with refractive indices (RIs) at different wavelengths [46] (see Fig. 10), glass transition temperatures at different heating rates [47], or hardness at different testing loads. 9% of the property tables exhibit this challenge.
c. Information in caption/footer instead of tables: Often, properties are mentioned with abbreviations in the headings of tables, which are semantically close to other properties (for example, Fig. 7(a)). The information regarding their abbreviation is commonly found in the caption or footer of the table. We observed that 30% of the tables have this characteristic [48,49]. Further, 2% of the tables have no information on the units of the properties in the table body; these are instead found in the caption or footer of the tables [50]. Hence, text from these sections might be handy for extracting the desired properties.
d. Property recorded under various acronyms: It is a common practice to record property names with their abbreviations. Some properties have several abbreviations: density is represented with either ρ or d, Young's modulus with YM or E, and activation energy with E0, Ae, or Ea.
e. Identical acronyms representing different entities: We encountered tables (see Fig. 7(b,c)) where commonly used acronyms represent entities other than the property they usually denote. For example, 'n', which is mostly used to represent RI, is also used to represent equation parameters specific to the experiments. Another commonly seen instance is 'd', which is used to represent density [51] and has also been used to represent fractal bond connectivity [52], lattice parameters, and equation parameters. This suggests that using a string-matching IE algorithm can result in poor performance in such cases.
f. Range of values (min-max) given instead of mean value: In very few cases (<1% of tables), we encountered property values reported as a range rather than a single value. For example, the values of Tg are reported in the range 930-945 °C [53]. Only a domain expert would know which value to take for a corresponding property among the min, max, or mean of the documented values. This choice might depend on the property or the intended application, and must also be reflected in the IE algorithm.
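One way to defer the choice in item f is to store all three candidates when a cell holds a range. The sketch below uses a hypothetical `parse_range` helper and assumes the unit has already been separated from the cell value.

```python
import re
from typing import Dict, Optional

# Two non-negative numbers separated by a hyphen or an en dash (U+2013).
RANGE_PATTERN = re.compile(
    r"^\s*(\d+(?:\.\d+)?)\s*[-\u2013]\s*(\d+(?:\.\d+)?)\s*$"
)


def parse_range(cell: str) -> Optional[Dict[str, float]]:
    """Return min, max and mean for a 'low-high' cell, or None otherwise."""
    match = RANGE_PATTERN.match(cell)
    if match is None:
        return None
    low, high = float(match.group(1)), float(match.group(2))
    return {"min": low, "max": high, "mean": (low + high) / 2}


tg = parse_range("930-945")
```

For the Tg example, `tg` is `{"min": 930.0, "max": 945.0, "mean": 937.5}`, while a single-valued cell such as "937" yields `None`. Keeping all three values leaves the decision to the downstream user rather than fixing it inside the extractor.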

Challenges common for both composition and property extraction:
Thus far, we discussed the challenges faced during composition extraction in Section 3.1.1 and property extraction in Section 3.2 from tables. However, some challenges arise in both of these scenarios.
a. Same composition or property represented with different units: Tables are sometimes (2%) presented with the essential information recorded in multiple units in different columns/rows. Fig. 8 shows a composition table having composition in both mol% and wt% [54], and a property table having glass transition temperature (Tg) mentioned in both °C and K [55]. This can lead to duplication of the extracted data.
b. Multiple ways of reporting the same unit: Despite the well-known and accepted conventions for writing the SI units [56], research publications resort to multiple ways of reporting the same unit. For instance, for g/cm3, several variations are observed in peer-reviewed publications such as gm/cm3, g.cm−3, g/cm3, gcm−3, g/cc, gm/cc, gw/cm3, gm cc−1. Similar observations are made for kg/m3, where variations such as kgm−3, kg/m3, kg m−3 are presented. Extracting the correct unit and normalizing it to a standard form is an essential task. Thus, while there are standard rules for writing SI units, it is observed that these are not strictly followed in scientific publications.
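The normalization step mentioned in item b can be sketched as a lookup over the variants listed above. The canonical spellings `g/cm^3` and `kg/m^3` and the helper name are choices made for this illustration only.

```python
from typing import Optional

# Variants listed in the text, with whitespace removed and the Unicode
# minus sign (U+2212) replaced by an ASCII hyphen.
DENSITY_VARIANTS = {
    "gm/cm3", "g.cm-3", "g/cm3", "gcm-3", "g/cc", "gm/cc", "gw/cm3", "gmcc-1",
}
KG_VARIANTS = {"kgm-3", "kg/m3"}


def normalize_unit(raw: str) -> Optional[str]:
    """Map a reported density unit to a canonical spelling, if known."""
    key = "".join(raw.split()).replace("\u2212", "-")
    if key in DENSITY_VARIANTS:
        return "g/cm^3"
    if key in KG_VARIANTS:
        return "kg/m^3"
    return None  # unknown spelling: leave for manual review
```

For example, `normalize_unit("gm cc \u22121")` and `normalize_unit("g/cc")` both return `"g/cm^3"`. An explicit table is brittle, since every new spelling must be added by hand, which is precisely why consistent SI notation in the source tables matters.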

c. Multiple tables merged in one:
A rarely seen challenge (<1%) is illustrated in Fig. 9, where many tables are concatenated in a long or broad table, which leads to difficulties in extracting the required details [57].
Note that none of these challenges could be solved using the IE model DiSCoMaT [8] and GPT-4.

IE for manufacturing and characterizing materials
To identify the challenges in extracting precursors, processing and testing conditions, and material structure, we analyzed 50 arbitrarily selected papers from the dataset for reporting our findings.
a. Precursor extraction: A research paper generally investigates materials of a similar kind. Hence, it has to be assumed that all the materials are manufactured using the same precursors. In research papers where the batch composition is mentioned in tables, the challenges are similar to those mentioned in Section 3.1.1(b). In papers where researchers discuss patented materials, they refer to them by their trademark names, for example, Pyrex, BOROSIL, Gorilla, etc., and hence their precursor information is not provided. However, papers discussing materials reported in previous publications provide references to the papers that report the required information in detail.

b. Processing conditions extraction:
Processing conditions reporting could be extremely nonlinear and convoluted. Consider the set of sentences describing the processing conditions [58] as follows. ". . . powders were weighed and mixed thoroughly before being transferred to a 90 Pt/10 Rh crucible, heated at 320°C and maintained between 1000 and 1400°C depending on composition, for approximately 25 min. After annealing for approximately three hours, the glass was allowed to cool slowly to room temperature. . .". Hence, the challenges here are to extract temperatures and durations for each process, like heating, annealing, and cooling, along with the environmental conditions and experimental apparatus. Sometimes, these conditions are also mentioned in a table (see Fig. 10), and their extraction poses similar challenges as described in Section 3.2(b).
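To see why simple pattern matching falls short on this excerpt, consider a minimal sketch with two naive regular expressions. The patterns are illustrative assumptions, not a method used in this work.

```python
import re

# Excerpt quoted in the text above [58].
TEXT = (
    "powders were weighed and mixed thoroughly before being transferred to a "
    "90 Pt/10 Rh crucible, heated at 320°C and maintained between 1000 and "
    "1400°C depending on composition, for approximately 25 min. After "
    "annealing for approximately three hours, the glass was allowed to cool "
    "slowly to room temperature."
)

# Naive patterns: a number directly followed by a unit.
TEMPERATURE = re.compile(r"(\d+(?:\.\d+)?)\s*°C")
DURATION = re.compile(r"(\d+(?:\.\d+)?)\s*(min|hours?|h)\b")

temperatures = [float(t) for t in TEMPERATURE.findall(TEXT)]
durations = [(float(v), unit) for v, unit in DURATION.findall(TEXT)]
```

The patterns return `[320.0, 1400.0]` and `[(25.0, "min")]`. They miss the lower bound of the range (1000), the spelled-out "three hours", and the implicit "room temperature", and they do not say which value belongs to heating, melting, or annealing. Recovering those links is a relation-extraction problem rather than a matching problem.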

c. Testing conditions extraction:
The testing conditions mainly comprise the sample characteristics, dimensions, test name, instrument name, instrument settings, and testing variables like temperature, wavelength, load, frequency, pressure, etc. Consider the following excerpt from [59]: "The porous microstructure of the matrix was investigated by scanning electron microscopy (SEM) (JEOL JSM T330A), by infrared spectroscopy (IR) in a FT-IR spectrometer (Perkin Elmer Spectrum 2000), and by X-ray powder diffractometry (XRD) (Siemens D-5000). The phase separation process was investigated by Raman microscope. The room temperature Raman measurements were performed through Raman imaging microscope (Renishaw) system 3000, with the 632.8 nm He-Ne laser line for excitation". The boldface text indicates the information to be extracted for obtaining a complete understanding of the testing process of a material. Fig. 10 lists different wavelengths at which a material is tested to obtain the refractive index. The challenges faced in IE for this case will be similar to the ones posed in Section 3.2(b).
d. Material structure: To study the structure of materials, researchers perform X-ray diffraction studies and obtain Raman spectra, optical micrographs, and scanning electron micrographs, depending upon the depth of detail required about the material structure. This information is mostly reported in figures, and the figure description in the text provides some important details about the material's structure. In the statement, "The Raman spectrum of the porous phase (Fig. 6(b)) shows only one band at 277 cm−1 assigned to silica vibrations..." [59], the information about Raman spectra is already shown in the graph, and the text mentions only critical findings.
To summarise, the extraction of precursors, processing, and testing conditions from text poses challenges related to named entity recognition and relation extraction, which calls for specialized datasets and model development. There exist several materials science domain-specific models capable of extracting this information, but their performance (F1 score) on different types of desired entities ranges from as low as 33% [5] (interlayer materials for batteries, taken from the SOFC dataset [60]) to 93% [61] (materials tag, taken from the MatScholar dataset [62]). There also exist some knowledge graphs created using these tools, like MatKG [63]; however, the quality of the information in such sources is only as good as the underlying model. Further, on relation-extraction tasks, the best-performing models have an F1 score of 0.82 [64], indicating that significant effort is still required to facilitate information extraction and complete the materials science tetrahedron. Further, the extracted entities should be linked with the respective materials. IE from tables for processing and testing variables requires overcoming challenges similar to those explained earlier for composition (Section 3.1.1) and properties (Section 3.2).

MatSci Knowledge-base: Linking extracted information
The tetrahedron, as shown in Fig. 1, will be considered complete for a given material if its properties, processing, testing conditions, and the raw materials required to manufacture it are available. To this end, researchers need to link extracted compositions with these variables. This poses unique challenges, as it requires linking information among different entities within the paper, such as connecting different paragraphs of the paper, text with tables, or tables with other tables in the paper.
Material IDs are required to link information across multiple tables. For instance, in Fig. 11 [65], we obtain the composition of CAS1 from Table 1 and Tg of this material from another table (Table 2 in Fig. 11). Every material in an article should have a unique ID, which should be used consistently across the whole article to denote the corresponding material. Any exception to this will lead to difficulties in linking the extracted information. We detected 187 out of 2536 (7.37%) publications where inter-table IE is necessary and found difficulties in 81 of them while connecting the different components of the tetrahedron.
28.40%) documents having this challenge, where compositions of the materials and their corresponding properties are reported in separate tables, but neither of the tables has any ID denoting the material [68].
c. One of the tables does not contain material IDs: While connecting two tables, there are cases where IDs are mentioned only in one table [69] (37 out of 81 (45.67%) papers with this challenge).
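The role of material IDs in inter-table linking can be sketched as a join on an ID column. The sketch assumes each table has already been extracted into a list of rows; the ID "CAS1" comes from the example above, whereas "CAS2" and all numeric values are placeholders invented for illustration and are not taken from [65].

```python
from typing import Dict, List, Tuple

Row = Dict[str, object]


def link_by_id(
    compositions: List[Row], properties: List[Row]
) -> Tuple[List[Row], List[str]]:
    """Join two tables on the 'id' column; report IDs that cannot be linked."""
    props_by_id = {row["id"]: row for row in properties if row.get("id")}
    linked, unlinked = [], []
    for row in compositions:
        material_id = row.get("id")
        if material_id in props_by_id:
            linked.append({**row, **props_by_id[material_id]})
        else:
            unlinked.append(str(material_id))
    return linked, unlinked


# Placeholder values, for illustration only.
composition_table = [
    {"id": "CAS1", "CaO": 30, "Al2O3": 20, "SiO2": 50},
    {"id": "CAS2", "CaO": 40, "Al2O3": 10, "SiO2": 50},
]
property_table = [{"id": "CAS1", "Tg_K": 1100}]

linked, unlinked = link_by_id(composition_table, property_table)
```

Here CAS1 is linked to its property row, while CAS2 ends up in `unlinked` because the property table carries no row with that ID. When neither table has IDs, or only one does, the join key is missing altogether, which is the situation described in the cases above.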
As we observe that the material ID is a very important factor in connecting tables, we performed an intensive analysis of the types of IDs that are reported in the tables (see Table 1).

ID Analysis
As the material ID is the key component in connecting materials from tables to text, across two different tables, or across different sections of the text, we investigated 50 arbitrarily selected articles containing material IDs in the tables and recorded the semantic patterns used by authors to refer to materials. We found that a majority of the authors prefer to use acronyms or self-made codes as IDs for referring to the materials, followed by natural numbers and standard material names, as illustrated in Fig. 12. Material IDs are generally present at the beginning of the table and very rarely seen in the middle or at the end. Often, we come across tables having IDs that contain relevant information, like the processing conditions of the material, the state of the glass (amorphous or crystalline), or its composition, which is not separately mentioned in the table. This information is generally encoded as abbreviations, and extracting it can be challenging. In Table 1, we describe different cases along with their percentage occurrences. Note that the composition of the material should not be confused with its ID, as both are separate entities. An ID is expected to be shorter in length, most likely an acronym, and unique to each material.
Table 1: Different challenges in extracting information from material IDs and their occurrences.

Guidelines for writing IE-friendly MatSci tables
Tables should be reported in such a way that automated extraction and detection of the desired information are easy. Some of our suggestions are as follows (illustrated with Fig. 13, adapted from [70]):
a. Use column orientation: Many IE algorithms that have been developed for tables consider column orientation only. Also, we showed that 93% of the published tables are column-oriented. The following suggestions assume column orientation.
b. Use MCC-CI tables: Tables should have the components associated with a composition written in different cells. Moreover, the table should have the complete information of the material compositions (see Fig. 3).
c. Use proper and descriptive headers: The headers should contain the chemical formula of the compounds or elements that make up the materials, along with the acronyms of the reported properties and the processing and testing conditions. If precursors, processing, and testing conditions are common, they can be omitted from tables.
d. Use standard notations for units: Units should be mentioned in the column headers of the tables within brackets. Moreover, the standard notations for SI units should be used consistently.
e. All-in-one table: Prefer writing all the information of a particular material in a single table while following proper orientation. This avoids the need for inter-table extraction.
f. IDs are mandatory: Material IDs are important to identify different materials mentioned in the tables and link them across tables and text. IDs should be mandatory for tables and written in the first column.
g. Consistent IDs: Material IDs should be formed as an acronym of the material's constituents. They should be consistent in the whole article, that is, there should not be more than one ID referring to the same material.
h.
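Some of these guidelines can be checked mechanically. The following is a minimal sketch, assuming a column-oriented table given as a header list plus row lists; the function name `check_table`, the `ID` column label, and the example values are hypothetical and only cover guidelines d and f and the uniqueness of IDs.

```python
import re

# A header ends with a unit in round or square brackets, e.g. "Tg (K)".
UNIT_IN_BRACKETS = re.compile(r"[\(\[][^\)\]]+[\)\]]\s*$")


def check_table(header, rows, id_column="ID"):
    """Return a list of guideline violations for a column-oriented table."""
    problems = []
    # Guideline f: IDs are mandatory and written in the first column.
    if not header or header[0] != id_column:
        problems.append("first column is not the material ID")
    else:
        ids = [str(row[0]).strip() for row in rows]
        if any(not material_id for material_id in ids):
            problems.append("empty material ID")
        # Each ID should be unique to one material.
        if len(set(ids)) != len(ids):
            problems.append("duplicate material IDs")
    # Guideline d: units appear in brackets in the column headers.
    for name in header:
        if name == id_column:
            continue
        if not UNIT_IN_BRACKETS.search(name):
            problems.append(f"no unit in header: {name!r}")
    return problems


# Hypothetical table that follows the guidelines.
good_header = ["ID", "SiO2 (mol%)", "Na2O (mol%)", "Tg (K)"]
good_rows = [["SN1", 80, 20, 750], ["SN2", 70, 30, 720]]
print(check_table(good_header, good_rows))
```

A table that passes these checks is not guaranteed to be easy to extract; the sketch only flags the violations that are simple to detect from the header and the ID column.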

Conclusion and future work
The literature is replete with IE challenges and algorithms to extract information about materials. However, there exists no study that quantifies how much benefit can be obtained if a particular challenge is solved. In this paper, we have identified and quantified several unresolved challenges present in IE for every aspect of the MatSci tetrahedron. Specifically, we pointed out the locations in a MatSci research paper where each piece of information on the MatSci tetrahedron of a given material is reported. Further, we outlined the challenges associated with IE and linking them to build the MatSci KB. We hope this extensive analysis will motivate researchers to focus on the challenges in the field, giving an insight into the gain associated with each of these challenges. This will also enable researchers to identify the right problems to focus on based on the desired outcome. Finally, we provided recommendations for an IE-friendly table format to enhance the automated extraction of the desired information and improve the researchers' tabular understanding. Such concerted efforts are required to streamline the reporting in MatSci articles, thereby accelerating IE for materials discovery.

Conflicts of interest
There are no conflicts to declare.

Appendix
In this section, we will address some more notable challenges, most of which have been solved satisfactorily by IE models.
The details of all the research papers used in this study, along with annotations to identify the challenges, are available at https://github.com/M3RG-IITD/MatSci-IE-Challanges

Common Challenges faced during information extraction from tables
We begin by discussing the problems encountered for all-encompassing IE tasks. Challenge a has been resolved in [8], while challenge b has been addressed by [8,71].
a. Distractor rows or columns: Additional contents in the table that are irrelevant to our desired information.
b. Different orientations of tables: Each table can have either of two orientations, row or column, and recognizing which one is essential for extracting information precisely. We examined 100 random composition tables and 100 random property tables and observed that 7% of the tables are row-oriented (see Fig. 10), whereas 93% are column-oriented (see Fig. 14).

We start by illustrating a typical MCC-PI table [32] without any challenges in Figure 14 for the reader's convenience.
a. One composition with multiple units: Consider the following example composition: 0.85TeO2+0.15WO3+0.1wt%Ag2O+0.076wt%CeO2 [72]. Here, for a given material, different components are measured in different units (mol% and wt%). This is found in 2% of the tables, making composition extraction challenging.
b. Composition in table headers: Most tabular IE models, such as Tabbie [73] and DiSCoMaT [8], perform better when row/column headers contain appropriate information regarding their contents. In MatSci tables, the headers are mostly material IDs, compound names, properties, and processing and testing labels, and the inner cells contain the corresponding values. However, in 6% of the tables, we found that the compounds were present in the heading together with their values, which makes it hard for IE models to extract the desired information. For instance, Se58Ge33Pb9 [74] or x = 10%, x = 20%, ... [75] are column headers that contain both the compounds and the corresponding concentrations. Of these tables, 67% were SCC-CI, whereas the remaining 33% were MCC-PI.
c. Composition expressed with different units in various articles: such as mol%, weight%, atomic%, mol fraction, weight fraction, atomic fraction, and ppm.Among them, the most commonly used unit is mol%, followed by weight%.
d. Percentage not equal to 100: In some papers, even after extracting the whole composition correctly, we observe that the sum of the chemical component concentrations is not equal to 100. Conversely, we also see cases where the composition is extracted incorrectly and the sum is nevertheless equal to 100. In the case of doping in particular, the sum exceeds 100, which is correct. The challenge is to identify where the extracted values need to be normalized and where they should not be. We noted that a dopant is reported in 2% of the composition tables.
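To make the mixed-unit case concrete, the sketch below parses the example composition from item a into (compound, amount, unit) terms and totals the amounts per unit. It is an illustration under simplifying assumptions, not an existing extraction model: the names `parse_terms` and `totals_by_unit` are hypothetical, terms are assumed to be joined by `+`, and a term without an explicit unit is labeled `unspecified` rather than guessed.

```python
import re

# One term: a number, an optional unit, and a compound formula.
TERM = re.compile(
    r"^\s*(\d+(?:\.\d+)?)\s*(wt%|mol%|at%)?\s*([A-Za-z][A-Za-z0-9]*)\s*$"
)


def parse_terms(expression, default_unit="unspecified"):
    """Split a '+'-joined composition into (compound, amount, unit) tuples."""
    terms = []
    for chunk in expression.split("+"):
        match = TERM.match(chunk)
        if match is None:
            raise ValueError(f"cannot parse term: {chunk!r}")
        amount, unit, compound = match.groups()
        terms.append((compound, float(amount), unit or default_unit))
    return terms


def totals_by_unit(terms):
    """Sum the amounts separately for each unit that occurs."""
    totals = {}
    for _, amount, unit in terms:
        totals[unit] = totals.get(unit, 0.0) + amount
    return totals


terms = parse_terms("0.85TeO2+0.15WO3+0.1wt%Ag2O+0.076wt%CeO2")
units = {unit for _, _, unit in terms}
print(terms)
print("mixed units:", len(units) > 1)
print(totals_by_unit(terms))
```

Keeping the totals separate per unit is a deliberate choice here: the two unit-less terms form a complete base composition on their own, while the wt% terms are additions on top of it, so summing everything into one number would hide exactly the situation described in item d.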

From text
Both a and b are unsolved. In challenge b, we do not know whether the extracted composition needs to be normalized or has only been partially extracted. Normalization is not a challenge after correct extraction, as there are existing works on it [8], but currently no work has been done on extracting the composition completely if it is not fully mentioned in the text.
a. Unit not mentioned: 39.53% of the compositions had no unit specified explicitly.
3. Glasses with composition in mol%: 51ZrF4, 16BaF2, 5LaF3, 3AlF3, 20LiF, 5PbF2 have been prepared by melting of the powders (commercial raw materials of purity higher than 99.99%) in a covered vitreous carbon crucible at about 850 °C for 45 min in a dry argon glove box with a water content lower than 5 ppm. The melt was poured into a preheated copper mould at 240 °C and slowly cooled down to room temperature. The doping ion was added in excess to the formula +xErF3 from 0.01 to 11 mol% corresponding to 0.02 to 22 × 10^20 Er3+ ions/cm^3. The samples obtained were of good optical quality [78].
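The quoted glass illustrates why a sum above 100 can be correct: the base composition already sums to 100 mol% and the dopant is added in excess. A minimal sketch of the corresponding renormalization follows; the function name `add_dopant_in_excess` is hypothetical, and the dopant amount of 11 mol% is simply the upper end of the range quoted above.

```python
def add_dopant_in_excess(base, dopant, amount):
    """Add a dopant on top of a base composition and rescale the result to 100."""
    combined = dict(base)
    combined[dopant] = combined.get(dopant, 0.0) + amount
    total = sum(combined.values())
    return {compound: 100.0 * value / total for compound, value in combined.items()}


# Base glass from the quoted passage, in mol%.
base_glass = {"ZrF4": 51, "BaF2": 16, "LaF3": 5, "AlF3": 3, "LiF": 20, "PbF2": 5}
print(sum(base_glass.values()))  # the base alone sums to 100

doped = add_dopant_in_excess(base_glass, "ErF3", 11)
for compound, value in doped.items():
    print(f"{compound}: {value:.2f}")
```

Before renormalization the listed amounts sum to 111, which an extraction pipeline should accept as a doped composition and not treat as an extraction error.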

From table and text jointly:
Figure 15: Variable 'x' is not in table.**
a. Variables representing composition in text not found in tables: When the text gives the composition as an arithmetic expression containing variables, a generic way of extracting it is to connect those variables to the ones located in the headings of the table and substitute the values listed under them. Extraction becomes difficult if the variable is absent from the table or its name differs between the table and the text, as shown in Fig. 15. We found that 8% of the tables pose this challenge [79]. This challenge has been resolved in the IE model proposed by Gupta et al. [8].
** Please note that we have mistakenly added Fig. 14 as Fig. 15 in our Digital Discovery publication, although we have cited the correct source in the text. We sincerely apologise for this mistake.
GPT-4 is able to extract information correctly for the cases where the compositions are given in the form of variables.

Prompt, GPT-4 Response, Conclusion
"Extract all the compositions from the following expression.
GPT-4 understands the doping element, but since the entire information is not present in the same sentence and the exact values of the doping content are not specified, it is able to give only partial information.

Figure 1: Quantifying challenges in information extraction from different elements of a research paper such as text, tables, and figures.

Figure 2: Occurrence of information regarding precursors (raw materials), compositions, properties, processing, and testing conditions in MatSci papers.

Figure 3: Classification of composition tables in single-cell composition (SCC) and multi-cell composition (MCC) with complete information (CI) and partial information (PI).

Figure 4: Example of tables: (a) mentioning nominal (batch) and analyzed composition, (b) having references to other papers.

Figure 6: (a) Composition across multiple columns. (b) Partial composition in the table, rest in the text.

Figure 8: (a) The same glass composition mentioned in both mol% and wt%. (b) The same property of a material is mentioned multiple times with different units.

Figure 9: Multiple tables concatenated to form a larger table.

Figure 10: Challenges related to extraction of processing condition (heat treatment time) and property (refractive index) reported under various testing conditions (wavelength).

Figure 11: Composition and properties of the same material are mentioned in different tables within the article.

Figure 12: Writing styles of IDs in MatSci articles.
Challenges in IE from material IDs                      % of occurrence
Composition info/doping conc. present only in IDs       20
IDs present in the middle                               2
Multiple IDs present for the same composition           4
State or structural info in ID                          2
Info or references about the processing conditions      8
Same IDs but different composition                      4
The article contains IDs interconnected                 2
Taken from other articles                               6

9.2 Other challenges faced in composition extraction
9.2.1 From tables:

b. Percentages not summing to 100: Out of the 78% of compositions found in the text, 17.94% did not have the values of the chemical compounds summing to 100.
c. Different formats of compositions with variables: A few instances of different formats of compositions expressed in variables are:
1. The non-isothermal crystallization kinetics of xLi2S-(1-x)Sb2S3, x = 0-0.17 were investigated using differential scanning calorimetry (DSC) [76].
2. To ascertain the effect of the glass composition on fluorescence parameters around 1.86 µm, we prepared and experimented on two series of glasses. The first one was aR(I)2O·(1-a)TeO2 where 'a' was 0, 10, 15, 20, 30 mol%, and 'R(I)' was Li, Na, K. The second one was bR(II)O·cR(III)2O3·(1-b-c)TeO2 where 'b' was 0, 10, 20, 30 mol%, and 'c' was 0.5% or 16.5%, and 'R(II)' = Ba, 'R(III)' = Al, Ga, or In. To find the effect of concentration quenching, the concentration of thulium oxide was varied from 0.01 to 5.0 mol% [77].
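A composition written with a variable, such as xLi2S-(1-x)Sb2S3 in the first example above, only becomes a concrete composition once a value is substituted for the variable. The following is a minimal sketch of that substitution, not the approach of any cited model: the names `coefficient` and `expand` are hypothetical, and the parser only handles coefficients that are a single lowercase variable or a parenthesized difference such as (1-x) or (1-b-c).

```python
import re

# A term is a coefficient (one lowercase variable or a parenthesized
# expression) followed by a compound formula starting with a capital letter.
TERM = re.compile(r"(\([^)]*\)|[a-z])\s*([A-Z][A-Za-z0-9]*)")


def coefficient(expr, variables):
    """Evaluate 'x', '(1-x)' or '(1-b-c)' using the given variable values."""
    tokens = [token.strip() for token in expr.strip("()").split("-")]
    # float() raises ValueError for a variable that has no supplied value.
    values = [variables[t] if t in variables else float(t) for t in tokens]
    return values[0] - sum(values[1:])


def expand(template, variables):
    """Map each compound in the template to its numeric coefficient."""
    return {
        compound: coefficient(expr, variables)
        for expr, compound in TERM.findall(template)
    }


# x = 0.17 is the upper end of the range quoted in the first example.
print(expand("xLi2S-(1-x)Sb2S3", {"x": 0.17}))
```

When the variable named in the text has no counterpart among the supplied values, `coefficient` raises a `ValueError`, which corresponds to the case where the variable in the text is missing from, or named differently in, the table.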