Gilles Niel*a, Fabrice Boyrie b and David Virieux a
aInstitut Charles Gerhardt Montpellier (ICGM), UMR 5253 CNRS-UM-ENSCM, Ecole Nationale Supérieure de Chimie, 8, rue de l'Ecole Normale, 34296 Montpellier, France. E-mail: gilles.niel@enscm.fr
bInstitut Charles Gerhardt Montpellier (ICGM), UMR 5253 CNRS-UM-ENSCM, Université de Montpellier, Place E. Bataillon, 34095 Montpellier, France
First published on 1st September 2015
A comparative study of the three main chemical information systems (Scifinder, Web of Science and Scopus) was performed by studying the indexing policies of titles, abstracts and keywords within selected literature articles. Various chemical expressions were introduced as topic searches to illustrate the different search tools related to term indexing. The resulting article lists were compared two-by-two by means of a script designed to identify common reference lists and specific ones to each editor. Analyzing these specific reference lists reveals that only partial coverage areas of references should be expected when querying a single platform. The discussion covers the term and keyword indexing policies, their influence on the retrievability of references and on the retrievability of the highly cited papers.
Term indexing has received much attention for many years from the information systems compared herein. The CAS has indexed journal articles, among other document types, since the beginning of the twentieth century in a highly hierarchical way. The bibliographic CAplus database currently contains more than 40 million records covering a wide range of chemical domains including biochemistry, organic, macromolecular and applied chemistry as well as inorganic, analytical and physical chemistry.6 The CAS's coverage comes close to 10000 titles, among which 1700 key journals form a core journal list.7 From the outset the CAS's indexing policy has been document-oriented: the CAplus database provides, to a large extent, indexed terms from titles, abstracts and author keywords. In its current version, supplementary information using a hierarchical set of controlled terms is also provided.8 At the top level of the hierarchy a reference is first associated with one of the 80 CAS sections and then indexing is divided into three main categories: concepts, substance related information and supplementary indexing terms.9 The concept category contains one or several subject headings at the first level and then terms or text-modifying phrases at the second level, both levels constituting the controlled vocabulary. Supplementary terms are keywords added by the editor that may either differ from controlled terms or be excerpted from author keywords. The substance related information is categorized in a similar way, i.e. the first level displays substance identifiers such as the Registry Number and the common chemical names linked with the official chemical name. The second level consists of index terms excerpted from the controlled vocabulary and also from CAS-specific terms such as substance roles.10 Thus this powerful indexing relies on both the CAplus and Registry11 databases, enabling the user to retrieve a large reference set while using a text-only querying language.12 Moreover Scifinder, the CAS's web interface, enables reference searching from both the CAplus and MEDLINE13 databases, the indexing of the latter relying on the National Library of Medicine's controlled vocabulary thesaurus, named Medical Subject Headings (MeSH).14
Among the WoS databases, the Science Citation Index Expanded™ (SCIE) gives access to more than 40 million records from a large range of scientific domains.15 The 8500 indexed journals cover a larger set of scientific domains divided into 182 categories related to mathematics, physics, chemistry, biology, medicine, engineering, etc.16 Besides the title, abstract and author keyword fields, WoS provides supplementary information, gathered in the Keywords Plus® field.17 This information results from an algorithmic process that excerpts terms appearing at least twice in the titles of the cited references of a processed article.18
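The Keywords Plus® rule quoted above (terms occurring at least twice in the titles of the cited references) can be illustrated with a minimal Python sketch. It is not the WoS algorithm: the tokenisation, the stop list and the function name `keywords_plus_sketch` are assumptions made only for illustration.

```python
import re
from collections import Counter

# Hypothetical stop list; the text does not describe how WoS filters common words.
STOPWORDS = {"a", "an", "and", "by", "for", "in", "of", "on", "the", "to", "with"}


def keywords_plus_sketch(cited_titles, min_count=2):
    """Return the terms that occur at least `min_count` times in the cited titles."""
    counts = Counter()
    for title in cited_titles:
        # Split on anything that is not a letter, so hyphenated words are separated.
        for term in re.findall(r"[a-z]+", title.lower()):
            if term not in STOPWORDS:
                counts[term] += 1
    return sorted(term for term, n in counts.items() if n >= min_count)


cited = [
    "Asymmetric organocatalysis with proline",
    "Proline-catalyzed aldol reactions",
    "Enantioselective organocatalysis",
]
print(keywords_plus_sketch(cited))
```

With these three invented titles, only 'organocatalysis' and 'proline' reach the threshold of two occurrences, which shows how a term absent from an article's own title or abstract can still become searchable.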
SciVerse Scopus19 indexes more than 21000 titles in all scientific topics classified into four domains: social, physical, life and health sciences.20 The latter two domains are especially well represented and the total record count now comes close to 50 million. The term indexing policy includes titles, abstracts, author keywords as well as matched terms. These matched terms include chemical names, CAS Registry Numbers, trade names, manufacturer names and index keywords. These index keywords form the hierarchically controlled vocabulary gathered in several thesauri such as the Compendex index,21 EMTREE index,22 MeSH, Species index, and GeoBase subject index.23 This list is non-exhaustive but refers to the main indexes concerned by this comparative study.
A second important factor concerns the query language and the related query tools. The CAS introduced gradually the use of a natural language to process queries by developing a computed generation of index entries from natural language phrases.24–26 In recent years this led to the natural language query (NLQ) system, an algorithmic process that breaks down phrases into concepts.27 Different instructions of the process were first described by J. Williams28 then thoroughly analyzed by A. Ben Wagner.29 The last step of the algorithmic process consists in truncating any remaining term that is not parsed in a prior instruction, thus the term ‘organocatalysis’ will furnish references containing the terms: organocatalysis, organocatalyst(s), organocatalytic, and organocatalys(z)ed. The main characteristics of the NLQ system lie in avoiding: (i) the use of Boolean operators that are interpreted like prepositions,30 (ii) the use of proximity operators, and (iii) any knowledge about specific field searches. Prepositions are only used to break down phrases into simpler concepts. The NLQ process enables the end-user to focus on the scientific content owing to an easy-to-use topic search interface that may appear simpler at the outset by comparison to those of WoS or of Scopus. Both latter editors provide either basic or advanced search modes that enable searches on specific fields. WoS and Scopus provide a more classic use of Boolean operators including proximity operators thus giving the searcher a higher precision on the queried expressions. In advanced query mode, many different search fields of WoS and Scopus are searchable using a quite simple syntax based on field codes.
To assess the influence of factors such as term indexing and journal coverage, we selected single terms or short expressions intended to be representative of different chemical domains such as organic and inorganic chemistry, analytical and physical chemistry, chemistry related to energy and fuels or materials science, biochemistry and molecular biology, and biotechnology and biochemical research methods (see ESI,† Table S1). All selected terms and expressions were submitted to the query interfaces of Scifinder, WoS and Scopus and the resulting hit sets were thoroughly analyzed.
Table 1 displays the Scifinder-specific queries corresponding to the selected terms and expressions and then the whole filtering process leading to the selection of unique articles. Column 2 displays the queried terms as they were typed in the Research Topic form of Scifinder's interface and column 3 specifies which candidate list was chosen at the next step unless otherwise noted. Filtering by year and language leads to crude hit counts (column 4). Column 5 displays the hit counts after combining answer sets when required. The citation column 6 refers to all citations after automatic removal of duplicates from the CAplus and Medline databases while column 7 corresponds to article counts after the selection of document types such as journal articles, book chapters, conference papers, notes, letters and reviews. In some entries, discarding patents from this study causes a sharp decrease between the citation column and the article column. Other document types, i.e. meeting abstracts, errata and corrections, were discarded from citation lists by means of a script, named Iddup, that will be described below. Unique articles in column 8 result from parsing each reference list with this script so that no list contains any duplicate reference. The differences between the reference counts of columns 7 and 8 result from incomplete duplicate removal between CAplus and Medline and from some errata that could not be filtered during the document type selection.
Entry | Queried expression | Candidate | Hit countsa | Combine answer setsb | Citationsc | Articlesd | Unique articlese |
---|---|---|---|---|---|---|---|
a Crude hit counts after filtering by year and language. b After combining answer sets when required. c After automatic removal of duplicates from the CAplus and Medline databases. d Selection of some document types. e After parsing by the Iddup script. | |||||||
1 | Allene | The concept “allene” | 366 | 917 | 771 | 630 | 585 |
2 | Allenes | The concept “allenes” | 879 | ||||
3 | Organocatalysis | The concept “organocatalysis” | 1034 | — | 827 | 667 | 603 |
4 | Peptidomimetics | The concept “peptidomimetics” | 726 | — | 610 | 314 | 305 |
5 | Agostic interactions | The concept “agostic interactions” | 75 | — | 59 | 53 | 51 |
6 | Battery electrodes | The concept “battery electrodes” | 1554 | — | 1493 | 807 | 806 |
7 | Graphene biosensors | The concept “graphene biosensors” | 119 | — | 95 | 89 | 87 |
8 | N-Heterocyclic carbene | “N-Heterocyclic carbene” as entered | 668 | 793 | 671 | 508 | 450 |
9 | N-Heterocyclic carbenes | “N-Heterocyclic carbenes” as entered | 231 | ||||
10 | Modified nucleoside | The concept “modified nucleoside” | 199 | 213 | 153 | 106 | 98 |
11 | Modified nucleosides | The concept “modified nucleosides” | 213 | ||||
12 | Phosphine ligand | “Phosphine ligand” as entered | 190 | 378 | 330 | 274 | 253 |
13 | Phosphine ligands | “Phosphine ligands” as entered | 225 | ||||
14 | Renewable feedstock | The concept “renewable feedstock” | 211 | — | 187 | 113 | 113 |
15 | Copper (Cu) catalyzed arylation | See text | 76 | — | 65 | 53 | 51 |
16 | Hybrid materials and nanoparticles | See text | 266 | — | 217 | 179 | 177 |
17 | Viscosity of ionic liquids | The two concepts “viscosity” and “ionic liquids” were present anywhere in the reference | 509 | — | 429 | 343 | 335 |
18 | Band gap in solar cells | The two concepts “band gap” and “solar cells” closely associated with one another | 554 | — | 526 | 431 | 430 |
19 | Statistical analyses of DNA microarrays | References were found where the two concepts “statistical analyses” and “DNA microarrays” were present anywhere in the reference | 154 | — | 150 | 98 | 93 |
20 | Surface area in mesoporous materials | The two concepts “surface area” and “mesoporous materials” closely associated with one another | 372 | — | 346 | 296 | 294 |
As pointed out by Ben Wagner, singular form vs. plural form queries in Scifinder may lead to somewhat different results. Therefore we tested each term or expression under both forms.29 In most cases the hit counts are equal except for entries 1, 2, 8, 9, 10, 11, 12 and 13. For example the answer lists corresponding to ‘allene’ (entry 1) and ‘allenes’ (entry 2) contain references where the queried term was found as a concept. Combining the two answer lists (917 hits) and then removing duplicates (771 citations) furnishes 630 journal articles. The expression ‘N-heterocyclic carbene’ (entry 8) leads to a greater hit count (668 hits) than the corresponding plural form (231 hits). Processing similar to that of entries 1 and 2 led to 671 citations and 508 journal articles. This emphasizes that in such cases both singular and plural forms need to be searched. With respect to the expressions ‘modified nucleoside’ and ‘modified nucleosides’ (entries 10 and 11), the largest list is found to contain the smallest one when the two are combined. Because the references corresponding to the terms and expressions of entries 1, 2, 10 and 11 were selected through a concept search, the process of combining answer lists may be simplified by typing the singular and the plural forms within the same search and by using one of these forms within brackets. However this trick is not valid if the references corresponding to an expression are found containing this expression ‘as entered’, as in the cases of entries 8, 9, 12 and 13. Finally the term ‘material’ (entries 16 and 20) was searched as a concept on both databases under singular vs. plural forms; the resulting hit counts differed by less than 0.05%.
The results for the expression in entry 15 deserve a specific explanation. We initially performed this search by selecting the expression ‘copper (Cu) catalyzed arylation’ found as a concept, which led to 375 hits. This high value is mostly due to the frequent occurrence of the term ‘aryl’ resulting from the truncating step of the NLQ process. In order to retrieve only answers chemically relevant to the arylation concept, we ruled out the term ‘aryl’ by building the query as follows: (i) references were found containing ‘copper catalyzed’ as entered (869 hits), (ii) references were found containing ‘Cu catalyzed’ as entered (254 hits), and (iii) the two answer sets were combined (1044 hits). In parallel a reference list was found containing ‘arylation’ as entered (924 hits) and this latter hit set was intersected with the previously obtained 1044 hit set, thus furnishing a final list of 76 hits. For entry 16 a similar process was set up in order to bring the terms ‘hybrid’ and ‘materials’ closer to each other. This query was built following the sequence: (i) references were found containing ‘hybrid material’ as entered (470 hits), (ii) references were found containing ‘hybrid materials’ as entered (780 hits), and (iii) the two answer sets were combined (1105 hits). In parallel a third reference set was found containing the concept ‘nanoparticles’ (39914 hits) and this latter set was intersected with the 1105 hit set, providing 266 hits as a final result. Because all queries were performed in March, April and May 2013, the hit counts may vary slightly if performed now.
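The combine-then-intersect sequence described for entries 15 and 16 amounts to elementary set algebra. The sketch below is only an illustration with invented reference identifiers; it assumes that "combining" two answer sets means taking their union with shared references counted once, which the text implies but does not state formally.

```python
def combine(*answer_sets):
    """Union of answer sets; a reference present in several sets is kept once."""
    result = set()
    for answers in answer_sets:
        result |= answers
    return result


def overlap_from_counts(n_a, n_b, n_union):
    """Inclusion-exclusion: |A and B| = |A| + |B| - |A or B|."""
    return n_a + n_b - n_union


# Toy answer sets with hypothetical reference identifiers.
hybrid_material = {"ref1", "ref2", "ref3"}
hybrid_materials = {"ref3", "ref4"}
nanoparticles = {"ref2", "ref4", "ref9"}

combined = combine(hybrid_material, hybrid_materials)
final = combined & nanoparticles  # intersection with the third set
print(sorted(final))

# Entry 16 hit counts from the text: 470 and 780 hits combined into 1105.
print(overlap_from_counts(470, 780, 1105))
```

Under the union assumption, the entry 16 counts imply that 145 references contain both the singular and the plural form 'as entered'. This figure is derived here and is not reported in the study.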
The first point we attempted to address is the non-negligible proportion of duplicate answers observed within the Scifinder answer sets, whose total count is equal to 204 when summing all duplicates corresponding to each query. These internal duplicates were found among many Medline articles that lack a DOI, whereas the corresponding articles are assigned a DOI if the PubMed interface is queried. Among these 204 references, we also observed that some journal names are indexed differently in the Medline and CAplus databases. Representative examples are given in Table 2. With respect to the Scopus and WoS databases, only one and two duplicates were found, respectively.
Article count | CAplus | Medline |
---|---|---|
18 | Acta Crystallographica, Section E. Structure Reports Online | Acta crystallographica. Section E, Structure reports online |
39 | Angewandte Chemie, International Edition | Angewandte Chemie (International ed. in English) |
86 | Chemistry -- A European Journal; Chemistry – A European Journal | Chemistry (Weinheim an der Bergstrasse, Germany) |
Tables 3 and 4 display the queries specific to WoS and Scopus, respectively, and the resulting hit counts related to the selected terms and expressions used within Scifinder's topic searches. Keeping in mind that Scifinder's topic searches include by default all indexing terms from titles, abstracts, index terms and supplementary terms, we selected the corresponding WoS search field TS (column 3) that covers the fields: title, abstract, author keywords and Keywords Plus®. Queries to Scopus (column 3) were performed through the document search tab in basic mode together with the option gathering together title, abstract and keywords. In this way the retrieved answer lists are equivalent to the ones retrieved by using the field sum ‘TITLE-ABS-KEY-AUTH’ available in advanced search mode. In order to perform topic searches comparable to Scifinder's topic searches, the use of right-hand truncation was systematically preferred because it enables better control over WoS and Scopus queries. Boolean operators were also employed to target all queries precisely, especially the proximity operators available in WoS and Scopus, which retrieve the searched terms within the same bibliographic field. The WoS operator NEAR searches terms that are separated by default by a maximum of 15 terms but this distance may be shortened. Terms within double quotes were alternatively searched as an exact expression (entries 7, 11, 12 and 16, Table 3). The logic for the proximity operator W/n is similar in Scopus. This operator requires defining a number n equivalent to the distance between the searched terms. The automatic truncation in Scifinder was offset within WoS and Scopus searches by extensive use of wildcards as exemplified in entry 1 (Tables 3 and 4), thus enabling the terms ‘allene(s)’ or ‘allenyl’ or ‘allenic’ to be retrieved.
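To make the roles of right-hand truncation and proximity operators concrete, the following Python sketch emulates both on a plain string. It is not the matching engine of WoS or Scopus: the helper names are hypothetical, and the distance is assumed to be the difference between word positions, whereas each platform defines its own counting rule.

```python
import re


def wildcard_to_regex(pattern):
    """Translate a right-truncated term such as 'allene*' into a whole-word regex."""
    stem = re.escape(pattern.rstrip("*"))
    suffix = r"\w*" if pattern.endswith("*") else ""
    return re.compile(rf"\b{stem}{suffix}\b", re.IGNORECASE)


def matches_any(text, patterns):
    """Emulate 'p1 OR p2 OR ...' on a single text field."""
    return any(wildcard_to_regex(p).search(text) for p in patterns)


def near(text, left, right, n=15):
    """Emulate a proximity search: True if both terms occur at most n word positions apart."""
    words = re.findall(r"\w+", text.lower())
    left_re, right_re = wildcard_to_regex(left), wildcard_to_regex(right)
    left_pos = [i for i, w in enumerate(words) if left_re.fullmatch(w)]
    right_pos = [i for i, w in enumerate(words) if right_re.fullmatch(w)]
    return any(abs(i - j) <= n for i in left_pos for j in right_pos)


allene_query = ["allene*", "alleny*", "alleni*"]
print(matches_any("Synthesis of allenyl ketones", allene_query))
print(near("Agostic C-H interactions in titanium complexes", "agostic", "interaction*"))
```

In this sketch a title containing 'allenyl' is matched by the three-term query of entry 1, and lowering `n` shows how a shorter proximity distance excludes references in which the two terms are further apart.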
Entry | Queried expression in Scifinder | Queried expression in WoS | Articlesa | Unique articlesb |
---|---|---|---|---|
a Article counts after filtering by year, language and document type. b After parsing by the Iddup script. | ||||
1 | Allene(s) | TS = allene* OR TS = alleny* OR TS = alleni* | 503 | 466 |
2 | Organocatalysis | TS = organocataly* | 997 | 963 |
3 | Peptidomimetics | TS = peptidomimetic* | 282 | 257 |
4 | Agostic interactions | TS = (agostic NEAR interaction*) | 65 | 63 |
5 | Battery electrodes | TS = ((battery OR batteries) NEAR electrode*) | 1068 | 1008 |
6 | Graphene biosensors | TS = (graphene NEAR biosensor*) | 47 | 47 |
7 | N-Heterocyclic carbene(s) | TS = “N-heterocyclic carbene*” | 629 | 629 |
8 | Modified nucleoside(s) | TS = (modif* NEAR nucleoside*) | 117 | 114 |
9 | Phosphine ligand(s) | TS = (phosphine NEAR/1 ligand*) | 322 | 319 |
10 | Renewable feedstock | TS = (renewable NEAR feedstock*) | 110 | 92 |
11 | Copper (Cu) catalyzed arylation | (TS = (“copper catalyzed”) OR TS = (“Cu catalyzed”)) AND TS = arylation | 105 | 103 |
12 | Hybrid materials and nanoparticles | TS = (“hybrid material*”) AND TS = nanoparticle* | 256 | 234 |
13 | Viscosity of ionic liquids | TS = viscosity AND TS = ionic liquid* | 360 | 350 |
14 | Band gap in solar cells | TS = ((band NEAR gap) AND (solar NEAR cell*)) | 677 | 599 |
15 | Statistical analyses of DNA microarrays | TS = statistical analyses of dna microarrays | 93 | 93 |
16 | Surface area in mesoporous materials | TS = (“surface area”) AND TS = (mesopor* material*) | 207 | 193 |
Entry | Queried expression in Scifinder | Queried expression in Scopus | Articlesa | Unique articlesb |
---|---|---|---|---|
a Article counts after filtering by year, language and document type. b After parsing by the Iddup script. | ||||
1 | Allene(s) | Allene* OR alleny* OR alleni* | 359 | 356 |
2 | Organocatalysis | Organocataly* | 770 | 758 |
3 | Peptidomimetics | Peptidomimetic* | 299 | 294 |
4 | Agostic interactions | Agostic W/15 interaction* | 49 | 48 |
5 | Battery electrodes | Batter* W/15 electrode* | 1020 | 816 |
6 | Graphene biosensors | graphene W/15 biosensor* | 66 | 59 |
7 | N-Heterocyclic carbene(s) | N-Heterocyclic W/1 carbene* | 462 | 458 |
8 | Modified nucleoside(s) | (modif* W/15 nucleoside*) | 114 | 113 |
9 | Phosphine ligand(s) | Phosphine W/1 ligand* | 286 | 284 |
10 | Renewable feedstock | Renewable W/15 feedstock* | 242 | 154 |
11 | Copper (Cu) catalyzed arylation | ((copper W/1 catalyzed) OR (Cu W/1 catalyzed)) AND arylation | 52 | 52 |
12 | Hybrid materials and nanoparticles | (Hybrid W/1 material* and nanoparticle*) | 271 | 235 |
13 | Viscosity of ionic liquids | Viscosity of ionic W/1 liquid* | 341 | 327 |
14 | Band gap in solar cells | Band gap in solar cell* | 606 | 424 |
15 | Statistical analyses of DNA microarrays | Statistical analys* of dna microarray* | 310 | 298 |
16 | Surface area in mesoporous materials | Surface area in mesopor* W/1 material* | 383 | 362 |
When comparing two references from input lists without internal duplicates, Iddup assigns each pair a score that is computed based on the following filters:
• initial score = 0
• if same DOI then score = 10 (and references are identical)
• if similar title then increment score +3
• if same journal then increment score +1
• if same author count then increment score +0.5
• if similar author and same position then increment score +0.5
• if same starting page then increment score +1.5
• if same volume then increment score +0.5
• if same issue then increment score +0.5
• if score > 5, then the two references are considered identical.
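The scoring rules listed above can be sketched in Python as follows. This is not the Iddup source code: the `Reference` class and function names are hypothetical, "similar" titles and authors are reduced here to case-insensitive equality, and empty fields are assumed never to match.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Reference:
    title: str
    journal: str = ""
    authors: List[str] = field(default_factory=list)
    start_page: str = ""
    volume: str = ""
    issue: str = ""
    doi: Optional[str] = None


def _same(x, y):
    """Equality that ignores case and never matches on empty values (an assumption)."""
    return bool(x) and bool(y) and x.strip().lower() == y.strip().lower()


def pair_score(a, b, similar_title=_same):
    """Score a pair of references with the weights given in the list above."""
    if _same(a.doi, b.doi):
        return 10.0  # same DOI: the remaining filters are skipped
    score = 0.0
    if similar_title(a.title, b.title):
        score += 3
    if _same(a.journal, b.journal):
        score += 1
    if a.authors and len(a.authors) == len(b.authors):
        score += 0.5
    if any(_same(x, y) for x, y in zip(a.authors, b.authors)):
        score += 0.5  # at least one matching author at the same position
    if _same(a.start_page, b.start_page):
        score += 1.5
    if _same(a.volume, b.volume):
        score += 0.5
    if _same(a.issue, b.issue):
        score += 0.5
    return score


def identical(a, b, similar_title=_same):
    return pair_score(a, b, similar_title) > 5


a = Reference("Organocatalysis in water", "Chem. Eur. J.", ["Smith", "Doe"], "100", "16", "2")
b = Reference("organocatalysis in water", "Chemistry", ["Smith", "Doe"], "100", "16", "2")
print(pair_score(a, b), identical(a, b))
```

In this invented example the two records differ in their journal spelling and have no DOI, yet the title, authors, starting page, volume and issue add up to 6.5, which is above the threshold of 5.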
The second instruction enables the script to skip the subsequent instructions when identical DOIs are found. A similarity computation was introduced at the third instruction, which compares the titles, because many titles contain abbreviations or Greek characters that are not always indexed in the same way by the different editors. These observations prompted us to introduce a 12% similarity score (12% of the length of the longest title) that was computed using the Levenshtein distance.32 The influence of this parameter is discussed in Section 3.3. Likewise the author names present many discrepancies due to different spelling conventions, typing errors or a different order in the indexing of their names. Our script was completed by correspondence arrays for some journal titles and for the Latin transcription of Greek characters. Finally Iddup discards citations corresponding to errata or corrections.
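A title comparison based on the Levenshtein distance, as described above, might look like the following sketch. The way the 12% threshold is applied (edit distance not exceeding 12% of the longest title) is an interpretation of the text, and `GREEK_TO_LATIN` is a hypothetical three-letter excerpt standing in for the correspondence arrays mentioned by the authors.

```python
def levenshtein(s, t):
    """Classic dynamic-programming edit distance, keeping two rows in memory."""
    if len(s) < len(t):
        s, t = t, s
    previous = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        current = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


# Hypothetical excerpt of a Greek-to-Latin correspondence array.
GREEK_TO_LATIN = {"α": "alpha", "β": "beta", "γ": "gamma"}


def normalize(title):
    """Lower-case, transliterate Greek letters and collapse whitespace."""
    title = title.lower()
    for greek, latin in GREEK_TO_LATIN.items():
        title = title.replace(greek, latin)
    return " ".join(title.split())


def titles_similar(a, b, max_fraction=0.12):
    """True if the edit distance is at most max_fraction of the longest title."""
    a, b = normalize(a), normalize(b)
    longest = max(len(a), len(b))
    if longest == 0:
        return True
    return levenshtein(a, b) <= max_fraction * longest


print(titles_similar("Synthesis of β-lactams", "Synthesis of beta-lactams"))
print(titles_similar("Allenes in catalysis", "Battery electrodes"))
```

Keeping `max_fraction` as a parameter makes it easy to test how the threshold affects the number of references that are merged.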
These results were refined through Iddup computing by identifying the common articles (column 4 in Tables 5–7) to each pair of editors and the specific articles to each editor (columns 3 and 6 in Tables 5–7). The union of the total article counts (column 7, Tables 5–7) is given by the sum of columns 3, 4 and 6 while column 8 represents the proportion of common articles to two editors. Preliminary observations show that these proportions vary dramatically from a maximum of 80.0 to a minimum of 11.4 percent (entries 4 and 15, Table 6). Higher proportions of common articles were generally observed for single-, double- or triple-term queries than for the queries including four terms.
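The column arithmetic described above (specific, common, union and the common/union proportion) can be reproduced with the short sketch below. It assumes that each reference has already been reduced to a canonical key, so that plain set operations stand in for the pairwise matching performed by Iddup; the function names are hypothetical.

```python
def compare(list_a, list_b):
    """Split two duplicate-free reference lists into specific and common parts."""
    a, b = set(list_a), set(list_b)
    common, union = a & b, a | b
    pct = round(100 * len(common) / len(union), 1) if union else 0.0
    return {
        "specific_a": len(a - b),
        "common": len(common),
        "specific_b": len(b - a),
        "union": len(union),
        "common_pct": pct,
    }


def common_over_union(spec_a, common, spec_b):
    """Recompute columns 7 and 8 from columns 3, 4 and 6."""
    union = spec_a + common + spec_b
    return union, round(100 * common / union, 1)


# Toy lists with hypothetical keys.
print(compare(["a", "b", "c"], ["b", "c", "d", "e"]))

# Entry 1 of Table 5 (Scifinder vs. WoS): 278 specific, 307 common, 159 specific.
print(common_over_union(278, 307, 159))
```

For entry 1 of Table 5 this gives a union of 744 articles and a shared proportion of 41.3%, in agreement with the tabulated values.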
Entrya | Scifinder uniq. articles | Scifinder spec.b | Commonc | WoS uniq. articles | WoS spec.d | Unione | Comm./unionf (%) |
---|---|---|---|---|---|---|---|
a Entries 1–16 correspond to the queried expressions of previous Tables 3 and 4. b Specific articles to Scifinder. c Shared articles by both editors. d Specific articles to WoS. e Sum of columns 3, 4 and 6. f Proportion of common articles to two editors. | |||||||
1 | 585 | 278 | 307 | 466 | 159 | 744 | 41.3 |
2 | 603 | 50 | 553 | 963 | 410 | 1013 | 54.6 |
3 | 305 | 123 | 182 | 257 | 75 | 380 | 47.9 |
4 | 51 | 6 | 45 | 63 | 18 | 69 | 65.2 |
5 | 806 | 328 | 478 | 1008 | 530 | 1336 | 35.8 |
6 | 87 | 46 | 41 | 47 | 6 | 93 | 44.1 |
7 | 450 | 46 | 404 | 629 | 225 | 675 | 59.9 |
8 | 98 | 30 | 68 | 114 | 46 | 144 | 47.2 |
9 | 253 | 90 | 163 | 319 | 156 | 409 | 39.9 |
10 | 113 | 47 | 66 | 92 | 26 | 139 | 47.5 |
11 | 51 | 16 | 35 | 103 | 68 | 119 | 29.4 |
12 | 177 | 50 | 127 | 234 | 107 | 284 | 44.7 |
13 | 335 | 105 | 230 | 350 | 120 | 455 | 50.5 |
14 | 430 | 248 | 182 | 599 | 417 | 847 | 21.5 |
15 | 93 | 72 | 21 | 93 | 72 | 165 | 12.7 |
16 | 294 | 232 | 62 | 193 | 131 | 425 | 14.6 |
Entrya | Scifinder uniq. articles | Scifinder spec.b | Commonc | Scopus uniq. articles | Scopus spec.d | Unione | Comm./unionf (%) |
---|---|---|---|---|---|---|---|
a Entries 1–16 correspond to the queried expressions of previous Tables 3 and 4. b Specific articles to Scifinder. c Shared articles by both editors. d Specific articles to Scopus. e Sum of columns 3, 4 and 6. f Proportion of common articles to two editors. | |||||||
1 | 585 | 275 | 310 | 356 | 46 | 631 | 49.1 |
2 | 603 | 33 | 570 | 758 | 188 | 791 | 72.1 |
3 | 305 | 103 | 202 | 294 | 92 | 397 | 50.9 |
4 | 51 | 7 | 44 | 48 | 4 | 55 | 80.0 |
5 | 806 | 319 | 487 | 816 | 329 | 1135 | 42.9 |
6 | 87 | 39 | 48 | 59 | 11 | 98 | 49.0 |
7 | 450 | 49 | 401 | 458 | 57 | 507 | 79.1 |
8 | 98 | 29 | 69 | 113 | 44 | 142 | 48.6 |
9 | 253 | 75 | 178 | 284 | 106 | 359 | 49.6 |
10 | 113 | 34 | 79 | 154 | 75 | 188 | 42.0 |
11 | 51 | 14 | 37 | 52 | 15 | 66 | 56.1 |
12 | 177 | 41 | 136 | 235 | 99 | 276 | 49.3 |
13 | 335 | 100 | 235 | 327 | 92 | 427 | 55.0 |
14 | 430 | 236 | 194 | 424 | 230 | 660 | 29.4 |
15 | 93 | 53 | 40 | 298 | 258 | 351 | 11.4 |
16 | 294 | 208 | 86 | 362 | 276 | 570 | 15.1 |
Entrya | Scopus uniq. articles | Scopus spec.b | Commonc | WoS uniq. articles | WoS spec.d | Unione | Comm./unionf (%) |
---|---|---|---|---|---|---|---|
a Entries 1–16 correspond to the queried expressions of previous Tables 3 and 4. b Specific articles to Scopus. c Shared articles by both editors. d Specific articles to WoS. e Sum of columns 3, 4 and 6. f Proportion of common articles to two editors. | |||||||
1 | 356 | 51 | 305 | 466 | 161 | 517 | 59.0 |
2 | 758 | 44 | 714 | 963 | 249 | 1007 | 70.9 |
3 | 294 | 92 | 202 | 257 | 55 | 349 | 57.9 |
4 | 48 | 3 | 45 | 63 | 18 | 66 | 68.2 |
5 | 816 | 268 | 548 | 1008 | 460 | 1276 | 42.9 |
6 | 59 | 16 | 43 | 47 | 4 | 63 | 68.3 |
7 | 458 | 38 | 420 | 629 | 209 | 667 | 63.0 |
8 | 113 | 28 | 85 | 114 | 29 | 142 | 59.9 |
9 | 284 | 63 | 221 | 319 | 98 | 382 | 57.9 |
10 | 154 | 66 | 88 | 92 | 4 | 158 | 55.7 |
11 | 52 | 14 | 38 | 103 | 65 | 117 | 32.5 |
12 | 235 | 87 | 148 | 234 | 86 | 321 | 46.1 |
13 | 327 | 83 | 244 | 350 | 106 | 433 | 56.4 |
14 | 424 | 132 | 292 | 599 | 307 | 731 | 39.9 |
15 | 298 | 226 | 72 | 93 | 21 | 319 | 22.6 |
16 | 362 | 204 | 157 | 193 | 36 | 397 | 39.5 |
Though the main results were recorded for the year 2010, we extended the query timespan to the years 1990, 1995, 2000, and 2005 for the four expressions: ‘allenes’, ‘peptidomimetics’, ‘battery electrodes’ and ‘band gap in solar cells’. These expressions were selected because their corresponding queries furnished sufficient hit counts to be representative as early as 1990. For example expressions such as ‘organocatalysis’ or ‘N-heterocyclic carbenes’ returned no answer in 1990 and 1995 and were thus discarded. A second selection criterion was the variable length of these four expressions.
Full resulting data are included in the ESI† (Table S2). As general conclusions of this supplementary study, we noticed that: (i) the three databases lead to different result sets as in 2010, (ii) large non-overlapping result sets were found during the years 1990, 1995, 2000, and 2005, and (iii) the proportion of overlapping papers increases over the years except for ‘peptidomimetics’.
In order to close this section, we may mention that the overall averages of shared references by Scifinder/WoS, Scifinder/Scopus and Scopus/WoS are 40.8, 46.8 and 52.2% respectively.
– Journal: journal indexing may be absent or is stopped before 2010 or issue indexing is incomplete.
– Document types: Conference Proceedings, Book Reviews, and International Symposia that are not homogeneously indexed by the editors.
– Index terms: Indexing terms, Keywords and Keywords Plus®. In case of Scifinder, supplementary terms are included in index terms.
– Modified terms: (a) some journals do not provide any abstract; in those cases Scifinder designs an abstract that seems to be excerpted from the article conclusion, (b) some queried terms are indexed using a hyphen included in the retrieved term i.e. organo-catalytic, (c) the journal title is indexed in two different spellings, and (d) author keywords or titles or abstracts are modified.
– Abstracts: though provided by the journal, some abstracts are not indexed.
– Author keywords: though provided by the publisher, some author keywords are excluded from indexing.
– Different year: some issues are assigned a different year because the dates of the online publication and of the printed version are different.
– Wrong DOI: typographic errors were found in agreement with recent similar observations.34 We noticed that a non-negligible amount of articles were missing an assigned DOI. Indeed concatenation of all articles from a particular editor followed by the removal of internal duplicates revealed that 8.7, 6.5 and 4.7% of articles from Scifinder, Scopus and WoS, respectively, were missing a DOI.
– Miscellaneous.
ESI† (Doc 1) details the full results for the queried term ‘organocatalysis’ and for the expressions ‘N-heterocyclic carbenes’, ‘phosphine ligands’ and ‘viscosity of ionic liquids’. Table 8 displays the results obtained for the ‘organocatalysis’ queried term. The main observed differences arise from the Index terms row. The Keywords Plus® indexing of WoS provided more articles than those retrieved by Scopus's or Scifinder's term indexing, the latter editor showing the weakest efficiency of its term indexing policy in this example. We also checked the relevance of 50 randomly selected references from the 234 references retrieved only through Keywords Plus®. At least 45 of these 50 references were strongly related to organocatalysis. With respect to the Modified terms row, Scifinder designed an abstract excerpted from the article conclusion in one case and, in the other, a hyphen was introduced in the term ‘organocatalytic’ by WoS (Table 8, column 3). On the same row (Table 8, entry 4, column 4), a hyphen was introduced in the term ‘organocatalytic’ eight times by Scifinder and in one case the term ‘organocatalyst*’ was shortened to ‘catalyst*’ within the title. Within the Abstracts row the reference found by Scifinder (Table 8, column 3) presents an abstract that was not indexed by WoS. In the case of the journal ‘Angewandte Chemie, International Edition in English’, we checked 500 articles of this journal and found that WoS had not indexed their abstracts. This statement is valid up to 2010 but many abstracts are indexed in more recent years. In column 4 (entry 5) the 10 references found specifically by WoS result from a left truncation of the term ‘organocatalyst’ to ‘catalyst’ in Scifinder. More surprising are the 149 references (entry 6, column 4) where Scifinder modified the original author keywords by shortening or suppressing the queried term.
Entry | Category | Scifinder/WoS: Scifinder (50)a | Scifinder/WoS: WoS (410)a | Scifinder/Scopus: Scifinder (33)a | Scifinder/Scopus: Scopus (188)a | Scopus/WoS: Scopus (44)a | Scopus/WoS: WoS (249)a |
---|---|---|---|---|---|---|---|
a Numbers within brackets correspond to specific articles reported in entries 2 of Tables 5–7. b Numbers within brackets correspond to the sum of articles from the indexing categories 3 to 6. | |||||||
1 | Journals | 22 | 0 | 16 | 4 | 10 | 5 |
2 | Document types | 5 | 1 | 5 | 0 | 4 | 1 |
3 | Index terms | 4 | 234 | 7 | 148 | 0 | 232 |
4 | Modified terms | 2 | 9 | 0 | 5 | 1 | 0 |
5 | Abstracts | 1 | 10 | 2 | 29 | 11 | 1 |
6 | Author keywords | 0 | 149 | 0 | 0 | 0 | 1 |
7 | Different year | 6 | 2 | 1 | 1 | 11 | 3 |
8 | Wrong DOI | 5 | 5 | 1 | 1 | 7 | 6 |
9 | Miscellaneous | 5 | 0 | 1 | 0 | 0 | 0 |
10 | Checked index termsb | 6 (7) | 387 (402) | 9 (9) | 182 (182) | 10 (12) | 231 (232) |
Five articles were indexed by WoS with one misspelled character in their DOI compared to the original DOI (Table 8, entry 8). Finally the miscellaneous category contains articles where: (i) the filters applied to the document types during the querying step differ from one editor to another, so that during the analysis step Iddup discards citations corresponding to some unwanted document types, i.e. book chapters and corrections, and (ii) the 0.8 similarity score on the titles and on the author names was in one case the reason why two references were wrongly differentiated.
If we consider all the articles of a particular editor that are classified in the index terms, modified terms, abstracts or author keywords categories, the next question is whether the concurrent editor's database really lacks these articles. To check this, we entered the DOIs or the bibliographic data of a given editor's articles from these indexing categories into the query interface of the concurrent editor. The results are displayed in the last row (Table 8, entry 10). For example, 6 of the 7 specific articles retrieved by Scifinder (Table 8, column 3) are also present in WoS, which emphasizes the importance of Scifinder's indexing policy in this case. With this established, we noted that only 22 articles from specific journals (Table 8, entry 1) and 5 articles from the document type category (Table 8, entry 2) belong specifically to Scifinder. The vast majority of the articles retrieved by WoS (Table 8, entry 10, column 4) would likewise have been retrieved by Scifinder if different indexing rules had been applied.
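The arithmetic behind entry 10 can be reproduced with a short sketch. The numbers are those of Table 8 for the Scifinder/WoS pair; the variable names and the `checked_share` helper are hypothetical and serve only to make the calculation explicit.

```python
# Table 8, Scifinder/WoS pair: entries 3 to 6 (indexing categories).
indexing_counts = {
    "Scifinder": {"index_terms": 4, "modified_terms": 2,
                  "abstracts": 1, "author_keywords": 0},
    "WoS": {"index_terms": 234, "modified_terms": 9,
            "abstracts": 10, "author_keywords": 149},
}
# Table 8, entry 10: articles also found in the concurrent editor's database.
found_in_other_db = {"Scifinder": 6, "WoS": 387}


def checked_share(editor: str) -> tuple[int, int, float]:
    """Return (found, total, found/total) for one editor's indexing categories."""
    total = sum(indexing_counts[editor].values())
    found = found_in_other_db[editor]
    return found, total, found / total


for editor in indexing_counts:
    found, total, share = checked_share(editor)
    print(f"{editor}: {found}/{total} = {share:.0%}")
# Scifinder: 6/7 = 86%
# WoS: 387/402 = 96%
```

In both directions, most of the articles attributed to an editor through its indexing are therefore present in the other database and are simply not retrieved by the query.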
Comparing Scifinder and Scopus (Table 8, columns 5 and 6) on their specific references led to similar observations. Journal coverage favours Scifinder, whereas Scopus retrieves a higher article count owing to its term indexing. Moreover, for 3 reviews Scopus indexes not only the abstracts but also the tables of contents, in which the queried term is present. We also noticed that author keywords were neither suppressed nor modified.
Comparing Scopus and WoS (Table 8, columns 7 and 8), we observed that WoS shows a high count of articles retrieved through Keywords Plus® indexing. Among the 11 articles in the abstracts category (Table 8, column 7), 3 are reviews that Scopus indexed through their tables of contents. The 8 remaining articles of the abstracts category correspond to references for which WoS did not index the abstract. The different year category contains a fairly large number of articles: 11 articles are indexed by WoS in 2009 or 2011 and 3 articles are indexed by Scopus in 2009. Obviously, these articles would have been retrieved by a multiple-year query. The wrong DOI category contains the same articles as noted previously.
In order to confirm the results displayed in Table 8, we analyzed some data from two-term queries and a three-term query (Table 9). The first expression studied was ‘N-heterocyclic carbenes’ (Table 9, columns 3 and 4), with the articles retrieved by Scifinder and Scopus, respectively. Here again the influence of term indexing is predominant, but to a smaller extent than previously. Within the modified terms category, we observed that in some cases Scifinder expanded the NHC acronym to ‘N-heterocyclic carbenes’, thus enabling the corresponding article to be retrieved. Finally, 6 of the 7 articles in the miscellaneous category (Table 9, column 3) correspond to misspellings or typographic errors from Scopus.
Entry | Category | N-Heterocyclic carbenes: Scifinder (49)a | N-Heterocyclic carbenes: Scopus (57)a | Phosphine ligands: Scopus (63)a | Phosphine ligands: WoS (98)a | Viscosity of ionic liquids: Scopus (83)a | Viscosity of ionic liquids: WoS (106)a |
---|---|---|---|---|---|---|---|
1 | Journals | 7 | 13 | 8 | 2 | 3 | 4 |
2 | Document types | 2 | 0 | 2 | 1 | 2 | 0 |
3 | Index terms | 15 | 25 | 31 | 94 | 72 | 93 |
4 | Modified terms | 9 | 14 | 1 | 0 | 0 | 0 |
5 | Abstracts | 3 | 0 | 13 | 0 | 2 | 0 |
6 | Author keywords | 0 | 0 | 0 | 0 | 0 | 1 |
7 | Different year | 6 | 3 | 0 | 1 | 0 | 1 |
8 | Wrong DOI | 0 | 0 | 0 | 0 | 2 | 1 |
9 | Miscellaneous | 7 | 2 | 8 | 0 | 2 | 6 |
10 | Checked index termsb | 27 (27) | 39 (39) | 45 (45) | 91 (94) | 71 (74) | 88 (93) |

a Numbers within brackets correspond to specific articles reported in entries 2 of Tables 5–7. b Numbers within brackets correspond to the sum of articles from the indexing categories 3 to 6.

The next results concern the two-term expression ‘phosphine ligands’ and the articles retrieved by Scopus and WoS (Table 9, columns 5 and 6). Apart from the predominant influence of term indexing by both editors, Scopus offers in this case slightly better journal coverage and better abstract coverage. In the miscellaneous category, Scopus retrieved some articles containing expanded forms of the term ‘phosphine’, such as ‘bisphosphine’ or ‘triphenylphosphine’. Finally, we looked at the three-term query ‘viscosity of ionic liquids’ (Table 9, columns 7 and 8) and examined the specific articles retrieved by Scopus and WoS. The proportions observed within the different categories are similar to those obtained in the previous cases, the index terms category remaining the main differentiating one.
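How an expanded form such as ‘bisphosphine’ can enter a result list is easy to illustrate. The sketch below contrasts whole-word matching with plain substring matching; the matching rules actually applied by Scopus are not described here, so both functions are assumptions, and the three titles are made-up examples.

```python
import re

# Made-up titles used only to illustrate the two matching behaviours.
titles = [
    "New phosphine ligands for palladium catalysis",
    "Chiral bisphosphine ligands in hydrogenation",
    "Triphenylphosphine ligands on gold clusters",
]


def whole_word_hits(term: str, texts: list[str]) -> list[str]:
    """Keep texts where the term appears delimited by word boundaries."""
    pattern = re.compile(rf"\b{re.escape(term)}\b", re.IGNORECASE)
    return [t for t in texts if pattern.search(t)]


def substring_hits(term: str, texts: list[str]) -> list[str]:
    """Keep texts containing the term anywhere, even inside a longer word."""
    return [t for t in texts if term.lower() in t.lower()]


print(len(whole_word_hits("phosphine ligands", titles)))  # 1
print(len(substring_hits("phosphine ligands", titles)))   # 3
```

Under substring matching, the two titles containing ‘bisphosphine’ and ‘triphenylphosphine’ are returned in addition to the exact expression.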
These last results (Tables 8 and 9) were obtained manually rather than by any algorithmic process, and they concern only part of the study presented in Tables 5–7. Nevertheless, they reveal some interesting trends about the scope and limits of the term and keyword indexing policies of Scifinder, Scopus and WoS. If we now focus on the values displayed in the different columns of entry 10 (Tables 8 and 9), we observe that a high proportion of the articles retrieved in the indexing categories by a particular editor are also present in the other editors' databases. Ultimately, this emphasizes the influence of the term and keyword indexing policies of these editors, because most informative articles are shared by the three editors. In other words, the proportion of information specific to a given editor is not as high as might be expected from the preliminary results displayed in Tables 5–7. Moreover, the term and keyword indexing policies differentiate the three editors to a greater extent than their respective journal coverages do.
Expression | Row | Scifinder/WoS (Scifinder) | Scifinder/WoS (common) | Scifinder/WoS (WoS) | Scifinder/Scopus (Scifinder) | Scifinder/Scopus (common) | Scifinder/Scopus (Scopus) | Scopus/WoS (Scopus) | Scopus/WoS (common) | Scopus/WoS (WoS) |
---|---|---|---|---|---|---|---|---|---|---|
‘Organocatalysis’ | Reference lists | Specific (50)a | Common (553) | Specific (410) | Specific (33) | Common (570) | Specific (188) | Specific (44) | Common (714) | Specific (249) |
 | Best citation count | 442 | 707 | 496 | 75 | 707 | 243 | 442 | 707 | 496 |
 | Second-best citation count | 298 | 390 | 280 | 51 | 442 | 197 | 243 | 390 | 280 |
‘N-Heterocyclic carbenes’ | Reference lists | Specific (46) | Common (404) | Specific (225) | Specific (49) | Common (401) | Specific (57) | Specific (38) | Common (420) | Specific (209) |
 | Best citation count | 200 | 445 | 1320 | 89 | 445 | 755 | 755 | 445 | 1320 |
 | Second-best citation count | 82 | 233 | 649 | 80 | 233 | 395 | 395 | 388 | 649 |
‘Phosphine ligands’ | Reference lists | Specific (90) | Common (163) | Specific (156) | Specific (75) | Common (178) | Specific (106) | Specific (63) | Common (221) | Specific (98) |
 | Best citation count | 182 | 200 | 259 | 182 | 200 | 755 | 755 | 200 | 259 |
 | Second-best citation count | 119 | 101 | 249 | 119 | 101 | 92 | 92 | 101 | 249 |
‘Band gap in solar cells’ | Reference lists | Specific (248) | Common (182) | Specific (417) | Specific (236) | Common (194) | Specific (230) | Specific (132) | Common (292) | Specific (307) |
 | Best citation count | 1358 | 394 | 922 | 538 | 1358 | 628 | 1358 | 477 | 922 |
 | Second-best citation count | 538 | 245 | 590 | 394 | 245 | 477 | 628 | 245 | 590 |

a Numbers within brackets correspond to specific articles reported in entries 2 of Tables 5–7.

If we look only at the lines corresponding to the two best citation counts, we notice that high values are found both for specific and for common references. For the ‘organocatalysis’ expression, the common reference lists contain the best cited papers, but for the three other expressions the best cited paper is found in one or two specific reference lists. These references with such high citation counts deserve a closer look. First, the reference with a citation count of 1320 (Table 10, columns 5 and 11, line 5) was published in the ACS journal ‘Chemical Reviews’. No conventional abstract is provided by the ACS, and WoS retrieves this article owing to its Keywords Plus®. We came to the same conclusion for the article with a citation count of 259 (Table 10, columns 5 and 11, line 11).
For the paper with 755 citations (Table 10, columns 5 and 6, lines 5 and 11), again an article from ‘Chemical Reviews’, Scopus wrote its own abstract from the conclusion of the original paper. Finally, the paper with 1358 citations (Table 10, columns 3, 7 and 9, line 15) was retrieved as a common reference by Scifinder and Scopus, which both wrote their own abstract containing the queried expression. These abstracts were, however, different. WoS did not retrieve this paper because it does not appear to write abstracts from scratch.
General trends cannot be drawn from these few results, but each of the three editors retrieves highly cited papers. Moreover, abstract and keyword indexing play a non-negligible role in these examples.
A test was also performed on the ‘organocatalysis’ expression, and the results are summarized in Table 11. They correlate with the results of the previous paragraph but with an inverse trend. Low threshold values give a higher count of duplicates between Scifinder and another database, because the count of unique Scifinder articles is higher at low values of the Levenshtein distance. No difference was observed when comparing the Scopus and WoS reference lists. It is worth mentioning that the values observed for the common references (columns 3 and 6) and for the specific references of WoS and Scopus are stable. Furthermore, varying the Levenshtein distance affects only Scifinder's specific references, and only for values less than or equal to 9.
Levenshtein distance (%) | Scifinder | Common | WoS | Scifinder | Common | Scopus |
---|---|---|---|---|---|---|
3 | 52 | 553 | 410 | 35 | 570 | 188 |
6 | 51 | 553 | 410 | 34 | 570 | 188 |
9 | 50 | 553 | 410 | 33 | 570 | 188 |
12 | 50 | 553 | 410 | 33 | 570 | 188 |
15 | 50 | 553 | 410 | 33 | 570 | 188 |
18 | 50 | 553 | 410 | 33 | 570 | 188 |
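A minimal sketch of a Levenshtein threshold expressed as a percentage is given below. Table 11 only states ‘Levenshtein distance (%)’, so normalizing the edit distance by the length of the longer string is an assumption, and the function names are hypothetical; the script actually used is provided in the ESI (Doc 3).

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def distance_percent(a: str, b: str) -> float:
    """Edit distance as a percentage of the longer string (assumed normalization)."""
    longest = max(len(a), len(b))
    return 0.0 if longest == 0 else 100.0 * levenshtein(a, b) / longest


def is_duplicate(a: str, b: str, threshold_percent: float) -> bool:
    """Treat two strings as the same reference when they differ by at most the threshold."""
    return distance_percent(a.lower(), b.lower()) <= threshold_percent


# One inserted hyphen in a 16-character string is a 6.25% distance.
for threshold in (3, 6, 9):
    print(threshold, is_duplicate("organocatalytic", "organo-catalytic", threshold))
```

With this normalization, a single inserted hyphen is only tolerated from the 9% threshold upwards, which shows how small editorial changes to a title can keep a reference in a specific list at low thresholds.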
Footnote
† Electronic supplementary information (ESI) available: Additional information on the diversity of studied domains (Table S1), on the expanded timespan from 1990–2005 (Table S2), the studied references in Table 8 (Doc 1), the studied references in Table 10 (Doc 2) and the source code of the script (Doc 3). See DOI: 10.1039/c5nj01077b
This journal is © The Royal Society of Chemistry and the Centre National de la Recherche Scientifique 2015