3. Citation Statistics, Scientometrics

© Gábor L. Lövei, CC BY 4.0 https://doi.org/10.11647/OBP.0235.03

The quality of science is extremely difficult to measure, but the field of scientometrics attempts to do so by studying how the impact of scientific publications can be quantified. The task remains elusive, but one system, now in wide use, argues as follows.

In a scientific paper, there are only two types of factual statements: already published, known information, which is needed so that readers can understand how the new information relates to earlier material, and the new information itself. While the new information is supported by the facts, data, figures and tables presented in the paper, the known facts are simply mentioned, with a reference to the publication where the relevant fact was first proven or published. This is called a citation. The precise bibliographic data of such citations are listed at the end of published papers, and they can be identified, counted, and summarised.

Important findings, goes the argument, generate new research, and when the new discoveries are published, these previously published findings are cited as connecting links to the understanding of the new discovery. Such papers are therefore frequently cited. This approach equates high citation rates with high “impact”, which, according to this simplified perception, also indicates high importance and/or quality.

It is easy to see that, even if we accept the above argument, a few key questions must be decided: what counts as a citation, where do we do the counting, who does the counting, and for how long?

This is where business sense and sharp thinking came together to create a business opportunity, as well as a new field of analysis. Using its unique position, the Institute for Scientific Information (abbreviated to ISI; one should not be misled by the name: this was not an institute but a business venture, publishing Current Contents) declared that a) we, ISI, will do the counting; b) a citation counts only if it appears in a journal covered by our publication, Current Contents; and c) citations are “valid” and counted over a period of only two years after the publication date.

Originally, the purpose was to identify the most influential journals, and according to the ISI philosophy, these were the journals that published the most frequently cited articles. Citation (only during the two years after publication, remember) equalled scientific impact, and the index thus coined was named the “impact factor” (abbreviated to IF). Despite discussions and doubts almost from the beginning, the IF has caught on and, today, there is hardly a scientist unaware of the term. The success of Current Contents had a knock-on effect on journals, and the ones with a higher IF had an advantage over their rivals in terms of distribution, recognition, and competition for manuscripts presenting discoveries that were thought important. The same statistics were soon applied to organisations and even to individual scientists, and when ISI was sold to Thomson Reuters, aggressive promotion of these more dubious uses intensified.

A multitude of indices based on citation statistics has appeared since this original index, and several books and fora discuss their merits and demerits; the reader is directed to some of these and, as a first step, to the ISI website itself, which today calls itself “Web of Knowledge” (https://www.webofknowledge.com). Here, only two of the most widely known indices are mentioned: the impact factor (IF) and the Hirsch index (h-index).

The IF of a journal is defined as the average number of citations that a single article, published in that journal, receives in the range of journals covered by Web of Science in the two years after publication (see Box 2 for an example of how to calculate the IF). It is worth pointing out, even if this has been done many times, the hubris that the naming of this statistic displays. Being a competitive species, humans could not resist taking the next step, from ranking journals this way to ranking scientists by a similar logic: scientists who publish in high-IF journals are important scientists, and those who do not, are not. There are many pitfalls along that route; for a more detailed discussion, readers can turn to several sources, a good recent example being Mingers and Leydesdorff (forthcoming).
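
To make the arithmetic concrete, here is a minimal Python sketch (with invented citation counts; it is not the book’s Box 2) that follows the simplified definition above: citations received within two years of publication, averaged over the articles the journal published.

```python
# Hypothetical illustration of the impact-factor arithmetic (not the book's Box 2).
# Each number is the count of citations one article received in the two years
# after publication, for every article the journal published in that window.
citations_per_article = [12, 3, 0, 1, 7, 0, 2, 5]

# The IF is the average: total citations divided by the number of articles.
impact_factor = sum(citations_per_article) / len(citations_per_article)
print(f"Impact factor: {impact_factor:.2f}")  # 30 citations / 8 articles = 3.75
```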

Even if we accept the above logic for assessing individual scientists, the IF of the journals where one publishes is an imperfect proxy: the IF values are averages, while the distribution of citations is very right-skewed. Very few articles get much more than their expected share of citations and become fashionable, or “citation classics”, while most articles get much less than the expected average number of citations. This was named the “Matthew Principle”, a tongue-in-cheek reference to a passage in the Bible (Matthew 25:29, RSV) claiming that to those who have, more will be given, and the poor will lose even what little they have.

Given this state of affairs, a second, more logical, step was to use the number of actual, rather than potential, citations to assess scientists. Again, a multitude of indices have been suggested (Harzing, 2002); currently much in vogue is the Hirsch index, or h-index (Hirsch, 2006). To calculate someone’s h-index, all her publications are ranked according to the number of citations attracted, from the highest to the lowest. A person’s h-index equals the highest rank at which the number of citations received by the paper at that rank is still not smaller than the rank itself (see Box 3 for a calculated example). Several modifications and alternatives have been suggested, and the reader can find a good summary of these in the help files of the program “Publish or Perish”, developed by Anne-Wil Harzing (see her website: www.harzing.com).
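
The ranking rule translates into a few lines of code. The sketch below (a hypothetical record with made-up citation counts, not the worked example of Box 3) sorts the counts from highest to lowest and reports the largest rank at which the citation count still reaches the rank.

```python
def h_index(citation_counts):
    """Largest h such that h papers each have at least h citations."""
    ranked = sorted(citation_counts, reverse=True)  # highest-cited paper first
    h = 0
    for rank, cites in enumerate(ranked, start=1):
        if cites >= rank:
            h = rank      # the paper at this rank still has at least 'rank' citations
        else:
            break         # citations fell below the rank: stop counting
    return h

# Hypothetical publication record: 7 papers with these citation counts.
print(h_index([25, 8, 5, 4, 3, 1, 0]))  # prints 4: the top 4 papers each have >= 4 citations
```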

Originally, to be included among the journals covered by Current Contents, a candidate journal had to fulfil stringent criteria: regular publication according to a schedule, papers written by an international range of authors and on topics of wide interest, and a reasonably wide international distribution. Journals usually must wait at least three years before they can get their first impact factor. Journals are now also ranked by their relative position in their category (occasionally in several categories), usually by quartiles (e.g. a Q1 journal is in the top 25% of its group); sometimes the top 10% also forms a separate class (called D1).
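
As a rough sketch of this classification (the journal rank and category size below are invented; the cut-offs are simply the top-10% and quartile boundaries mentioned above):

```python
def journal_class(rank, category_size):
    """Classify a journal by its rank within its category.

    Returns 'D1' for the top 10% (such a journal is, of course, also in Q1),
    otherwise 'Q1'-'Q4' by quartile, following the description in the text.
    """
    percentile = rank / category_size   # 0.0 = best possible, 1.0 = worst
    if percentile <= 0.10:
        return "D1"
    if percentile <= 0.25:
        return "Q1"
    if percentile <= 0.50:
        return "Q2"
    if percentile <= 0.75:
        return "Q3"
    return "Q4"

# A journal ranked 12th of 160 in its (hypothetical) category:
print(journal_class(12, 160))  # 12/160 = 0.075 -> 'D1'
```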

The citation statistics of thousands of journals are collated and published in the Journal Citation Reports (JCR), issued yearly by Web of Science. These statistics, available only by subscription, are widely known, popularised, and used for various purposes. Recently, a few alternatives have emerged. Scopus (www.scopus.com) collects citations and various scientometric indices from the Internet, but its coverage of the literature is limited. This is a for-payment service, but the freely available program “Publish or Perish” (see above) calculates numerous citation statistics using information from the free database Google Scholar. Harzing runs a well-maintained website and has published a book (Harzing, 2010) that describes many of the advantages and disadvantages of using scientometric indices. Google Scholar itself can also calculate scientometric indices for any registered visitor. Both platforms are less English-biased than Web of Science.

Citations have become the dominant way of measuring scientific impact, and various statistics related to them are followed, counted, collected, documented and used by scientists themselves, as well as by journals and various science-related organisations. Citations are also being manipulated in various ways, the easiest of which is self-citation. This is done by journals as well as individual scientists, and consequently, today, there is a distinction between “independent” and “dependent” citations. A citation counts as independent if no author of the citing document is an author on the cited document. If even one of the cited authors is also a citing author, this is counted as dependent or self-citation.
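
The rule is mechanical and easy to state precisely. A minimal sketch (the author names are invented) that applies the definition above, treating a citation as independent only when the two author lists do not overlap:

```python
def is_independent(citing_authors, cited_authors):
    """A citation is independent if no author appears on both papers."""
    return not (set(citing_authors) & set(cited_authors))

# Hypothetical example: one shared author makes the citation a self-citation.
citing = {"A. Smith", "B. Jones"}
cited = {"B. Jones", "C. Kovács"}
print(is_independent(citing, cited))  # False: B. Jones is on both papers
```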

In general, there is much to resent in the superficial use of scientometric indices, and scientists must engage with science administrators to increase the mutual understanding of the benefits and limits of these methods. I suggest that readers familiarise themselves with the basics of scientometrics and become aware of some of the major controversies, because the use of such statistics is not going to disappear from science. The field is fast developing, with a major academic journal, Scientometrics, and numerous books (e.g. Vinkler, 2010) dedicated to the topic. The misuse of scientometrics led to the San Francisco Declaration on Research Assessment (DORA), which provides guidance to the various parties engaged in science, from practice to policy (see https://sfdora.org/).
