Why Elsevier’s “solution” is the problem

It has been an interesting week since I posted my story on how Elsevier has hampered my research (I described my research here). This website went down at some point because the server load became too much when the Techdirt story was ranking high on Hacker News.

tl;dr I will not agree to Elsevier’s TDM policy, as it harms individual researchers and the impact of research on society.

Since I posted my story, Elsevier has contacted my library stating I can easily continue my research by using their API under their Text and Data Mining (TDM) policy. This usually requires a separate signed agreement with the university, but they have offered to give me a personal API code without such an agreement. Note that all work I would do with the API would still fall under their TDM policy.

The TDM policy directly threatens my academic freedom, and as such I will not agree to it. Elsevier describes its policy here and says

When researchers have completed their text-mining project through the API, the output can be used for non-commercial purposes under a CC BY-NC license

which means Elsevier is imposing a license on my research output. This prevents me from publishing in journals such as PeerJ and the PLOS journals, because they publish their research articles under a public license that allows commercial use (i.e., CC0 or CC BY). I consider it my own choice where and how I want to publish.

The imposed non-commercial license also unnecessarily restricts the impact and re-use of my publicly funded research.

First, the distinction between commercial and non-commercial use is highly ambiguous and discourages re-use. For instance, if a blogger wants to upload my paper to her personal website, but the blog includes advertisements, would this be considered commercial or non-commercial? It seems to me that this ambiguity would lead some to choose the safe route and not re-use at all.

Second, clearly defined commercial entities now cannot distribute my work, while I, the author of the work, want them to be able to. Re-use possibilities are often unforeseen, and I will not forgo these possibilities by having to assign a non-commercial license. For example, if a financial fraud detection company wants to print my research output and use it in a workshop, they would not be able to under a non-commercial license. Academics will still be able to read it, but the impact of research is larger when both non-commercial and commercial entities can use the knowledge to benefit society.

Elsevier is trying to force their API on me with the argument that scraping the website would overload the server. I have shown that the server load need not be large (my scraping cost only 35KB/s, which is less than streaming a typical YouTube video). As a commenter on the original post mentioned, the Wikipedia API only “ask[s] that you be considerate and try not to take a site down.” Elsevier could institute a similarly non-restrictive policy on screen-scraping.

Elsevier’s API is also incomplete: it does not, for instance, include images, which are vital to my research. As such, what Elsevier is offering does not allow me to do what I was doing when scraping the webpage directly. Besides imposing a license that threatens academic freedom, the API is simply insufficient.

It seems that Elsevier’s TDM policy does not have researchers’ interests at heart, and I can imagine this is one of the reasons library associations, for instance LIBER and the Dutch University- and Royal Libraries, do not agree with Elsevier’s TDM license. For the reasons outlined above, I will not agree to Elsevier’s TDM policy, as it harms me as a researcher and the impact of research on society.

Why I content mine

The prevalence of data fabrication in science has only been investigated via self-report surveys, which incorporate severe biases and only assess the prevalence amongst researchers. Statistical results reported in a paper can instead be analyzed for overconsistency, which can indicate potential data fabrication: humans are bad at generating random sequences (e.g., many people assess the probability of tails as larger than .5 if previous flips resulted in heads only), and this reflects itself in fabricated data.
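The intuition about human-generated sequences can be made concrete with a runs test (Wald–Wolfowitz): people who fabricate coin flips tend to alternate too often, producing more runs than chance predicts. Below is a minimal sketch of that idea; the sequence and the test itself are illustrative, not a description of my actual detection method.

```python
import math

def runs_z_score(seq):
    """Wald-Wolfowitz runs test: z-score for the number of runs in a
    two-valued sequence. A large positive z means the sequence alternates
    more often than chance predicts (too many runs, i.e. over-consistent)."""
    n = len(seq)
    n1 = seq.count(seq[0])          # occurrences of the first symbol
    n2 = n - n1                     # occurrences of the other symbol
    runs = 1 + sum(a != b for a, b in zip(seq, seq[1:]))
    mu = 2 * n1 * n2 / n + 1        # expected number of runs under randomness
    var = 2 * n1 * n2 * (2 * n1 * n2 - n) / (n * n * (n - 1))
    return (runs - mu) / math.sqrt(var)

# A perfectly alternating "coin" is suspiciously over-consistent:
print(round(runs_z_score("HT" * 10), 2))  # → 4.14
```

A fair coin would give a z-score near zero; the strongly positive value flags the alternation pattern that humans tend to produce when inventing data.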

To apply these statistical methods to detect data fabrication, the data reported in a paper must first be extracted. To this end, I turned to content mining research articles in the literature. Raw and summary data are reported throughout a paper: in the main text, in the footnotes, in figures, and in tables. Some results can be readily extracted by both humans and machines when they are reported in a standardized format, but machine-reading gives a massive efficiency increase and widens the scope. Other data, such as the values underlying figures, necessitate machine-reading because their extraction requires pixel-level precision.
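As an illustration of how standardized reporting enables machine-reading: APA style prescribes formats like `t(28) = 2.20, p < .05`, which a simple regular expression can pull out of running text. This is a hedged sketch; the actual extraction pipeline is more elaborate, and the pattern below covers only t-tests.

```python
import re

# Matches APA-style t-test reports such as "t(28) = 2.20, p < .05".
APA_T_TEST = re.compile(
    r"t\((?P<df>\d+)\)\s*=\s*(?P<stat>-?\d+\.\d+),\s*"
    r"p\s*(?P<rel>[<=>])\s*(?P<p>\.\d+)"
)

def extract_t_tests(text):
    """Return (df, statistic, relation, p) tuples found in text."""
    return [
        (int(m["df"]), float(m["stat"]), m["rel"], float(m["p"]))
        for m in APA_T_TEST.finditer(text)
    ]

sentence = "The groups differed significantly, t(28) = 2.20, p < .05."
print(extract_t_tests(sentence))  # → [(28, 2.2, '<', 0.05)]
```

Once results are in this structured form, consistency checks (e.g., recomputing the p-value from the test statistic and degrees of freedom) become a matter of looping over tuples rather than re-reading papers by hand.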

With the extracted results, I looked to test whether results in a paper were potentially fabricated. I aimed to do this for all papers in a field, to estimate the prevalence of data fabrication via an innovative method that decreases the bias due to self-report and allows us to assess the prevalence of data fabrication amongst research articles.

Elsevier stopped me doing my research

I am a statistician interested in detecting potentially problematic research such as data fabrication, which results in unreliable findings and can harm policy-making, confound funding decisions, and hamper research progress.

To this end, I am content mining results reported in the psychology literature. Content mining the literature is a valuable avenue for investigating research questions with innovative methods. For example, our research group has written an automated program to mine research papers for errors in the reported results and found that 1 in 8 papers (out of 30,000) contains at least one result that could directly influence the substantive conclusion [1].

In new research, I am trying to extract test results, figures, tables, and other information reported in papers throughout the majority of the psychology literature. As such, I need the research papers published in psychology that I can mine for these data. To this end, I started ‘bulk’ downloading research papers from, for instance, ScienceDirect. I was doing this for scholarly purposes and took potential server load into account by limiting the number of papers I downloaded to 9 per minute. I had no intention of redistributing the downloaded materials, had legal access to them because my university pays for a subscription, and I only wanted to extract facts from these papers.
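Rate limiting of this kind, at most nine papers per minute, amounts to spacing requests at least 60/9 ≈ 6.7 seconds apart. A minimal sketch of such a throttle follows; the `fetch` callable and the URLs are placeholders for illustration, not the code I actually ran.

```python
import time

def throttled_fetch(urls, fetch, per_minute=9):
    """Call fetch(url) for each URL, spacing requests so that at most
    `per_minute` requests are issued per minute."""
    interval = 60.0 / per_minute    # minimum seconds between request starts
    results = []
    for url in urls:
        started = time.monotonic()
        results.append(fetch(url))
        elapsed = time.monotonic() - started
        if elapsed < interval:      # request finished early: wait it out
            time.sleep(interval - elapsed)
    return results

# Placeholder fetch: a real run would issue an HTTP GET instead.
papers = throttled_fetch(["paper-1", "paper-2"], fetch=str.upper,
                         per_minute=600)  # fast rate to keep the demo short
print(papers)  # → ['PAPER-1', 'PAPER-2']
```

Sleeping for the remainder of each interval keeps the request rate bounded even when individual downloads return quickly.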

Full disclosure: I downloaded approximately 30GB of data from ScienceDirect in approximately 10 days. This boils down to a server load of 35KB/s, or 0.0021GB/min, 0.125GB/h, 3GB/day.
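The equivalence of those figures is straightforward arithmetic (using decimal units, 1GB = 10^6 KB); a quick check:

```python
TOTAL_GB = 30   # total volume downloaded
DAYS = 10       # over this many days

gb_per_day = TOTAL_GB / DAYS         # 3 GB/day
gb_per_hour = gb_per_day / 24        # 0.125 GB/h
gb_per_min = gb_per_hour / 60        # ~0.0021 GB/min
kb_per_sec = gb_per_min * 1e6 / 60   # ~35 KB/s (1 GB = 1e6 KB)

print(gb_per_day, gb_per_hour, round(gb_per_min, 4), round(kb_per_sec))
# → 3.0 0.125 0.0021 35
```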

Approximately two weeks after I started downloading psychology research papers, Elsevier notified my university that this was a violation of the access contract, that this could be considered stealing of content, and that they wanted it to stop. My librarian explicitly instructed me to stop downloading (which I did immediately); otherwise Elsevier would cut all access to ScienceDirect for my university.

I am now unable to mine a substantial part of the literature, and because of this Elsevier is directly hampering my research.

[1] Nuijten, M. B., Hartgerink, C. H. J., van Assen, M. A. L. M., Epskamp, S., & Wicherts, J. M. (2015). The prevalence of statistical reporting errors in psychology (1985–2013). Behavior Research Methods, 1–22. doi: 10.3758/s13428-015-0664-2

[MINOR EDITS: the link to the article was broken, should be fixed now. Also, I made the mistake of using "0.0021GB/s" which is now changed into "0.0021GB/min"; I also added "35KB/s" for completeness. One last thing: I am aware of Elsevier's TDM License agreement, and I nonetheless thank those who directed me towards it.]