Why Elsevier’s “solution” is the problem

It has been an interesting week since I posted my story on how Elsevier has hampered my research (I described my research here). This website went down at some point because the server load became too much when the Techdirt story was ranking high on Hacker News.

tl;dr I will not agree with Elsevier’s TDM policy as it harms individual researchers and the impact of research on society.

Since I posted my story, Elsevier has contacted my library stating I can easily continue my research by using their API under their Text and Data Mining (TDM) policy. This usually requires a separate signed agreement with the university, but they have offered to give me a personal API code without such an agreement. Note that all work I would do with the API would still fall under their TDM policy.

The TDM policy directly threatens my academic freedom, and as such I will not agree to it. Elsevier describes its policy here and says:

When researchers have completed their text-mining project through the API, the output can be used for non-commercial purposes under a CC BY-NC license

which means Elsevier is imposing a license on my research output. This prevents me from publishing in journals such as PeerJ and the PLOS journals because they publish their research articles with a public license that allows for commercial use (i.e., CC-0 or CC-BY). I consider it my own choice where and how I want to publish.

The imposed non-commercial license also unnecessarily restricts the impact and re-use of my publicly funded research.

First, the distinction between commercial and non-commercial is highly ambiguous and discourages re-use. For instance, if a blogger wants to upload my paper to her personal website, but the blog includes advertisements, would this be considered commercial or non-commercial? It seems to me that this ambiguity would lead some to choose the safe route and not re-use the work at all.

Second, clearly defined commercial entities now cannot distribute my work, while I, the author of the work, want them to be able to. Re-use possibilities are often unforeseen, and I will not forgo these possibilities by having to assign a non-commercial license. For example, if a financial fraud detection company wants to print my research output and use it in a workshop, they would not be able to with a non-commercial license. Academics will still be able to read it, but the impact of research is larger when both non-commercial and commercial entities can use the knowledge to benefit society.

Elsevier is trying to force their API on me using the argument that scraping the website would overload the server. I have shown that the server load need not be large (what I am doing used only 35KB/s, which is less than streaming a typical YouTube video). As a commenter on the original post mentioned, the Wikipedia API only “ask[s] that you be considerate and try not to take a site down.” Elsevier could institute a similar non-restrictive policy on screen-scraping.
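
To illustrate how a scraper can keep to a fixed bandwidth budget, here is a minimal sketch in Python. It is not the code used for this research: the function names, the `fetch` callable, and the reading of "35KB/s" as 35,000 bytes per second are all assumptions made for illustration.

```python
import time

# Assumption: "35KB/s" is read here as 35,000 bytes per second.
BUDGET_BYTES_PER_SECOND = 35_000


def throttle_delay(bytes_so_far, elapsed_seconds, budget=BUDGET_BYTES_PER_SECOND):
    """Seconds to wait so the average transfer rate stays at or below budget."""
    if budget <= 0:
        raise ValueError("budget must be positive")
    # Minimum time that should have passed for this many bytes at the budget rate.
    required_seconds = bytes_so_far / budget
    return max(0.0, required_seconds - elapsed_seconds)


def polite_download(urls, fetch, budget=BUDGET_BYTES_PER_SECOND,
                    clock=time.monotonic, sleep=time.sleep):
    """Fetch each URL in turn, pausing as needed to respect the byte budget.

    `fetch` is a hypothetical caller-supplied function mapping a URL to bytes.
    """
    start = clock()
    total_bytes = 0
    pages = []
    for url in urls:
        body = fetch(url)
        pages.append(body)
        total_bytes += len(body)
        # Sleep just long enough that the running average drops back to budget.
        delay = throttle_delay(total_bytes, clock() - start, budget)
        if delay > 0:
            sleep(delay)
    return pages
```

The throttle works on the running average rather than per request, so a few large pages are simply followed by a longer pause. Passing `clock` and `sleep` as parameters keeps the pacing logic testable without any network access.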

Elsevier’s API is also incomplete. For instance, it does not include images, which are vital to my research. As such, what Elsevier is offering me does not allow me to do what I was doing when scraping the webpage directly. The API is simply insufficient, besides imposing a license that threatens academic freedom.

It seems like Elsevier’s TDM policy does not have the researcher’s interests at heart and I can imagine this is one of the reasons library associations do not agree with Elsevier’s TDM license, for instance LIBER and the Dutch University- and Royal Libraries. For the reasons outlined above, I will not agree to Elsevier’s TDM policy as it harms me as a researcher and the impact of research on society.

13 thoughts on “Why Elsevier’s “solution” is the problem”

  1. Susan Reilly

    Hi Chris,

    Thanks for mentioning LIBER’s position on the Elsevier TDM policy. It’s so important that researchers such as yourself share their experiences and speak out against the misconceptions out there about what you’re really doing (or not doing) when trying to text and data mine: not crashing servers, not stealing content and sending it to China, just downloading content to which you have legal access at a reasonable pace with the aim of conducting legitimate research in a responsible manner.
    A Europe-wide copyright exception for TDM would provide researchers with legal clarity but parliamentarians claim that there is not enough evidence of demand for this and therefore licence “solutions”, such as the one above, will suffice. This is a plea to ask researchers to sign the Hague Declaration on Knowledge Discovery in the Digital Age (http://thehaguedeclaration.com/) and prove that researchers demand not to be cast as thieves but to be able to perform their research without fear of being shut down by publishers or their academic freedoms being subjugated by unreasonable licence terms and conditions.

    Susan Reilly, Executive Director, LIBER

  2. Ben Verhoeven

    Doesn’t “the output” of the content mining refer to your gathered data instead of your publication? Then the restriction to non-commercial would seem appropriate. You can still publish your paper anywhere you want; your research data, however, cannot be used by commercial entities.

    Anyway, that’s how I would interpret it.

    1. Richard Smith-Unna

      I interpret the ‘output’ to mean the data you extracted during mining. This would usually refer to words or short phrases, numbers, and generally facts, that were automatically extracted from the text. The output is then collectively the dataset of things you collected.

      It is wholly inappropriate for Elsevier to require putting a license of any kind on that information.

      Firstly, because facts are not subject to copyright and such a license would then imply to potential users that they were not allowed to do something that they are in fact legally entitled to do (reuse the data however they choose). This is clearly unethical, and not in the public interest.

      Secondly, because many research funders and journals require data associated with a publication to be published under an open license, and specifically require that it not be NC. CC-BY is the most common requirement, and CC-0 (waiving copyright to the extent possible) is the most ethically sound as it makes it clear there are no restrictions on reuse, which is probably the legal case anyway.


  3. Jörg Prante (@xbib)

    There are other very practical limits of the TDM licensing “solution”.

    If “big paper data” from many hundreds or even thousands of publishers needs to be analyzed, does that mean a scientist has to ask each and every publisher of the papers for a license? And all licenses can be different? And what about orphaned papers, without an author or publisher? So before a scientist can begin with research, it would be an extra burden to conduct license negotiations with several hundreds or thousands of vendors/authors/publishers, also taking the risk of being denied TDM for individual reasons. This could take months or even years. While this might conform with today’s viewpoints of most publisher lawyers, it is against academic freedom. Scientists cannot buy the globally available corpus of published paper data in full text each time they want to scan the data for evidence or to reproduce big data experiments performed by other scientists who were able to “license” TDM. Academic freedom includes the right to access any arbitrary amount of data used in published academic papers, as long as the purpose is to serve the progress of humanity and to create innovative knowledge.

  4. Alicia Wise (@wisealic)

    Hi Chris,

    There seem to be some misunderstandings here which I am writing to clear up. It’s very good to be in contact with you as we are very supportive of text and data mining by researchers, and have supported text and data mining since 2006.

    First, you would be able to publish papers that report on your research. I have personally checked the submission guidelines for PeerJ and PLOS and see nothing within them that would suggest you would be unable to publish in their titles. I am told, however, that you are not the first researcher to draw this unfortunate conclusion and we will take steps to clarify this in our online FAQs.

    Second, I’m not sure if you have seen our FAQs (https://www.elsevier.com/about/company-information/policies/text-and-data-mining/text-and-data-mining-faq) but these explain our guidelines for sharing your TDM corpus and your TDM outputs. The corpus is the collection of downloaded material accessed via Elsevier’s API. This will typically include copyrighted material from books and journals, open access articles and supplementary materials. The TDM outputs are your research results. The TDM corpus should not be shared with third parties. Instead you can share a reference to the contents of your corpus by creating a list of DOIs of the documents contained in your dataset. This will enable other researchers to recreate your corpus by retrieving the same set of DOIs using their own API key. Your TDM outputs can be used by you, and your institution or your organization, and they can also be distributed externally under certain conditions.

    Third, the reason we ask you to use our APIs is nothing at all to do with the legal framework under which mining is done and everything to do with ensuring that we can meet both the needs of our machine readers AND ALSO the needs of our human readers who can continue to read, search and download articles and not have their service interrupted in any way. As you might imagine, the access characteristics of human readers and machines are very different: we have a very large number of the former, however their rate of content downloading is modest; we have a much smaller number of researchers seeking to text and data mine using machines, however their rate of content downloading is much higher. You don’t state the number of articles you downloaded from ScienceDirect, however it appears from the total volume of data transferred that this was much higher than a typical single human reader would download in a similar period. You also suggest that your individual downloading exercise could not possibly have impacted our infrastructure – however that is not our concern – we need to ensure that we can support the simultaneous needs of millions of human readers and dozens of text miners, and in order to serve both use cases efficiently, it makes sense to separate the traffic into different channels optimized for each use case.

    As you acknowledge we are not alone in providing an API for this sort of high-volume access and APIs also are used by PLOS, Wikipedia and Twitter. We appreciate that users might wish to text mine across publisher platforms, and this is why we also participate in the multi-publisher cross-platform text and data mining service offered by CrossRef http://tdmsupport.crossref.org/

    Fourth, you can access images via our APIs. We have a separate image API. When we first launched the TDM API access to the image API was not granted automatically, but we’ve actually changed our system since then so that image API access is now automatically granted.

    Finally, it was suggested in the comments to your original blog post that “our API often fails to work” and it “does not serve the same content as our website”. I’d like to assure you that both the API and the website are backed by the same high-availability production data store, and make use of exactly the same underlying content. In fact, the APIs provide access to the underlying XML which provides a richer, more semantic and less presentational representation of the content than the website. We are not aware of any outages or issues with our APIs in recent weeks, however if you or others have experienced problems, we would like to know about them so that we can investigate. You can reach our API support team at integrationsupport@elsevier.com

    Always happy to talk if helpful!
    – Alicia

    Dr Alicia Wise
    Director of Access & Policy

