Elsevier stopped me doing my research

I am a statistician interested in detecting potentially problematic research such as data fabrication, which results in unreliable findings and can harm policy-making, confound funding decisions, and hamper research progress.

To this end, I am content mining results reported in the psychology literature. Content mining the literature is a valuable avenue for investigating research questions with innovative methods. For example, our research group has written an automated program to mine research papers for errors in the reported results and found that 1 in 8 papers (of 30,000) contains at least one result that could directly influence the substantive conclusion [1].

In new research, I am trying to extract test results, figures, tables, and other information reported in papers throughout the majority of the psychology literature. As such, I need the research papers published in psychology that I can mine for these data. To this end, I started ‘bulk’ downloading research papers from, for instance, ScienceDirect. I was doing this for scholarly purposes and took into account potential server load by limiting the number of papers I downloaded per minute to 9. I had no intention to redistribute the downloaded materials, had legal access to them because my university pays a subscription, and I only wanted to extract facts from these papers.
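To illustrate what a limit of 9 papers per minute looks like in practice, here is a minimal sketch of a throttle that spaces requests at least 60/9 ≈ 6.7 seconds apart. It is an illustration only, not the script used for the downloads described here; the `Throttle` class and its parameters are hypothetical names.

```python
import time


class Throttle:
    """Space calls at least 60 / max_per_minute seconds apart.

    Hypothetical helper for illustration; not the original download script.
    """

    def __init__(self, max_per_minute, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 60.0 / max_per_minute  # seconds between requests
        self._clock = clock    # injectable so the class can be tested without waiting
        self._sleep = sleep
        self._last = None      # time of the previous request, if any

    def wait(self):
        """Block until min_interval seconds have passed since the last call."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

A downloader would create `throttle = Throttle(9)` once and call `throttle.wait()` before each request, so the pacing is enforced no matter how quickly each individual download completes.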

Full disclosure: I downloaded approximately 30GB of data from ScienceDirect in approximately 10 days. This boils down to a server load of roughly 35KB/s, 0.0021GB/min, 0.125GB/h, or 3GB/day.
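For anyone who wants to check these numbers, the conversion can be written out as below. This is a small worked calculation using the approximate totals given above, and it assumes decimal units (1 GB = 1,000,000 KB).

```python
# Approximate totals taken from the text above.
total_gb = 30   # about 30 GB downloaded
days = 10       # over about 10 days

gb_per_day = total_gb / days
gb_per_hour = gb_per_day / 24
gb_per_min = gb_per_hour / 60
# Assumption: decimal units, so 1 GB = 1,000,000 KB.
kb_per_sec = gb_per_min / 60 * 1_000_000

print(f"{gb_per_day:.0f} GB/day")
print(f"{gb_per_hour:.3f} GB/h")
print(f"{gb_per_min:.4f} GB/min")
print(f"{kb_per_sec:.0f} KB/s")
```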

Approximately two weeks after I started downloading psychology research papers, Elsevier notified my university that this was a violation of the access contract, that this could be considered stealing of content, and that they wanted it to stop. My librarian explicitly instructed me to stop downloading (which I did immediately), because otherwise Elsevier would cut all access to ScienceDirect for my university.

I am now unable to mine a substantial part of the literature, and because of this Elsevier is directly hampering my research.

[1] Nuijten, M. B., Hartgerink, C. H. J., van Assen, M. A. L. M., Epskamp, S., & Wicherts, J. M. (2015). The prevalence of statistical reporting errors in psychology (1985–2013). Behavior Research Methods, 1–22. doi: 10.3758/s13428-015-0664-2

[MINOR EDITS: the link to the article was broken, should be fixed now. Also, I made the mistake of using "0.0021GB/s" which is now changed into "0.0021GB/min"; I also added "35KB/s" for completeness. One last thing: I am aware of Elsevier's TDM License agreement, and I nonetheless thank those who directed me towards it.]

39 thoughts on “Elsevier stopped me doing my research”

  1. Alicia Wise (@wisealic)

    Dear Chris,

    We are happy for you to text mine content that we publish via the ScienceDirect API, but not via screen scraping. You can get access to an API key via our developer’s portal (http://dev.elsevier.com/myapikey.html). If you have any questions or problems, do please let me know. If helpful, I am also happy to engage with the librarian who is helping you.

    With kind wishes,
    Alicia

    Dr Alicia Wise
    Director of Access & Policy
    Elsevier
    a.wise@elsevier.com
    @wisealic

    1. Richard Smith-Unna

      Alicia, can you explain why you think downloading should use the API?

      In my case, I can’t accept the Elsevier TDM license since its provisions are unenforceable under the UK copyright exception.

      Quoting the UK government’s guidance on the TDM copyright exception:

      Publishers may wish to apply technological measures on
      networks for a number of purposes such as to ensure security or stability.
      These measures may be for reasons unrelated to text and data mining
      or may, for example, be intended to ensure that all users can access the
      benefits that text and data mining offers researchers. Examples of possible
      measures could be to impose a reasonable limit on download speeds or to
      control the number of times a user can access a network in a given period.
      These measures should not stop or unreasonably restrict any researcher’s
      ability to benefit from the exception.

      Elsevier’s API is unworkable in my experience, often failing to work, and certainly counts as an ‘unreasonable’ restriction. In many cases the API returns only metadata in the XML, compared to the fulltext PDF I can access on the website. Simply downloading the paper via the normal web service for readers is easy – much easier than using the API.

      Beyond that, you need to consider that the content served by the API is not exactly the same as that served by the web server. Under UK law I have the right to perform non-commercial TDM on anything I can read – and I can read the website.

      In addition, the license agreement requires a restrictive statement about reuse of the products of TDM to be attached to any output, but the statement restricts behaviours which are permissible under UK law.

      1. Alicia Wise (@wisealic)

        Hi Richard,

        The reason that we require miners to use the API is so that we can meet their needs AND ALSO the needs of our human users who can continue to read, search and download articles and not have their service interrupted in any way. Under UK legislation, publishers can use “reasonable measures to maintain the stability and security” of their networks, and so the requirement to use this API is fully compatible with the copyright exception.

        Other text miners regularly use the APIs, and I don’t believe we have received reports of the APIs only returning metadata before. How frustrating this must have been for you. I would be very happy to connect you with technical support colleagues who can provide you with assistance or answer any questions you may have.

        You might find our text and data mining page and FAQs of interest: https://www.elsevier.com/about/company-information/policies/text-and-data-mining

        And also this article which explains how our text and data mining services work with the UK copyright exception: https://www.elsevier.com/connect/how-does-elseviers-text-mining-policy-work-with-new-uk-tdm-law

        With kind wishes,
        Alicia

        Dr Alicia Wise
        Director of Access & Policy
        Elsevier
        a.wise@elsevier.com
        @wisealic

        1. Richard Smith-Unna

          Alicia,

          Thank you for your reply.

          My interpretation as a software engineer with 15 years’ experience of running web services, and that of legal scholars we have consulted, is that Elsevier’s API use requirement does not satisfy the condition of being a “reasonable measure to maintain the stability and security” of their networks. There are simpler alternatives that are less obstructive than what Elsevier has in place – rate limiting can be applied easily. Moreover, using the web interface with reasonable rate limits could not possibly impact the user experience of a site with the traffic that Elsevier’s network enjoys. If Elsevier believes that scraping with rate-limits applied impacts the experience of their other users, I challenge you to prove that it does.

          Whilst it was frustrating to receive metadata-only XML, I do not consider it my responsibility to pursue improvement of your system. I have hundreds of content providers to interface with in my work, and the only commonality is that they all have a web presence that can be accessed in a browser.

          By far the easiest way to address this is to use cross-publisher APIs (like crossref, pubmed, and EUPMC) in the first instance. If any of those fails (as in the case of Elsevier), or if a content provider does not provide material via any of those APIs, I fall back to the web interface download alternative. If publishers would like to encourage use of APIs, they should make their content available through the existing systems with as few limitations as technically possible, and without requiring extra publisher-specific steps to be taken.

          It is a simple reality that if your API makes it harder for researchers to do their work, they will make use of their legal right to mine via responsible web scraping.

      2. Anthony Smith

        My reading of the UK law is that it says nothing about reuse of the products of TDM. This makes it weak but it also means that requiring a statement about reuse (however restrictive) cannot restrict behaviours that the law permits.

        If the XML provided by the API falls short of the content in a PDF then that is a shame and I would urge TDM researchers to feed this back and urge Elsevier to fix it. Analysing PDFs scraped from web sites strikes me as a poor use of time and energy that would be better invested in advancing research. Just because you can read a PDF (or a web page) doesn’t mean it is the best foundation for TDM if better alternatives such as XML are available.

        1. Richard Smith-Unna

          Anthony, thank you for your reply.

          Under UK law, copyright and all other intellectual property rights do not apply to facts. Collections of facts might enjoy protection under sui generis database rights, but that rarely applies to the output of mining scientific papers.

          You are absolutely right that adding a statement about reuse cannot legally restrict behaviours that the law permits, but in practice it does exactly that. Most potential users of scientific data are not intellectual property law experts, and on sight of such a statement will simply avoid the data. To add such a statement to my own work would be against the public interest, and unethical.

          You are quite right that XML falling short of the PDF content is a shame. However, especially in the case of older material, PDFs are often the only archive of content available. We have an array of technological approaches to extracting and cleaning data from PDFs, and if they are the only choice, we can work with them quite well. XML is preferable, but not if it means taking a lot of time out to debug APIs with each individual content provider.

          Best,
          Richard

  2. Sebastiaan Mathôt

    Hi Alicia,

    Does this mean that, if you go through the API, you’re allowed to mine the full text of all Elsevier articles that you also have access to via ScienceDirect? Unlimited text mining, in other words, as long as you go through the API.

    If so, then what’s the logic behind not allowing text mining through ScienceDirect? What difference does it make to Elsevier if a researcher chooses to be inefficient in the way he/she mines text? (Assuming that the API is more efficient, which I imagine it is.)

    Cheers,
    Sebastiaan

    1. Alicia Wise (@wisealic)

      Hi Sebastiaan,

      The reason that we require miners to use the API is so that we can meet their needs AND ALSO the needs of our human users who can continue to read, search and download articles and not have their service interrupted in any way. Science Direct holds 11 million pieces of content, shares infrastructure with Scopus, ClinicalKey, and other Elsevier products, and serves millions of researchers. I am told we are not alone in providing an API for this sort of high-volume access and that APIs also are used by others including Wikipedia and Twitter. We appreciate that users might wish to text mine across publisher platforms, and this is why we also participate in the multi-publisher cross-platform text and data mining service offered by CrossRef http://tdmsupport.crossref.org/

      With kind wishes,
      Alicia

      Dr Alicia Wise
      Director of Access and Policy
      Elsevier
      a.wise@elsevier.com
      @wisealic

    2. Ross Mounce

      In response to Sebastiaan, I think there are extremely good reasons not to use the Elsevier API, not least those mentioned by Richard Smith-Unna. For instance, they have rate-limits and restrictive terms & conditions on usage. It is not in any way “unlimited”.

      “Elsevier has chosen to provisionally limit researchers to 10,000 articles per week” — Nature News
      http://www.nature.com/news/elsevier-opens-its-papers-to-text-mining-1.14659

      This is far too restrictive to be useful. I support Chris in his decision not to use Elsevier’s API. I have also done mining work at the Natural History Museum, London on ScienceDirect content and I did not use the Elsevier API. Researchers should be free to choose which tools and methods they use to do research.

      1. Alicia Wise (@wisealic)

        Hi Ross,

        This is incorrect, and there is no hard limit on the number of articles that can be mined per week. We do have some rate limits in place to ensure equal access to the API for all users, but feedback from researchers suggests these are reasonable. You can access up-to-date information about our TDM services here: https://www.elsevier.com/about/company-information/policies/text-and-data-mining/text-and-data-mining-faq

        With kind wishes,
        Alicia

        Dr Alicia Wise
        Director of Access & Policy
        Elsevier
        a.wise@elsevier.com
        @wisealic

        1. Ross Mounce

          Dear Alicia,

          Thank you for your comment. At the moment, Elsevier’s API policy is terribly unclear. You state “there is no hard limit on the number of articles that can be mined per week” – thank you for being so specific. However I am intrigued by your next sentence which is not so specific: “We do have some rate limits…”

          If these unspecified limits are not on number of articles, perhaps they are on bandwidth (or some other property)? It would be extremely helpful if Elsevier was clearer about what its rate limits actually are. Publish this information, clearly! Both on the Elsevier site you linked to and in your comments here, the information given appears to be purposefully vague and unhelpful. I cannot use a service whose limits I honestly still don’t understand.

  3. Pingback: Elsevier stopped me doing my research | Science...

  4. David Roberts

    So if it’s only 9 a minute, what’s stopping 20 of my colleagues downloading an article from ScienceDirect every two minutes for our shared reading group? On the other hand, there could even be hundreds of people at my university alone simultaneously accessing ScienceDirect, thousands across the country, tens of thousands or hundreds of thousands globally. I hope the SD servers can stand up to that. I’m getting worried, given the statements above…

  5. Sebastiaan Mathôt

    Hi Alicia,

    (I cannot seem to re-reply directly to your comment, so I’ll post it like this.)

    First, thanks for taking the time to reply, and giving Elsevier’s point of view. However, I would like to press you a bit on my main question, which you didn’t answer:

    Does this mean that, if you go through the API, you’re allowed to mine the full text of all Elsevier articles that you also have access to via ScienceDirect? Unlimited text mining, in other words, as long as you go through the API.

    If no, then I feel that your reply is disingenuous, suggesting that all researchers need to do is use the API, while this is in fact restricted. On the other hand, if yes, then you have a point. So …? It’s a simple yes/no question.

    Cheers,
    Sebastiaan

  6. John Mark Ockerbloom

    Alicia Wise writes:

    “I am told we are not alone in providing an API for this sort of high-volume access and that APIs also are used by others including Wikipedia and Twitter. ”

    While Wikipedia supports access through an API, they don’t use it as a way to limit access, as Elsevier apparently does. First of all, the Wikimedia API doesn’t have hard limits on access; the documentation simply says “There is no hard and fast limit on read requests, but we ask that you be considerate and try not to take a site down.” (See https://www.mediawiki.org/wiki/API:Etiquette . Some Wikimedia instances can add rate limits, but they’re not built into the API and I’m not aware of Wikipedia imposing a hard limit.)

    Second, Wikipedia regularly makes their full content set available for analysis as well, via direct FTP download or BitTorrent. I use this myself: every month, I download a dump file with all the articles in English Wikipedia, in order to run programs over them that derive data for my Forward to Libraries service. That’s over 5 million articles I get every month, or over 100 times as many articles per month as Elsevier lets researchers download, if Ross Mounce’s figures above are correct.

    In other words, a nonprofit with an annual budget of under $70 million supports full data downloads and still allows its users to “continue to read, search and download articles and not have their service interrupted in any way.” If a company with over $3 billion in annual revenue won’t do the same, it’s not for service-continuity or other technical reasons.

  7. Sebastiaan Mathôt

    I hate to be the devil’s advocate here, but it seems like Alicia is correct: The API indeed allows full access to subscribed content in a way that doesn’t seem much more restrictive than usual. (Although ‘usual’ is very restrictive, of course.) You can see the registration form here:

    https://www.elsevier.com/__data/assets/pdf_file/0012/102234/TDM-sign-up-short-form.pdf

    That’s my understanding of the terms, anyway. And, of course I have no idea whether the API works technically well enough to be useful.

    1. Peter Murray-Rust

      There are many reasons why the API is problematic. The main ones at present are:
      * I have to agree to Elsevier’s terms and conditions (even to look at it)
      * I have to disclose personal details about myself and my research to Elsevier.

      That is before I even know whether the API does what I want it to do.

  8. Pingback: Why Elsevier’s “solution” is the problem | Chris H.J. Hartgerink's Notebook

  9. Pingback: Content-mining; Rights versus Licences | petermr's blog

  10. Jeffrey Beall

    So, the purpose of this blog post is to paint Chris H.J. Hartgerink as the victim of Elsevier and therefore an open-access hero. Nicely done, Chris. In reality, it’s just a solipsistic essay that reveals the author’s ignorance about data mining. Fail.

    1. Costa Vakalopoulos

      Solipsism:

      2. Extreme preoccupation with and indulgence of one’s feelings, desires etc; egoistic self-absorption

      Would you mind, Jeffrey, enlightening us all on the API so we might share your vision?

    2. Mike Taylor

      Yes, it’s a real shame that content-mining specialist Chris Hartgerink is so ignorant about data mining compared with anti-OA trolling specialist Jeffrey Beall. If only Chris could have had Jeffrey’s skills and experience, all this would have been so much better. Elsevier would never have cut off Jeffrey’s access! Silly Chris.

  11. Pingback: Content-mining; Why do Publishers insist on APIs and forbid screen scraping? | petermr's blog

  12. Pingback: Press and blog review | Blog @HEC Paris Library

  13. Pingback: Corporate censorship of academic research | Pearltrees

  14. Pingback: Copyright Reform: C4C Applauds, Regrets and Opposes | C4C

  15. Pingback: Green Tea and Velociraptors | How to write to your MEPs about European Copyright reform

  16. Pingback: Wiley also stopped me doing my research | Open Notebook Science Network

  17. Pingback: Wiley also stopped my doing my research | Chris H.J. Hartgerink's Notebook

  18. Pingback: Impact of Social Sciences – Announcing OpenCon 2016: Catalyzing collective action for a more open scholarly system.

  19. Pingback: Did I just ‘make’ all of APA Open Access? | Chris H.J. Hartgerink's Notebook
