Manual text and data mining not appreciated by publisher

I will introduce myself first, since I failed to do so in my first blog post: I am an organizational researcher, mostly interested in inter-organizational relationships, networks, and organizational learning. As a hobby, I like to invest some time in meta-science as well. This time I will focus on the hobby.

I recently started working on a project with Chris Hartgerink in which we check how many statistical errors are made in articles from the organizational sciences. To this end, we intended to download empirical articles from a range of journals and publishers so that we could extract statistical test results. Given that Chris has some experience with text and data mining (TDM) on a large scale (and especially with publishers), I started investigating the terms and conditions of some publishers in my own field. To be clear: the downloading was for research purposes only and the articles would not be redistributed.

When reading the terms and conditions of the INFORMS website, a major publisher in management science, I saw that we could not “systematically download (robots or other automated processes)” articles. Therefore, I elected to manually download the articles from one of their journals. I downloaded a total of 429 articles in two days (on the first day I downloaded 96 articles in 21 minutes; on the second day, 333 articles in 5 hours).

To my surprise, INFORMS notified my university after the first day of downloading that my IP address had been blocked from accessing all of the publisher's content (including journals that I did not download from). Our librarian team informed me of the notice a few days later and instructed me to stop downloading (which I did). They told me that downloading for TDM required an additional agreement with the publisher, at different tariffs. Note that the original email from INFORMS stated “100 PDFs have been downloaded within a single session from this IP address, which is excessive and a violation of permissible activity.”

In conclusion, I thought I followed the publisher’s terms and conditions by manually downloading content from their website. Apparently, this was not the case. I do not understand why I can read the papers one-by-one myself, but am not allowed to save these for computer reading.

Assessing the use of #icanhazpdf

When researchers, journalists, or other citizens are denied access to (parts of) the scientific literature, alternative access routes come into existence as an act of civil disobedience, also called guerilla Open Access [1]. Access is typically denied when a person does not have a subscription to the journal in which the article is published. As a consequence, the person is presented with a paywall without fully knowing what the value of the paper is. Deciding whether to pay for it is then difficult.

Guerilla Open Access tries to circumvent these paywalls and grants users access to the full article despite the paywall. The legality of this can be debated, but the fact remains that forms of guerilla Open Access have emerged in the recent past, which indicates that people are being denied access to the results of scientific research. Moreover, the outrage in the academic community indicates that the morality of the current system is in doubt.

One of the alternative access routes that has come into existence is to request articles, via Twitter, from those who do have access. In 2011, Andrea Kuszewski ’invented’ a catchphrase that Twitter users could use to request articles they could not reach because they were behind paywalls [2]. This catchphrase, #icanhazpdf, allows Twitter users to post their request publicly, after which a reader who can access the article may send it to the requester. In a sense, #icanhazpdf is a peer-to-peer method of accessing scholarly articles. Because the requests use a hashtag, they are easily searchable by users who have access, which allows academic articles to be shared amongst readers.

Results

Gardner and Gardner [3] conducted a study of the #icanhazpdf hashtag from the end of April 2014 to the beginning of August 2014. We conducted a similar study, but our data run from 18 August 2015 to 12 December 2015. We used IFTTT (ifttt.co) to collect tweets using the #icanhazpdf hashtag. This captured the tweets in real time, which prevented a systematic bias due to #icanhazpdf tweets being deleted by the original poster after the request is fulfilled.

The total number of tweets in this period was 9765. After the data were collected, we first excluded all retweets (6119). Subsequently, we both manually coded all remaining tweets, because we knew some tweets were about the hashtag instead of actually requesting a paper. When we disagreed, we simply treated the tweet as not a clear request. This occurred in 294 cases [4]. Our results are thus likely a conservative estimate of the use of the #icanhazpdf hashtag.
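To make the coding step above concrete, here is a minimal Python sketch of how the agreement between two coders (the Cohen's kappa in footnote [4]) and the conservative "disagreement means no request" rule could be computed. The function name `cohens_kappa` and the six example labels are made up for illustration; this is not the script or the data used for this post.

```python
from collections import Counter

def cohens_kappa(coder_a, coder_b):
    """Cohen's kappa for two coders labelling the same items."""
    if len(coder_a) != len(coder_b) or not coder_a:
        raise ValueError("both coders must label the same non-empty set of items")
    n = len(coder_a)
    # Observed agreement: share of items on which both coders gave the same label.
    observed = sum(a == b for a, b in zip(coder_a, coder_b)) / n
    # Chance agreement: product of the coders' marginal label proportions, summed over labels.
    freq_a = Counter(coder_a)
    freq_b = Counter(coder_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1:
        return 1.0  # both coders used one identical label for every item
    return (observed - expected) / (1 - expected)

# Hypothetical labels for six tweets (True = actual request); NOT the study data.
coder_a = [True, True, False, False, True, False]
coder_b = [True, False, False, False, True, False]

kappa = cohens_kappa(coder_a, coder_b)

# Conservative rule from the text: a tweet counts as a request only if both coders agree it is one.
final_labels = [a and b for a, b in zip(coder_a, coder_b)]

print(f"kappa = {kappa:.3f}, requests = {sum(final_labels)}")
```

Treating every disagreement as "not a request" can only lower the request count, which is why the resulting numbers are a conservative estimate.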

The total number of actual requests in this period was 2121. Plotting these requests over time (figure below) shows a clear upward trend.

fig1

Furthermore, a spike is visible at the end of October. We attribute this to a news feature on the BBC [5; release date depicted with a horizontal line in the figure] and other publicity during Open Access week (19 to 25 October 2015). During Open Access week, an average of 24.86 requests were made per day, whereas before Open Access week the average was only 9.71 requests per day.

Considering that we collected 9765 tweets, of which only 2121 were requests, it seems that the #icanhazpdf feed is relatively clogged by retweets and sharing of news stories. However, when we compare our results to those of Gardner and Gardner [3], we notice that the usage of #icanhazpdf has increased. They found a total of 750 requests, while we found 2121 in roughly the same time span. Together with the figure above, this suggests that the use of #icanhazpdf is on the rise.

However, when we fit a loess curve to our data, we see that the media attention led to a short-lived increase in usage, which seems to fade back to normal usage during December.
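For readers unfamiliar with loess: the idea is to fit a small weighted regression around each point, so that a short spike is smoothed without forcing one global trend on the data. Below is a simplified pure-Python sketch (local linear fits with tricube weights, no robustness iterations). The function name `loess_smooth`, the `frac` value, and the daily counts are all illustrative assumptions; the settings behind the curve in this post are not stated here, and standard loess implementations differ in their defaults.

```python
def loess_smooth(xs, ys, frac=0.5):
    """Simplified loess: one tricube-weighted linear fit per x value."""
    n = len(xs)
    k = max(2, int(round(frac * n)))  # number of neighbours used in each local fit
    fitted = []
    for x0 in xs:
        # Bandwidth: distance from x0 to its k-th nearest point (x0 itself included).
        dists = sorted(abs(x - x0) for x in xs)
        h = dists[k - 1] or 1.0
        # Tricube weights: 1 at x0, falling to 0 at distance h and beyond.
        w = [(1 - min(abs(x - x0) / h, 1.0) ** 3) ** 3 for x in xs]
        sw = sum(w)
        mx = sum(wi * x for wi, x in zip(w, xs)) / sw
        my = sum(wi * y for wi, y in zip(w, ys)) / sw
        sxx = sum(wi * (x - mx) ** 2 for wi, x in zip(w, xs))
        sxy = sum(wi * (x - mx) * (y - my) for wi, x, y in zip(w, xs, ys))
        slope = sxy / sxx if sxx > 0 else 0.0
        # Value of the local weighted regression line at x0.
        fitted.append(my + slope * (x0 - mx))
    return fitted

# Made-up daily request counts with a one-day spike; NOT the study data.
days = list(range(11))
counts = [10, 10, 10, 10, 10, 30, 10, 10, 10, 10, 10]
smoothed = loess_smooth(days, counts, frac=0.5)

for day, raw, fit in zip(days, counts, smoothed):
    print(day, raw, round(fit, 1))
```

In this toy series the smoothed value on the spike day lies between the baseline and the raw peak, and the curve returns to the baseline a few days later, which is the pattern described above.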

fig2

Thus, the use of #icanhazpdf seems to have increased over time, but this might simply be due to increased publicity during Open Access week. Additionally, only 22% of all tweets with #icanhazpdf are actual requests for research papers. Moreover, the number of requests via #icanhazpdf (i.e., 2121) pales in comparison to the number of papers accessed via Sci-hub (i.e., 217,276; [6]).
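The proportions above follow directly from the counts reported in this post. A short Python sketch of the arithmetic is given here; the share among non-retweets and the Sci-hub ratio are derived in the sketch and are not figures we reported separately.

```python
# Counts reported in this post.
total_tweets = 9765
retweets = 6119
requests = 2121
sci_hub_downloads = 217_276  # figure taken from reference [6]

original_tweets = total_tweets - retweets        # tweets left for manual coding
share_of_all = requests / total_tweets           # requests among all collected tweets
share_of_original = requests / original_tweets   # requests among non-retweets (derived here)
sci_hub_ratio = sci_hub_downloads / requests     # Sci-hub figure relative to our request count

print(f"original tweets:             {original_tweets}")
print(f"requests among all tweets:   {share_of_all:.1%}")
print(f"requests among non-retweets: {share_of_original:.1%}")
print(f"Sci-hub figure / requests:   {sci_hub_ratio:.0f}x")
```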

P.S. We both support sharing our data openly for verification and reuse. However, we chose to share only anonymized data here [7] because we’d like to protect the anonymity of users of the hashtag.

[1] https://archive.org/stream/GuerillaOpenAccessManifesto/Goamjuly2008_djvu.txt
[2] https://twitter.com/AndreaKuszewski/status/28257118322688000
[3] http://www.ala.org/acrl/sites/ala.org.acrl/files/content/conferences/confsandpreconfs/2015/Gardner.pdf
[4] Cohen’s kappa = 0.8358917
[5] http://www.bbc.co.uk/programmes/p034vd50
[6] https://twitter.com/Sci_Hub/status/699935939502731268
[7] http://github.com/chartgerink/2015icanhazpdf