Awarded Shuttleworth Flash Grant

I am proud to announce that I have been awarded a Shuttleworth Flash Grant. This $5,000 grant is empowering because there are simply no strings attached, except communicating openly about what you do with it (yes: no budgets, no proposals, no record keeping, none of the other tedious aspects of grants that detract from actually doing things with the money).

It is not only empowering because of the lack of bureaucracy; it is also a badge of honor, considering how it is described: “we award a number of small grants to a collection of social change agents, no strings attached, in support of their work.” Being called a change agent sounds like a humongous compliment to me! Additionally, the Shuttleworth Foundation just oozes openness, which adds to the weight I assign to the Foundation.

I am proud to have been chosen as a Flash Grantee and I look forward to finding effective ways to use the grant for change (e.g., copyright reform). I will keep you posted here on what I do with it!


How a professional webpage can harm your privacy

I have a professional webpage, which is what people find when they search for me on Google. Great, right?

Yes, but many academics do not realize that when they register their domain, they potentially make personal information public: specifically, their personal email address, phone number, and home address. Several domain registrars treat keeping this information private as an extra service and charge additional fees, so be sure to check (send me an email if you want me to check your domain). If your registrar charges you for privacy, I suggest you switch.

So the problem is that not all domain registrars keep your personal information private, and if they do not, anyone can simply query the registration records of the website and find it (e.g., with the whois terminal command). I have, somewhat shamefully, tested this on several professional webpages and learned information about my colleagues I did not know before (e.g., home addresses). Depending on the registrar, you can see information like this (a fictitious example, of course):

Registrant Name: John Doe
Registrant Organization:
Registrant Street: Blastreet 12-A
Registrant City: Blacity
Registrant State/Province:
Registrant Postal Code: 42176
Registrant Country: US
Registrant Phone: 001555666777
Registrant Phone Ext:
Registrant Fax:
Registrant Fax Ext:
Registrant Email: john.doe@gmail.com

but when the domain registrar does keep it private (as for my webpage), the record might just refer to the registrar:

[chjh@pandorica ~]$ whois chjh.nl
[Querying whois.domain-registry.nl]
[whois.domain-registry.nl]
Domain name: chjh.nl
Status: active

Registrar:
Hostnet bv
De Ruijterkade 6
1013AA Amsterdam
Netherlands
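
If you would rather check for yourself, the lookup is easy to automate. Below is a minimal sketch in Python, assuming a Unix-like system with the whois command installed; the domain is a placeholder you would swap for your own.

#!/usr/bin/env python3
# Minimal sketch: flag personal details in a domain's whois record.
# Assumes a Unix-like system with the `whois` command installed;
# the domain below is a placeholder.
import subprocess

domain = "example.org"  # replace with your own domain
record = subprocess.run(["whois", domain], capture_output=True, text=True).stdout

# Fields that commonly leak personal information
sensitive = ("Registrant Name", "Registrant Street", "Registrant Phone", "Registrant Email")
for line in record.splitlines():
    if line.strip().startswith(sensitive):
        print("Possibly exposed:", line.strip())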

Please be sure to check your professional page, or let me know if you would like me to check for you (I promise I won’t save any of the information). Despite the push for Open Data in research, I think privacy for researchers is still warranted, especially if you want to avoid being harassed when you do research some might find controversial.

EDIT: I was notified that in some countries publishing this information is mandatory. For example, in Germany the Impressumspflicht requires it.

European Open Science Policy candidacy

In line with the high priority the European Commission has put on Open Science, the Directorate-General (DG) is currently working on forming an Open Science Policy Platform (OSPP). Its members will help build European policies on Open Science, ranging from inducing cultural change towards Open Science to regulating Open Science in European funding. I would like to put myself forward as a candidate to represent early career researchers in this platform. I am drafting my candidacy letter in the open (comments welcome). If you are willing to endorse me, please comment on this post, or tweet about this post mentioning my handle, @chartgerink, saying you endorse my candidacy for the Open Science Policy Platform. Each endorsement counts (honestly: only those prior to Tuesday, March 22 count, because the application is due then).

The call for candidates explicitly states that they are looking for “high-level” experts with policy experience, which means this platform runs the risk of ignoring the interests of early career researchers such as PhD students and post-docs. We as early career researchers are by definition not “high-level” and lack policy experience, yet we are the ones who will be most affected by the renewed policies (and potentially for the longest, because our careers are hopefully still ahead of us).

For example, if European funding becomes subject to Open Science regulations, we as early career researchers will be the ones who have to figure out how to conduct research in an open fashion. Principal Investigators (PIs) receive the European grants that carry Open Science policies, but have PhD students and post-docs conduct a large part of the research. As a consequence, we are saddled with the responsibility of putting Open Science into practice. PIs often have little idea how to do this, because they were never educated in this manner. Thorough Open Science requires knowledge of all procedures and steps in the research process, which is difficult when you are supervising the research (PIs) instead of conducting it (PhD students and post-docs).

I therefore think that the OSPP needs an early career researcher as a representative (whoever it may be), someone who knows the intricacies of putting Open Science into practice. I have been conducting my research in an open fashion since before my PhD and have found that Open Science can be easy, but requires proper training. Currently, that training is missing. If the Commission leaves the policy-making to “high-level”, experienced researchers only, without the input of early career researchers, I believe a successful cultural change to Open Science will be immensely difficult. I would like to help it succeed and partake in this platform as an early career researcher.

Wiley also stopped me doing my research

In November, I wrote about how Elsevier wanted me to stop downloading scientific articles for my research. Today, Wiley also ordered me to stop downloading.

As a quick recapitulation: I am a statistician doing research into detecting potentially problematic research such as data fabrication and estimating how often it occurs. For this, I need to download many scientific articles, because my research applies content mining methods that extract facts from them (e.g., test statistics). These facts serve as my data to answer my research questions. If I cannot download these research articles, I cannot collect the data I need to do my research.

I was downloading psychology research articles from the Wiley library at a maximum rate of 5 per minute, using the tool quickscrape, developed by the ContentMine organization. In total, I downloaded approximately 18,680 research articles this way, solely for research purposes.
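
quickscrape did the actual work; purely to illustrate what such rate-limited downloading amounts to, here is a minimal Python sketch (the URLs are hypothetical placeholders, not Wiley's actual links):

#!/usr/bin/env python3
# Minimal sketch of rate-limited downloading (max 5 requests per minute).
# An illustration only, not quickscrape itself; the URLs are placeholders.
import time
import urllib.request

urls = [
    "https://example.org/article-1.pdf",
    "https://example.org/article-2.pdf",
]

for i, url in enumerate(urls):
    with urllib.request.urlopen(url) as response:
        data = response.read()
    with open(f"article_{i}.pdf", "wb") as f:
        f.write(data)
    time.sleep(12)  # 60 s / 5 requests = one request every 12 seconds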

Wiley noticed my downloading and notified my university library that they detected a compromised proxy, which they had immediately restricted. They called it “illegally downloading copyrighted content licensed by your institution”. However, at no point was there any investigation into whether my user credentials were actually compromised (they were not). Whether I had legitimate reasons to download these articles was never discussed. The original email from Wiley is available here.

As a result of Wiley denying me these downloads, I cannot collect data from yet another of the big publishers, alongside Elsevier. Wiley is stricter than Elsevier, immediately condemning the downloading as illegal, whereas Elsevier offers an (inadequate) API with additional terms of use (even though legitimate access had already been obtained). I am genuinely confused about where publishers stand on content mining, because Sage and Springer seemingly allow it; I have downloaded 150,210 research articles from Springer and 12,971 from Sage, and they never complained.

Writing to the European Commission and European Parliament

During OpenCon 2015, several community members joined forces to write a letter to European Commission President Jean-Claude Juncker. The resulting letter was sent on November 17, 2015, and I received a response last week. It is good to see that the letter we sent was noted with interest. This was not the only letter that has been sent to members of the European Commission or Parliament.

As mentioned in my previous blog post, the European Commission has announced that they will be proposing legislation to reform copyright on many fronts, including copyright exceptions for content mining. This is a great first step, and the steps that follow can be great as well.

Even though the European Commission has now announced their action plan, which incorporates exceptions for content mining, I urge those engaged with copyright reform to write to their MEPs or the European Commission in the next few months to spell out what should go into the legislation. For example, the category ‘public interest research organizations’ does not promote legal certainty and excludes citizen scientists, freelance scientists, and many more. Writing to your representatives helps get this on their radar and hopefully results in better legislation. I will be writing my letter soon and will share it here.

To simplify finding your representatives in the European Parliament, here is a link to the website where you can select your country and see all parliament members. A link to all European Commission members can be found here and the specific Commission members responsible for copyright reforms here.

My thoughts on the European Commission’s TDM reforms

Today, December 9, 2015, the European Commission (EUC) announced their action plan for copyright reform within the Digital Single Market strategy. Responses are coming from many sides, including the League of European Research Universities (LERU) and Copyright for Creativity. I outline my primary thoughts on the action plan below.

I need copyright reform to carry on my research legally, and I am therefore glad to see the European Commission commit to it publicly. The results of that commitment will show over the next six months, during which the legislative proposals enacting the EUC’s intentions will be developed. We have to stay alert to ensure the copyright reforms turn out properly.

First, the legislative proposals considered in the next six months have to take all stakeholders into account. The EUC text currently states that legislative proposals will be considered in light of their impact on the publishing market. This is a lopsided perspective on the problem: the impact on the research enterprise of not reforming copyright should also be considered.

It is also important for these impact assessments to be evidence based, because arguments are being put forth that have no proper foundation. For example, the argument Elsevier put forth for not allowing me to mine their articles was server load, yet I have shown that they should be able to handle it easily (it was less load than streaming a typical YouTube video), a point backed by a strong technical analysis by Cameron Neylon on the PLOS blog.

Second, we need to look out for renewed legal uncertainty caused by restricting who gets to mine content. The EUC text states that reforms are to provide legal certainty, but limits content mining reform to “public interest research organisations”. This would disqualify citizen scientists, freelance researchers, and others from using content mining methods, and it actually increases legal uncertainty by restricting TDM to the vague label of “public interest research organisation”. Additionally, the EUC text limits TDM activities to non-commercial research purposes, which creates uncertainty about the re-use rights of the research output. The EUC proposal aims to reduce legal uncertainty, but risks increasing it if the reforms are not done properly.

Third, but not least, the EUC mentioned earlier that they might reconsider the “database directive”, which causes problems for TDM of non-copyrighted materials (not included in a database). They have not mentioned whether this is still on the table, which is something to look out for.

All in all, the proposed copyright reform plan by the European Commission is a good first step down a long path. We have to be vigilant and ensure that copyright reform will not go astray.

Why Elsevier’s “solution” is the problem

It has been an interesting week since I posted my story on how Elsevier has hampered my research (I described my research here). This website went down at some point because the server load became a bit too much when the Techdirt story was ranking high on Hacker News.

tl;dr: I will not agree to Elsevier’s TDM policy, as it harms individual researchers and the impact of research on society.

Since I posted my story, Elsevier has contacted my library stating I can easily continue my research by using their API under their Text and Data Mining (TDM) policy. This usually requires a separate signed agreement with the university, but they have offered to give me a personal API code without such an agreement. Note that all work I would do with the API would still fall under their TDM policy.

The TDM policy directly threatens my academic freedom, and as such I will not agree to it. Elsevier describes its policy here and says:

When researchers have completed their text-mining project through the API, the output can be used for non-commercial purposes under a CC BY-NC license

which means Elsevier is imposing a license on my research output. This prevents me from publishing in journals such as PeerJ and the PLOS journals because they publish their research articles with a public license that allows for commercial use (i.e., CC-0 or CC-BY). I consider it my own choice where and how I want to publish.

The imposed non-commercial license also unnecessarily restricts the impact and re-use of my publicly funded research.

First, the differentiation between commercial and non-commercial is highly ambiguous and discourages re-use. For instance, if a blogger wants to upload my paper to her personal website, but the blog includes advertisements, would this be considered commercial or non-commercial? It seems to me that this ambiguity would lead some to choose the safe route and not re-use the work at all.

Second, clearly defined commercial entities now cannot distribute my work, while I, the author of the work, want them to be able to. Re-use possibilities are often unforeseen, and I will not forgo these possibilities by having to assign a non-commercial license. For example, if a financial fraud detection company wants to print my research output and use it in a workshop, they would not be able to under a non-commercial license. Academics will still be able to read it, but the impact of research is larger when both non-commercial and commercial entities can use the knowledge to benefit society.

Elsevier is trying to force their API on me using the argument that scraping the website would overload the server. I have shown that the server load need not be large (what I was doing cost only 35KB/s, which is less than streaming a typical YouTube video). As a commenter on the original post mentioned, the Wikipedia API only “ask[s] that you be considerate and try not to take a site down.” Elsevier could institute a similarly non-restrictive policy on screen-scraping.

Elsevier’s API is also incomplete: it does not, for instance, include images, which are vital to my research. As such, what Elsevier is offering does not allow me to do what I was doing when scraping the webpage directly. The API is simply insufficient, besides imposing a license that threatens academic freedom.

It seems that Elsevier’s TDM policy does not have researchers’ interests at heart, and I can imagine this is one of the reasons library associations, for instance LIBER and the Dutch University- and Royal Libraries, do not agree with Elsevier’s TDM license. For the reasons outlined above, I will not agree to Elsevier’s TDM policy, as it harms me as a researcher and the impact of research on society.

Why I content mine

The prevalence of data fabrication in science has only been investigated via self-report surveys, which incorporate severe biases and only assess the prevalence amongst researchers. An alternative is to analyze the statistical results reported in a paper for overconsistency, which can indicate potential data fabrication: humans are bad at generating random sequences (e.g., many people assess the probability of tails as larger than .5 if previous flips resulted in heads only), and this reflects itself in fabricated data.
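
To make this concrete, below is a minimal sketch of a Wald-Wolfowitz runs test, one classic way of checking whether a binary sequence alternates more often than chance predicts. This is an illustration of the general idea, not my actual analysis code, and the data are made up:

#!/usr/bin/env python3
# Minimal sketch: Wald-Wolfowitz runs test on a binary sequence.
# Fabricated "random" data often alternate too much, yielding more
# runs than chance predicts. The example data are made up.
import math

seq = [1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0]  # suspiciously regular

n1, n2 = seq.count(1), seq.count(0)
n = n1 + n2
runs = 1 + sum(seq[i] != seq[i - 1] for i in range(1, n))

mu = 2 * n1 * n2 / n + 1                                    # expected number of runs
var = 2 * n1 * n2 * (2 * n1 * n2 - n) / (n ** 2 * (n - 1))  # its variance
z = (runs - mu) / math.sqrt(var)

p = math.erfc(abs(z) / math.sqrt(2))  # two-sided p from the normal approximation
print(f"runs = {runs}, expected = {mu:.1f}, z = {z:.2f}, p = {p:.4f}")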

To apply these statistical methods to detect data fabrication, the data reported in a paper must first be extracted. To this end, I turned to content mining research articles in the literature. Raw and summary data are reported throughout a paper: in the main text, in the footnotes, in figures, and in tables. Some results can be readily extracted by both humans and machines when they are reported in a standardized format, but machine-reading gives a massive efficiency increase and widens the scope. Other data, such as data in figures, necessitate machine-reading because their extraction requires pixel-level precision.
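
For example, test results reported in APA style (e.g., “t(28) = 2.20, p = .03”) follow a predictable pattern, so a regular expression can pull them out of the full text. A minimal sketch of this idea, with an invented example sentence:

#!/usr/bin/env python3
# Minimal sketch: extract APA-style t-test results from article text.
# The regular expression and the example sentence are illustrative only.
import re

text = "The effect was significant, t(28) = 2.20, p = .03, as predicted."

# Matches: t(df) = value, p =/</> value
pattern = re.compile(r"t\((\d+)\)\s*=\s*(-?\d+\.?\d*),\s*p\s*([<=>])\s*(\.\d+|\d+\.?\d*)")
for df, t_value, comparison, p_value in pattern.findall(text):
    print(f"df = {df}, t = {t_value}, p {comparison} {p_value}")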

With the extracted results, I aim to test whether the results in a paper were potentially fabricated. Doing this for all papers in a field makes it possible to estimate the prevalence of data fabrication via an innovative method that decreases the bias due to self-report and allows us to assess the prevalence of data fabrication amongst research articles.

Elsevier stopped me doing my research

I am a statistician interested in detecting potentially problematic research such as data fabrication, which results in unreliable findings and can harm policy-making, confound funding decisions, and hamper research progress.

To this end, I am content mining results reported in the psychology literature. Content mining the literature is a valuable avenue for investigating research questions with innovative methods. For example, our research group has written an automated program to mine research papers for errors in the reported results and found that 1 in 8 papers (of 30,000) contains at least one result that could directly influence the substantive conclusion [1].
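
The core of such an error check is straightforward: recompute the p-value from the reported test statistic and degrees of freedom, and compare it with the reported p-value. Here is a minimal sketch of that idea (not our actual program; the numbers are invented):

#!/usr/bin/env python3
# Minimal sketch: check whether a reported p-value is consistent with
# the one recomputed from the reported t statistic and degrees of
# freedom. Illustration only; the numbers are invented.
from scipy import stats

t_value, df, reported_p = 2.20, 28, 0.04  # e.g., "t(28) = 2.20, p = .04"

recomputed_p = 2 * stats.t.sf(abs(t_value), df)  # two-sided p-value

# Allow for rounding of the reported value to two decimals
consistent = abs(recomputed_p - reported_p) <= 0.005
print(f"recomputed p = {recomputed_p:.4f}, reported p = {reported_p}, consistent: {consistent}")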

In new research, I am trying to extract test results, figures, tables, and other information reported in papers throughout the majority of the psychology literature. For that, I need the research papers published in psychology, so I can mine them for these data. To this end, I started ‘bulk’ downloading research papers from, for instance, ScienceDirect. I was doing this for scholarly purposes and took potential server load into account by limiting the number of papers I downloaded per minute to 9. I had no intention of redistributing the downloaded materials, had legal access to them because my university pays for a subscription, and only wanted to extract facts from these papers.

Full disclosure: I downloaded approximately 30GB of data from ScienceDirect in approximately 10 days. This boils down to a server load of 35KB/s, 0.0021GB/min, 0.125GB/h, 3GB/day.

Approximately two weeks after I started downloading psychology research papers, Elsevier notified my university that this was a violation of the access contract, that it could be considered stealing of content, and that they wanted it to stop. My librarian explicitly instructed me to stop downloading (which I did immediately), because otherwise Elsevier would cut all access to ScienceDirect for my university.

I am now unable to mine a substantial part of the literature, and because of this Elsevier is directly hampering my research.

[1] Nuijten, M. B., Hartgerink, C. H. J., van Assen, M. A. L. M., Epskamp, S., & Wicherts, J. M. (2015). The prevalence of statistical reporting errors in psychology (1985–2013). Behavior Research Methods, 1–22. doi: 10.3758/s13428-015-0664-2

[MINOR EDITS: the link to the article was broken, should be fixed now. Also, I made the mistake of using "0.0021GB/s" which is now changed into "0.0021GB/min"; I also added "35KB/s" for completeness. One last thing: I am aware of Elsevier's TDM License agreement, and I nonetheless thank those who directed me towards it.]

Dealing with reference rot

One in five articles has been estimated to suffer from reference rot (Klein et al., 2014), which means that weblinks used as references no longer work. Considering that the Internet is a dynamic place, where pages are created and, more importantly, deleted on a daily basis, reference rot is a real problem when writing papers.

With increasing amounts of blogging and report publishing happening on the Internet, it is becoming more important for academics to tackle this issue. Several options are available, such as the Wayback Machine by the Internet Archive, or Perma.cc. Using the weblinks these services provide for the page you saved decreases the probability of reference rot.

My personal preference is the Wayback Machine, because you can instantly save any webpage that allows crawling. I then simply import the archived link, instead of the original, into my reference library.

Note that Perma.cc is similarly easy, but only saves your link for two years at a time, requiring resaving afterwards, unless the link is, as they call it, ‘vested’. This means that the archived web page and its Perma link become permanently saved. Perma.cc might have benefits over the Wayback Machine through its ties with libraries, but I am not quite clear on that yet.

Also, the Wayback Machine requires less work, because it has a bookmarklet that directly saves a page (Perma.cc will probably get one as well). Just save this link to your bookmarks and you are good to go! I now do this with any webpage I import into my reference manager.
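
If you would rather script it than click a bookmarklet, the Wayback Machine also exposes its saving feature as a simple web endpoint. Below is a minimal sketch, assuming the ‘Save Page Now’ endpoint behaves as it does at the time of writing (error handling omitted):

#!/usr/bin/env python3
# Minimal sketch: archive a URL via the Wayback Machine's "Save Page
# Now" endpoint and print the archived link. Assumes the endpoint
# behaves as documented at the time of writing; error handling omitted.
import urllib.request

url = "https://example.org"  # the page you want to preserve

with urllib.request.urlopen("https://web.archive.org/save/" + url) as response:
    # After redirects, the final URL points at the archived snapshot
    print("Archived at:", response.geturl())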