What do you envision that this kind of technology could do in the foreseeable future?

What do you mean by “this” kind of technology? If you mean computerized tools assisting scholars, I think there is massive potential both in the development of new tools to extract information (for example, what ContentMine is doing) and in their application. Some formidable means are already here. For example, how much time do you spend as a scholar preparing your manuscript for submission? This does not need to cost half a day when there are highly advanced, modern submission managers. The same goes for submitting revisions. Additionally, annotating documents collaboratively on the Internet with hypothes.is is great fun, highly educational, and productive. I could go on and on about the potential of computerized tools for scholars.
Why do you think this kind of computerized statistical policing is necessary in the field of psychology and in science in general?
Again, what is “this kind of computerized statistical policing”? I assume you’re talking about statcheck only for the rest of my answer. Moreover, it is not policing: a spell-checker does not police your grammar, it helps you improve your grammar. statcheck does not police your reporting, it helps you improve your reporting. Additionally, I would like to reverse the question: should science not care about the precision of scientific results? With all the rhetoric going on in the USA about ‘alternative facts’, I think it highlights how dangerous it is to let go of our desire to be precise in what we do. Science’s imprecision has trickle-down effects, for example in the policies that are subsequently put in place. We put all kinds of creative and financial effort into progressing our society; why should we let it be diminished by simple mistakes that can be prevented so easily? If we agree that science has to be precise in the evidence it presents, we need to take steps to make sure it is. Making a mistake is not a problem; it is all about how you deal with it.
So far the statcheck tool only checks whether the math behind the statistical calculations in published articles is wrong when null-hypothesis significance testing (NHST) has been used; this is what you refer to as reporting errors in your article from December last year, published in Behaviour Research Methods. But aren’t these findings unproblematic as long as the conclusions of the articles aren’t affected by the reporting errors?
They aren’t problematic? Who is the judge of whether errors are problematic? If you consider just statistical significance, there is still one in eight papers that contains such a problem. Moreover, all errors in reported results affect meta-analyses; is that not also problematic down the line? I find it hubristic for any individual to claim they can determine whether something is problematic or not, when many things that person does not even realize can be affected. It should be open to discussion, so information about problems needs to be shared and discussed. This is exactly what I aimed to do with the statcheck reports on PubPeer for one very specific problem.
In the article in Behaviour Research Methods you find that half of all published psychology papers that use NHST contain at least one p-value that is inconsistent with its test statistic and degrees of freedom, and that one in eight papers contains a grossly inconsistent p-value that may have affected the statistical conclusion. What does this mean? I’m not a mathematician.
You don’t need to be a mathematician to understand this. Say we have a set of eight research articles presenting statistical results with certain conclusions. Four of those eight will contain a reported result that does not match up (i.e., the p-value is inconsistent with the test statistic and degrees of freedom), but does not affect the broad strokes of the conclusion. One of those eight contains a result that does not match up and potentially nullifies the conclusion. For example, a study might conclude that a new behavioral therapy is effective at treating depression while the reported result does not actually support that conclusion. In that case the evidence for the therapy’s effectiveness is undermined, with direct clinical consequences.
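The consistency check itself is simple arithmetic: recompute the p-value from the reported test statistic and compare it to the reported p-value. statcheck does this in R for t, F, r, χ², and z tests reported in APA style; the sketch below, in Python, illustrates the idea for a z statistic only. The function names and rounding tolerance are my own simplification, not statcheck’s actual implementation.

```python
import math

def two_sided_p_from_z(z):
    """Two-sided p-value for a standard-normal test statistic."""
    return math.erfc(abs(z) / math.sqrt(2))

def check_consistency(z, reported_p, decimals=3):
    """Flag a reported p-value that does not match its test statistic.

    Recomputes p from the statistic and compares it to the reported
    value, allowing for rounding to the reported precision.
    """
    recomputed = two_sided_p_from_z(z)
    tolerance = 0.5 * 10 ** (-decimals)  # margin for rounding
    return abs(recomputed - reported_p) <= tolerance

# z = 2.00 corresponds to a two-sided p of about .0455
print(check_consistency(2.00, 0.046))  # True: consistent after rounding
print(check_consistency(2.00, 0.030))  # False: an inconsistency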
Why are these findings important?
Science is vital to our society. Science is based on empirical evidence. Hence, it is vital to our society that empirical evidence is precise and not distorted by preventable or remediable mistakes. Researchers make mistakes; that in itself is no big deal. People like to believe scientists are more objective and more precise than other humans, but we’re not. The way we build checks and balances to prevent mistakes from proliferating and propagating into (for example) policy is crucial. statcheck contributes to understanding and correcting one specific kind of mistake we can all make.
Why did you decide to run statcheck on psychology papers specifically?
statcheck was designed to extract statistical results reported as prescribed by the American Psychological Association (APA). This is one of the most standardized ways of reporting statistical results, so it makes sense to apply software developed around psychology’s reporting standards to psychology papers.
Why do you find so many statistical errors in psychology papers specifically?
I don’t think this is a problem of psychology specifically, but rather a problem of how empirical evidence is reported and how manuscripts are written.
Are psychologists not as skilled at doing statistical calculations as other scholars?
I don’t think psychologists are worse at doing statistical calculations. I think point-and-click software has made it easy for scholars to compute statistical results, but not to insert them into manuscripts reliably. Typing in those results is error-prone. I make mistakes when I’m doing my finances at home, because I have to copy the numbers. I wish I had something like statcheck for my finances. But I don’t. For scientific results, I promote writing manuscripts dynamically. This means that you no longer type in the results manually, but inject the code that contains the result. This is already possible with tools such as Rmarkdown and can greatly increase the productivity of the researcher. It has saved my skin multiple times, although you still have to be vigilant for mistakes (wrong code produces wrong results).
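Rmarkdown does this natively in R with inline code chunks; as a rough Python analogue, here is a minimal sketch of dynamic reporting, where the numbers in the sentence are generated by code rather than typed by hand (the data are made up for illustration):

```python
import statistics

# Hypothetical raw data; in a real manuscript this would be loaded
# from the study's data file, not hard-coded.
scores = [5.1, 4.8, 5.6, 5.0, 4.9, 5.3]
mean = statistics.mean(scores)
sd = statistics.stdev(scores)

# The statistics are injected into the text, so the manuscript can
# never disagree with the computation that produced them.
sentence = f"The sample mean was M = {mean:.2f} (SD = {sd:.2f})."
print(sentence)  # prints "The sample mean was M = 5.12 (SD = 0.29)."
```

If the data change, rerunning the script updates the reported numbers automatically, which removes the copy-typing step where reporting errors creep in.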
Have you run the statcheck tool on your own NHST results in the article mentioned?
Yes! This was the first thing I did, way before I ran it on other papers. Moreover, I was non-selective when I started scanning other people’s papers; I apparently even made a statcheck report that got posted on PubPeer for my supervisor (see here). He laughed, because the paper was on reporting inconsistencies and the flagged gross inconsistency was simply an example quoted in the running text. A false positive, highlighting that statcheck’s results always need to be checked by a human before concluding anything definitive.
Some researchers call Statcheck “a new form of harassment” and accuse you of being “a self appointed data police”. Can you understand these reactions?
Other researchers who received statcheck reports praise it as a good service. Researchers who study how researchers conduct research are called methodological terrorists. Any change comes with proponents and critics. Am I a self-appointed data policer? To some, maybe. To others, I am simply providing a service. I don’t chase individuals and I am not interested in that at all; I do not see myself as part of a “data police”. That people experience these reports as a reprimand highlights to me that a taboo still rests on skepticism within science. Skepticism is one of the ideals of science, so let’s aim for that.
Why do you find it necessary to send out thousands of emails to scholars around the world, informing them that their work has been reviewed and pointing out to them where they may have miscalculated?
It was not necessary; I thought it was worthwhile. Why do some scholars find it necessary to e-mail a colleague their thoughts on a paper? Because they think it is worthwhile and can help them or the original authors. Those were exactly my intentions in teaming up with PubPeer and posting those 50,000 statcheck reports.
Isn’t it necessary and important, for ethical reasons, to be able to distinguish between deliberate miscalculations and honest mistakes when you run this kind of check?
If I were making accusations of gross incompetence against the original authors, such a distinction would clearly be needed. But I did not make accusations at all. I simply stated the information available, without any normative or judging statements. Mass-scale post-publication peer review of course brings ethical problems with it, which I carefully weighed before I started posting statcheck reports with the PubPeer team. The formulation of these reports was discussed within our group, and we all agreed this was worthwhile to do.
As a journalist I can write and publish an article with one or two factual errors. This doesn’t mean the article isn’t of a generally high journalistic standard or that its content isn’t of great relevance to the public. Couldn’t you make the same argument about a scientific article? And when you catalogue these errors online, aren’t you at risk of whipping up a storm in a teacup and turning everybody’s eyes away from the actual scientific findings?
Journalists and scholars are playing different games. An offside in football is not a problem in tennis, and the comparison between journalists and scholars seems similar to me. I am not saying that an article is worthless if it contains an inconsistency; I am just saying that it is worth looking at before building new research lines on it. Psychology has wasted millions and millions of euros, dollars, and pounds chasing ephemeral effects that are totally unreasonable, as several replication projects have highlighted in recent years. Moreover, I think the general opinion of science will only improve if we are more skeptical and critical of each other, instead of trusting findings based on reputation, historical precedent, or the ease with which we can assimilate them.