Why I content mine

The prevalence of data fabrication in science has so far been investigated only via self-report surveys, which are subject to severe biases and assess prevalence only amongst researchers. Statistical results reported in a paper can be analyzed for overconsistency, which can indicate potential data fabrication: humans are bad at generating random sequences (e.g., many people judge the probability of tails to be larger than .5 if previous flips resulted in heads only), and this is reflected in fabricated data.
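The coin-flip intuition can be made concrete with a small sketch. It is only an illustration of how "too regular" a sequence looks to a statistical test, not the method applied to papers. It uses the Wald-Wolfowitz runs test with a normal approximation; the function name `runs_test_z` and the "H"/"T" encoding are assumptions made for this example.

```python
import math

def runs_test_z(sequence):
    """Return the z-score of the Wald-Wolfowitz runs test.

    `sequence` is assumed to be a string containing only "H" and "T".
    A large positive z means more runs (alternations) than chance predicts,
    the pattern people tend to produce when they invent "random" flips.
    """
    n1 = sequence.count("H")
    n2 = sequence.count("T")
    n = n1 + n2
    if n1 == 0 or n2 == 0:
        raise ValueError("the sequence must contain both outcomes")

    # A run is a maximal stretch of identical outcomes.
    runs = 1 + sum(1 for a, b in zip(sequence, sequence[1:]) if a != b)

    # Mean and variance of the number of runs under randomness.
    expected = 2 * n1 * n2 / n + 1
    variance = 2 * n1 * n2 * (2 * n1 * n2 - n) / (n ** 2 * (n - 1))
    return (runs - expected) / math.sqrt(variance)

# A perfectly alternating sequence has far too many runs to be plausible.
print(runs_test_z("HT" * 10))
```

A strongly positive z-score flags over-alternation, a strongly negative one flags clustering; either is a reason to look more closely, not proof of fabrication.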

To apply these statistical methods to detect data fabrication, the data reported in a paper must first be extracted. To this end, I turned to content mining of research articles. Raw and summary data are reported throughout a paper: in the main text, in the footnotes, in figures, and in tables. Some results can be readily extracted by both humans and machines when they are reported in a standardized format, but machine reading is far more efficient and widens the scope of what can be covered. Other data, such as data shown in figures, necessitate machine reading because extracting them requires pixel-level precision.
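As a minimal sketch of what extracting standardized results from text can look like, the snippet below pulls t-test reports out of a sentence with a regular expression. The reporting pattern (for example "t(28) = 2.20, p = .036"), the name `extract_t_tests`, and the sample sentence are assumptions for illustration; real articles vary much more in how they report results.

```python
import re

# Assumed reporting format: "t(df) = value, p = value" (also "p <" or "p >").
T_TEST = re.compile(
    r"\bt\s*\(\s*(?P<df>\d+(?:\.\d+)?)\s*\)\s*=\s*(?P<t>-?\d*\.?\d+)\s*,\s*"
    r"p\s*(?P<cmp>[<>=])\s*(?P<p>\d*\.\d+)"
)

def extract_t_tests(text):
    """Return one dict per t-test report found in `text`."""
    results = []
    for match in T_TEST.finditer(text):
        results.append({
            "df": float(match.group("df")),
            "t": float(match.group("t")),
            "comparison": match.group("cmp"),
            "p": float(match.group("p")),
        })
    return results

# Hypothetical sentence, written for this example.
sample = (
    "The groups differed, t(28) = 2.20, p = .036, "
    "but not on age, t(28) = -0.31, p = .76."
)

for result in extract_t_tests(sample):
    print(result)
```

The same idea extends to other standardized reports (F, r, chi-square) by adding patterns, which is where machine reading gains its efficiency over manual extraction.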

With the extracted results, I looked to test whether the results in a paper were potentially fabricated. I looked to do this for all papers in a field, to estimate the prevalence of data fabrication via an innovative method that decreases the bias due to self-report and allows us to assess the prevalence of data fabrication amongst research articles.