Validly closed data must be secured

tl;dr: if you cannot share your data for valid reasons, simply not sharing is insufficient unless you also create a way to keep those data secure.

I am a big proponent of publicly sharing data (skip the advocacy/propaganda by skipping to paragraph 4), because “interpretations may change, but raw data will remain unchanged forever”. You might disagree with the interpretation and want to reanalyze the data, or in the future, when analysis methods have improved, an interpretation can be strengthened by better analyses. Maybe the data will be used to find answers for other questions (because “the best thing to do with your data will be thought of by someone else”). Maybe the data will be combined with another dataset to research something. The reuse potential of research data is massive — and legitimate.

I find sharing data convincing from a moral perspective because of the aforementioned reasons (amongst others), but there is also a utilitarian aspect to it. If you play the citation game, there is some evidence you get more citations if you share your data. If you share well documented data, it will also save you time when you receive questions about your results. Additionally, anticipating such questions also serves as a reflection exercise, strengthening your own research process. These aren’t all the reasons, but just some examples.

However, data sharing means nothing when data are not properly documented or prepared for (re)use. Dumping a dataset online with the variables V1, V2, through V99 will mean nothing without a clear codebook. Moreover, if data are shared in formats tied to specific software (such as .xlsx), it might be non-trivial to open the file (now and in the future). PDFs with tables that contain data undermine the value of sharing as well, because those data are not machine readable (have you ever tried importing a PDF of data into any analysis program? Exactly). Machine readable also means that a spreadsheet shouldn’t just be visually appealing and easy to understand for humans (argh nested column headers!) but should be readily importable into an analysis program, both by yourself and by anyone else, without having to manually adjust or convert the files. From experience, reusing data without documentation can be trivial and rather quick (half an hour; best case scenario), non-trivial and excruciatingly slow (several days; worst case scenario), or simply infeasible. The problem is that you never know beforehand whether you or others can reuse your data, unless you set up your data to be shared (even with just your future self). Lastly, sharing a Dropbox link is not sustainable (when you move the file the link breaks). Just go ahead and deposit it in Zenodo (for example).
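To make "machine readable with a codebook" a bit more concrete, here is a minimal sketch in Python. Everything in it is made up for illustration: the column names, the descriptions, the two data rows, and the helper names `to_csv` and `undocumented_columns` are my assumptions, not a standard. The idea is simply one flat header row, plain CSV, and a check that every column is described in the codebook.

```python
import csv
import io

# Hypothetical codebook: every column gets a human-readable description.
CODEBOOK = {
    "participant_id": "Anonymized participant identifier",
    "age_years": "Age at time of measurement, in years",
    "reaction_time_ms": "Mean reaction time across trials, in milliseconds",
}

# Hypothetical data rows, one flat record per participant (no nested headers).
ROWS = [
    {"participant_id": "p001", "age_years": 34, "reaction_time_ms": 512.3},
    {"participant_id": "p002", "age_years": 27, "reaction_time_ms": 478.9},
]


def to_csv(rows, fieldnames):
    """Serialize a list of dicts as plain CSV text with a single header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def undocumented_columns(csv_text, codebook):
    """Return the column names in the CSV that the codebook does not describe."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [name for name in reader.fieldnames if name not in codebook]


# The data file and the codebook file that would be deposited together.
data_csv = to_csv(ROWS, list(CODEBOOK))
codebook_csv = to_csv(
    [{"variable": name, "description": text} for name, text in CODEBOOK.items()],
    ["variable", "description"],
)
```

Running `undocumented_columns` on a file whose header is just V1, V2 returns both names, which is exactly the situation you want to catch before depositing anything.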

I know that there are valid reasons to (temporarily) not share data. Transparency cannot and should not be absolute,* otherwise it becomes problematic. For example, if a dataset contains medical data that can (easily) be used to identify a person and their medical conditions, sharing is unethical for both moral reasons (e.g., autonomy of the individual is harmed by indirectly disclosing personal information that is not the researcher’s to [indirectly] disclose) and utilitarian reasons (e.g., the individual might experience harmful consequences due to disclosure). A harmful consequence in this specific example of identifiable medical data might be that insurance providers save the medical conditions of their clients as a way of (partly) refusing a future claim, or, even simpler, the discomfort of people realizing that their personal information is now public. There are many scenarios possible, too many to fully grasp.

But let us assume a researcher has data that are identifiable, with specific medical conditions, which that researcher does not share. Has this researcher taken sufficient measures to prevent leaks of personal data by simply not sharing them? At face value, no; securing something requires active behaviors that prevent access, whereas not sharing is a passive behavior that merely does not promote access. Digging deeper, whether the data have been properly secured would require answering several operational questions about data handling, such as (in no particular order and not exhaustive):

  1. Who is allowed to have (access to) the data?
  2. Who handled (parts of) the data at any stage during the collection process?
  3. Who could have copies of the data, but is not covered by 1?
  4. Who should not have (access to) the data?
  5. Who has access to the data?
    • Does everyone in 1?
    • Does anyone from 4?
    • Are those from 3 that are not also in 4 a threat to further disclosure of the data?
  6. How are the data stored?
    • Which security measures are taken? (e.g., encryption, physical security layers)
    • Are security measures clearly agreed upon by everyone in 1?
  7. What measures are taken if (part of) the data are leaked in some (unforeseeable) way?
    • Can damage to the individuals be mitigated?
    • Will individuals be compensated in any way if mitigating is impossible?
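
The "who" questions above (1, 3, 4, and 5) boil down to comparing a few lists of people, which can be sketched with sets in Python. This is only bookkeeping under my own assumptions: the function name `audit_access`, the dictionary keys, and the example people are hypothetical, and running it secures nothing by itself. It just makes the mismatches visible.

```python
def audit_access(allowed, forbidden, possible_copies, has_access):
    """Compare who should and who does have the data, using set operations.

    allowed:         people from question 1
    possible_copies: people from question 3 who may hold a copy
    forbidden:       people from question 4
    has_access:      people who currently have access (question 5)
    """
    # Question 3 asks only about copy holders who are not covered by 1.
    unaccounted = possible_copies - allowed
    return {
        # "Does everyone in 1?" -> allowed people who lack access
        "allowed_without_access": allowed - has_access,
        # "Does anyone from 4?" -> people with access who should not have it
        "forbidden_with_access": has_access & forbidden,
        # Question 3 -> possible copy holders not covered by 1
        "unaccounted_copies": unaccounted,
        # Those from 3 that are not also in 4: a threat to be judged case by case
        "needs_review": unaccounted - forbidden,
    }


# Hypothetical example, purely for illustration.
report = audit_access(
    allowed={"pi", "postdoc"},
    forbidden={"insurer"},
    possible_copies={"postdoc", "former_assistant", "transcription_service"},
    has_access={"pi", "former_assistant"},
)
```

In this made-up case the postdoc is allowed but locked out, nobody forbidden has access, and two parties who may hold copies (a former assistant and a transcription service) are not accounted for in the list of allowed people.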

When data cannot be shared for valid reasons, securing those data is essential to prevent unwarranted and unintended access. This is the kicker: I think most researchers who don’t share data and say the data cannot or should not be shared (e.g., for privacy reasons) don’t think about the problem deeply enough to really call what they’re doing (i.e., not sharing the data) a way of securing the data.

Securing something requires more active behavior than simply refraining from sharing. If such researchers do not secure their data, they fall victim to their own reasoning, because their own rhetoric stresses how important it is that the data do not become available to others in any way. In such a situation, it seems to me that complacency with respect to data management is more at play in deciding not to share data than true concern about data security.

For example, if you cannot share data due to valid privacy concerns, but subsequently store those data in your Dropbox account or have an unencrypted USB stick with the data lying around, you’re not securing the data properly. Put differently, if there is no clear threat model for the data, then there would also be no initial reason to keep the data private**. If there is a clear threat, but the researcher has not set up a threat model for it, that researcher is exposing the data to the unnecessary risk of leaking despite the threat. Either way, data require attention: either by documenting the data for sharing if there is no threat, or by creating secure protocols to prevent leaks in the case of valid ethical threats.

All researchers need to think about how to manage their data, but many currently don’t. Some proffer valid reasons to not share data, but fail to accept the consequences of their own rhetoric, calling into question how seriously researchers take ethical considerations with respect to data integrity.


[*] Some researchers hide too easily behind this, so if anyone refers to this blog as a reason why they won’t share, ask them for their exact arguments.

[**] However, there might be unknown unknowns.
