The National Science Foundation requires a data management plan to accompany all proposals submitted to the NSF from January 18, 2011 onward (NSF 2016). Despite this organization-wide mandate, most researchers and graduate students receive little formal training in best practices for data management or data sharing, or in the tools and resources available for archiving data and metadata. In this post, we answer some common questions you may have regarding data management plans and provide a brief review of free, open-source tools and resources available for handling data.
So, let’s get started!
What is the purpose of a data management plan?
To answer this question, it is best to think about the lifespan of data and its corresponding metadata (Figure 1; Michener et al. 1997). The amount of information known about a dataset and its metadata peaks at the time of data collection and publication. Shortly after, specific details about data points, collection methods, and so on begin to be lost. Unforeseen events (e.g., fire, flood, hard drive failure) during a researcher's career can also destroy very valuable information about a dataset.
Imagine viewing a dataset a year, 5 years, or 10 years after you initially collected the data: how many specific details would you remember without consulting your notes? And if you happen to be like me, you might spend a good hour or two just trying to decipher your own notes! Furthermore, would you remember the exact statistical tests you used for each analysis? This is why data management is extremely important. It allows researchers to safeguard their data, metadata, and any important information regarding data collection against the peril of time, so that future researchers can use a dataset with the same knowledge and confidence the data collector had at the point of collection.
A data management plan is simply a document that shows that you, as a researcher, have thought about these issues and have considered the best course of action to safeguard your data over time.
What exactly is metadata?
Simply put, metadata is data about data. It is the information about a dataset that allows a researcher with no knowledge of a particular study to look at the dataset and understand how it was collected, what the codes and symbols represent, what caveats are associated with the data, and so on. It should include an in-depth description of why and how the data were collected; this is akin to the methods section of a research article but may be more detailed.
Each column header of the dataset should have an explanation. You, as the data collector, may very well know what TD_mm means, but an outsider will definitely not. Similarly, the unit of observation should not be taken for granted. A column may be labeled Length, which is easy enough to interpret on its own, but is the length in question in millimeters, centimeters, meters, or something completely different, such as inches? And lastly, always explain what a 0 data point means. Sometimes a 0 is used to indicate a lack of data; other times, 0 is an actual measured value. It is very important to differentiate the two.
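To make this concrete, here is a minimal sketch of a machine-readable data dictionary written in Python. The column names, units, and file name are hypothetical, chosen only to echo the examples above; the point is that every column gets a description, a unit, and an explicit missing-value code so that a 0 is never ambiguous:

```python
import csv

# Hypothetical data dictionary: one row per column in the dataset.
# An explicit missing-value code keeps "no data" distinct from a true 0.
data_dictionary = [
    {"column": "site_id",   "description": "Unique field site identifier",           "unit": "none",        "missing_code": "NA"},
    {"column": "TD_mm",     "description": "Trunk diameter at breast height",        "unit": "millimeters", "missing_code": "NA"},
    {"column": "Length",    "description": "Leaf length",                            "unit": "centimeters", "missing_code": "NA"},
    {"column": "seedlings", "description": "Seedling count (0 means none observed)", "unit": "count",       "missing_code": "NA"},
]

# Write the dictionary next to the dataset so the two travel together.
with open("data_dictionary.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["column", "description", "unit", "missing_code"])
    writer.writeheader()
    writer.writerows(data_dictionary)
```

Because the dictionary is a plain CSV sitting next to the data, any future user (or script) can read it without special software.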
Where do I store my data?
It has become increasingly common to deposit data into online repositories for preservation. Lab and field notebooks constitute the primary record of data. These are then transferred to a secondary source, typically a computer. Most researchers would probably stop here, but there is an extra step that helps ensure the safety of your data over time: depositing it into an online repository for future access. Just as your lab or field notebook may be lost in a move, a fire, or a flood, a hard drive may crash and fail. You may do what professors in the past have suggested and keep 3 or 4 copies of your data in different places (your home, your office, a safe, etc.), but this still does not guarantee the safety of your data. An online data repository is a much safer place to store your data: not only is it likely to be backed by several servers around the world, it also allows for universal access, increases exposure, and supports long-term preservation of your data.
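If you do keep multiple local copies, it helps to be able to verify that they have not silently diverged or corrupted over time. Here is a minimal sketch in Python (the file name is a placeholder) that fingerprints a data file with a SHA-256 checksum; matching digests across your copies mean the files are bit-for-bit identical:

```python
import hashlib

def sha256_checksum(path: str, chunk_size: int = 65536) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Placeholder usage: run against each copy (home, office, external drive)
# and compare the printed digests.
print(sha256_checksum("field_data_2016.csv"))
```

Many online repositories perform this kind of integrity (fixity) checking automatically, which is one more reason to prefer them.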
Depending on your discipline and field, there are different repositories for you to consider. Below are the repositories that I have personal experience with. A simple search for "online data repository" plus your discipline will yield a list of results from which you can choose the best home for your data.
Social and Behavioral Sciences—the Inter-university Consortium for Political and Social Research (ICPSR) is hosted by the University of Michigan and is likely the largest social science repository (http://www.icpsr.umich.edu/icpsrweb/landing.jsp). If you want or need help curating your data, the ICPSR charges a fee based on your needs. If, however, you have your metadata in place, your data is clean and ready to be uploaded, and you want other researchers to be able to freely access your data, use openICPSR (https://www.openicpsr.org), the free version of ICPSR.
Ecological, Environmental, and Earth Sciences—Check out DataONE (https://www.dataone.org), and the National Ecological Observatory Network (NEON) (http://www.neonscience.org).
Computer Science and Statistical Code—GitHub (https://github.com).
Multidisciplinary—Figshare (https://figshare.com) is a great place to store data from any discipline. It also allows you to deposit publications, figures, posters, and presentations. Best of all, it can create a DOI for any of these items.
If you have any data management tips or tools that you have found helpful, please share them in the comment section below. Best of luck on creating your next data management plan!
References
Michener, W.K., Brunt, J.W., Helly, J.J., Kirchner, T.B., and Stafford, S.G. 1997. Nongeospatial metadata for the ecological sciences. Ecological Applications 7(1): 330–342.
NSF. 2016. Dissemination and Sharing of Research Results. Available at: https://www.nsf.gov/bfa/dias/policy/dmp.jsp.