# GitHub Curation

Updated a couple of GitHub Wikis:

Created a new repo in the RobertsLab Organization GitHub account with a wiki to provide an overview of how to use the Hyak (mox) computing node. This was lightly modified from what Sean already had in his personal repo.

As a quick test, I updated all of the .md files in the sr320/LabDocs/code directory to conform to GitHub’s newest interpretation of Markdown headers. Headers (represented by a series of ‘#’) now require a space between the ‘#’ characters and the subsequent text. I used the following command in bash:

for i in *.md; do sed -i.bak 's/^##*/& /' "$i"; done

The code works as follows:

• Run a for loop over all .md files in the directory.

• Use sed to edit the files in place, keeping backups with a .bak extension: -i.bak (this command structure is needed for Mac OS X).

• 's/^##*/& /': Performs a substitution by matching lines beginning (^) with a pound symbol (#), followed by zero or more additional pound symbols (#*), then substituting the same pattern that was matched plus a trailing space (& ). Requiring at least one # leaves lines without headers untouched, and because the pattern is anchored to the start of the line it can match at most once per line, so no g flag is needed.

Since this worked, I’ll probably run this through all of the md files in all of our various repos to quickly and easily fix header formatting issues.
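As a quick sanity check, the substitution can be tried on a throwaway file (using the one-or-more-# form `s/^##*/& /`, which leaves lines without a leading # untouched):

```shell
# Create a demo file with a space-less header and a plain line.
printf '##Header\nplain text\n' > demo.md

# Edit in place, keeping a .bak backup (works with both GNU and BSD sed).
# Note: a header that already has a space after the #s would gain a second one.
sed -i.bak 's/^##*/& /' demo.md

cat demo.md
```

The header line becomes `## Header` while the plain-text line is unchanged.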

Working on updating the Genome-sequencing-December-2016-(UW-PacBio) wiki, but need to work out the kinks on an easy, documentable way to rename and move some files around in order to make file organization compliant with our data management plan (DMP).

Current strategy:

• Generate MD5 checksums for fastq files for each of the SMRT cell runs.

• Copy file names from the .xml file in the top level of each SMRT cell run folder to an array.

• Use parameter substitution (in bash) to strip path and suffix from each index of the array (results likely stored in a secondary or tertiary array).

• Use the bash find command to copy the filtered_subreads.fastq.gz from each SMRT cell run folder to the owl/nightingales/O_lurida directory, prepending the corresponding stripped filename from the final array to each fastq file name.

• Generate new MD5 checksums on the copied files and compare them to the original MD5 checksums. This confirms two things: 1 – the data were not corrupted during copying; 2 – the new filenames correspond to the correct, original filtered_subreads.fastq.gz files (renaming a file doesn’t alter its MD5 checksum).

• Archive the original SMRT cell run folders (which contain a ton of metadata files).
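The strategy above can be sketched in bash. Everything here is a stand-in: the directory layout, the `metadata.txt` file (the real runs store the original movie name in an .xml file in each SMRT cell run folder), and the file names are simulated so the script is self-contained.

```shell
#!/usr/bin/env bash
set -euo pipefail

workdir=$(mktemp -d)
dest="$workdir/O_lurida"          # stand-in for owl/nightingales/O_lurida
mkdir -p "$dest"

# Simulate two SMRT cell run folders, each with a fastq and a name record.
for run in run_A run_B; do
  mkdir -p "$workdir/$run"
  printf 'reads for %s\n' "$run" | gzip > "$workdir/$run/filtered_subreads.fastq.gz"
  printf '/path/to/%s_movie.bas.h5\n' "$run" > "$workdir/$run/metadata.txt"
done

for run in "$workdir"/run_*; do
  # Original checksum (md5sum on Linux; use `md5 -q` on Mac OS X).
  orig_md5=$(md5sum "$run/filtered_subreads.fastq.gz" | awk '{print $1}')

  # Strip path and suffix with bash parameter substitution.
  fullpath=$(cat "$run/metadata.txt")
  name=${fullpath##*/}     # drop leading path -> run_A_movie.bas.h5
  name=${name%.bas.h5}     # drop suffix       -> run_A_movie

  # Copy with the stripped name prepended to the fastq file name.
  cp "$run/filtered_subreads.fastq.gz" "$dest/${name}_filtered_subreads.fastq.gz"

  # Renaming doesn't alter the MD5, so matching checksums confirm both an
  # uncorrupted copy and a correct name-to-file pairing.
  new_md5=$(md5sum "$dest/${name}_filtered_subreads.fastq.gz" | awk '{print $1}')
  [ "$orig_md5" = "$new_md5" ] && echo "OK: $name"
done
```

The same loop structure should carry over once the real .xml parsing step replaces the `metadata.txt` stand-in.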

# Validly closed data must be secured

tl;dr: If you cannot share your data for valid reasons, simply not sharing is insufficient; you must also actively keep those data secure.

I am a big proponent of publicly sharing data (skip the advocacy/propaganda by jumping to paragraph 4), because “interpretations may change, but raw data will remain unchanged forever”. You might disagree with the interpretation and want to reanalyze the data, or in the future, when analysis methods have improved, an interpretation can be strengthened by better analyses. Maybe the data will be used to find answers for other questions (because “the best thing to do with your data will be thought of by someone else”). Maybe the data will be combined with another dataset to research something. The reuse potential of research data is massive — and legitimate.

I find sharing data convincing from a moral perspective because of the aforementioned reasons (amongst others), but there is also a utilitarian aspect to it. If you play the citation game, there is some evidence you get more citations if you share your data. If you share well documented data, it will also save you time when you receive questions about your results. Additionally, anticipating such questions also serves as a reflection exercise, strengthening your own research process. These aren’t all the reasons, but just some examples.

However, data sharing means nothing when data are not properly documented or prepared for (re)use. Dumping a dataset online with the variables V1, V2, through V99 will mean nothing without a clear codebook. Moreover, if data are shared in proprietary formats (such as .xlsx), it might be non-trivial to open the file readily (now and in the future). PDFs with tables that contain data undermine the value of sharing as well, because those data are not machine readable (have you ever tried importing a PDF of data into any analysis program? Exactly). Machine readable also means that a spreadsheet shouldn’t just be visually appealing and easy to understand for humans (argh, nested column headers!) but should be readily importable into an analysis program, both by yourself and by anyone else, without having to manually adjust or convert those files. From experience, reusing data without documentation can be trivial and rather quick (half an hour; best case scenario), non-trivial and excruciatingly slow (several days; worst case scenario), or simply infeasible. The problem is you never know beforehand whether you or others can reuse your data, unless you set up your data to be shared (even with just your future self). Lastly, sharing a Dropbox link is not sustainable (when you move the file, the link breaks). Just go ahead and deposit it in Zenodo (for example).

I know that there are valid reasons to (temporarily) not share data. Transparency cannot and should not be absolute,* otherwise it becomes problematic. For example, if a dataset contains medical data that can (easily) be used to identify a person and his/her medical conditions, sharing is unethical for both moral reasons (e.g., autonomy of the individual is harmed by indirectly disclosing personal information that is not the researcher’s to [indirectly] disclose) and utilitarian reasons (e.g., the individual might experience harmful consequences due to disclosure). A harmful consequence in this specific example of identifiable medical data might be that of insurance providers saving the medical conditions of their clients as a way of (partly) refusing a future claim, or, even simpler, the discomfort of people realizing that their personal information is now public. There are many scenarios possible, too many to fully grasp.

But let us assume a researcher has data that are identifiable, with specific medical conditions, which that researcher does not share. Has this researcher taken sufficient measures to prevent leaks of personal data by simply not sharing them? At face value, no; securing something requires active behaviors that prevent access, whereas not sharing is a passive behavior that merely does not promote access. Digging deeper, whether the data have been properly secured would require answering several operational questions about data handling, such as (in no particular order and not exhaustive):

1. Who is allowed to have (access to) the data?
2. Who handled (parts of) the data at any stage during the collection process?
3. Who could have copies of the data, but are not addressed in 1?
4. Who should not have (access to) the data?
5. Who has access to the data?
• Does everyone in 1?
• Does anyone from 4?
• Are those from 3 that are not also in 4 a threat to further disclosure of the data?
6. How are the data stored?
• Which security measures are taken? (e.g., encryption, physical security layers)
• Are security measures clearly agreed for everyone in 1?
7. What measures are taken if (part of) the data are leaked in some (unforeseeable) way?
• Can damage to the individuals be mitigated?
• Will individuals be compensated in any way if mitigating is impossible?

When data cannot be shared for valid reasons, securing those data is essential to prevent unwarranted and unintended access. This is the kicker: I think most researchers who don’t share data and say they cannot or should not be shared (e.g., for privacy reasons) don’t think about the problem deeply enough to really call what they’re doing (i.e., not sharing the data) a way of securing the data.

Securing something requires more active behavior than simply refraining from sharing. If such researchers do not secure their data, they fall victim to their own reasoning because their own rhetoric stresses the importance that data should not become available to others in any way. In such a situation, it seems to me that complacency with respect to data management is more at play in deciding not to share data than true concern about data security.

For example, if you cannot share data due to valid privacy concerns, but subsequently store those data in your Dropbox account or have an unencrypted USB drive with the data lying around, you’re not securing the data properly. As such, if there is no clear threat model for the data, then there would also be no initial reason to keep the data private**. If there is a clear threat model, but the researcher has not set up safeguards against it, that researcher is exposing the data to an unnecessary risk of leaking despite the threat. Either way, data require attention: either by documenting the data for sharing if there is no threat, or by creating secure protocols to prevent leaks in the case of valid ethical threats.

All researchers need to think about how to manage their data, but many currently don’t. Some proffer valid reasons not to share data, but fail to accept the consequences of their own rhetoric, calling into question how seriously researchers take ethical considerations with respect to data integrity.

Notes:

[*] Some researchers hide too easily behind this, so if anyone refers to this blog as a reason why they won’t share, ask them for their exact arguments.

[**] However, there might be unknown unknowns.

# DNA Methylation Quantification – Acropora cervicornis (Staghorn coral) DNA from Javier Casariego (FIU)

Used the MethylFlash Methylated DNA Quantification Kit (Colorimetric) from Epigentek to quantify methylation in these coral DNA samples.

All samples were run in duplicate *except* 2h Block 1, due to insufficient DNA.

The following samples were used in a 1:10 dilution (2uL DNA : 18uL NanoPure H2O), due to their relatively high concentrations, to ensure accurate pipetting:

• 72h Block 4
• D14 Block 1
• D14 Block 2
• D14 Block 3
• D14 Block 4
• D14 Block 5
• D14 Block 6
• D14 Block 8
• D14 Block 10

All samples were diluted to a final concentration of 9.645ng/uL (154.24ng total; 17.6uL) in NanoPure water, which equates to 77.12ng of DNA per assay replicate. These numbers were chosen based on the sample with the lowest concentration.

The following samples were used in their entirety:

• 2h Block 8
• D35 Block 8

Calculations were added to the spreadsheet provided by Javier (Google Sheet): A.cervicornis_DNA_Extractions(May_2017).xlsx

The spreadsheet became overly complicated because I initially forgot to account for the need to run each sample in duplicate.

The kit reagent dilutions were as follows:

• Diluted ME1: 52mL of ME1 + 464mL of *distilled* water
• Diluted ME4: 10uL of ME4 + 10uL of TE Buffer (pH=8.0; made by me on 20130408).
• Standard curve: Prepped per instruction manual, with double volumes for two plates.
• Diluted ME5: 50uL/well x 152 wells = 7600uL; 7600uL/1000 = 7.6uL; 7.6uL ME5 + 7592.4uL Diluted ME1
• Diluted ME6: 50uL/well x 152 wells = 7600uL; 7600uL/2000 = 3.8uL; 3.8uL ME6 + 7596.2uL Diluted ME1
• Diluted ME7: 50uL/well x 152 wells = 7600uL; 7600uL/5000 = 1.52uL; 1.52uL ME7 + 7598.48uL Diluted ME1
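The working-reagent volumes follow one pattern (total volume ÷ dilution factor = stock volume; the remainder is Diluted ME1). A quick sketch to double-check the arithmetic:

```shell
# Recompute the Diluted ME5/ME6/ME7 volumes from the plate layout:
# 50 uL/well x 152 wells = 7600 uL total per working solution.
total=$((152 * 50))
for entry in "ME5 1000" "ME6 2000" "ME7 5000"; do
  set -- $entry   # $1 = reagent name, $2 = dilution factor
  awk -v n="$1" -v t="$total" -v f="$2" 'BEGIN {
    s = t / f
    printf "Diluted %s: %g uL stock + %g uL Diluted ME1\n", n, s, t - s
  }'
done
```

The printed volumes match the list above (7.6, 3.8, and 1.52 uL of stock, respectively).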

All diluted solutions were stored on ice for the duration of the procedure.

The remaining Diluted ME1 solution was stored at 4C (FTR 209), and is stable for 6 months, per the manufacturer’s instructions.

See the Results section below for plate layouts.

Plates were read at 450nm on the Seeb Lab Victor 1420 Plate Reader (Perkin Elmer) and the amount of DNA methylation was determined.

Results:

Individual sample methylation quantification (Google Sheet): A.cervicornis_DNA_Extractions(May_2017).xlsx

Plate Reader Output File Plate #1 (Google Sheet): 20170511_coral_DNA_methylation_plate01.xls

Plate Reader Output File Plate #2 (Google Sheet): 20170511_coral_DNA_methylation_plate02.xls

I’m not familiar with the experimental design, so I’m not going to spend time handling any of the in-depth analysis at this point in time. However, here’s the background on how methylation quantification and percent methylation were determined.

1. Mean absorbance (450nm) was determined for all samples and standard curve samples. It’s important to note that the standard deviation between replicates was not evaluated and there appears to be consistent variability between samples, but I’m not certain how much variation is “acceptable” with an assay of this nature.

2. The mean absorbance values of the standard curve samples were plotted against their corresponding DNA amounts, and a linear trendline was fitted to the points.

3. Per the manufacturer’s recommendations, the four points (including the zero point) that yielded the best linear fit (i.e. best R^2 value) were used, and the slope of the best fit line through those four points was determined.

4. This slope was then utilized in the equation provided by the manufacturer (see pg. 8 of the MethylFlash Kit manual).
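The fitting in steps 2–3 is ordinary least squares. A sketch with hypothetical standard points (x = DNA amount in ng, y = made-up OD450 readings, not the actual plate data):

```shell
# Least-squares slope, intercept, and R^2 for a set of standard points.
# The numbers below are illustrative only.
points="0 0.05
1 0.25
2 0.48
5 1.10"
fit=$(echo "$points" | awk '
  { n++; sx += $1; sy += $2; sxx += $1*$1; syy += $2*$2; sxy += $1*$2 }
  END {
    slope     = (n*sxy - sx*sy) / (n*sxx - sx*sx)
    intercept = (sy - slope*sx) / n
    r = (n*sxy - sx*sy) / sqrt((n*sxx - sx*sx) * (n*syy - sy*sy))
    printf "slope=%.4f intercept=%.4f R2=%.4f", slope, intercept, r*r
  }')
echo "$fit"
```

In practice this would be computed for each candidate four-point subset (always including the zero standard), keeping the subset with the highest R^2, and the winning slope would then go into the manufacturer’s equation.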

# DNA Quantification – Acropora cervicornis (Staghorn coral) DNA from Javier Casariego (FIU)

I quantified the three samples (listed below) that I SpeedVac’d yesterday using the Roberts Lab Qubit 3.0.

• 2h Block 1
• 2h Block 8
• D35 Block 8

Quantification was performed using the dsDNA Broad Range Kit.

Used 1uL of each sample.

Results:

One sample (2h Block 1) is still slightly too dilute to use the recommended total amount of DNA for the methylation assay (100ng), but it still falls well within the recommended range for the assay. Will proceed with the methylation assay for all samples.

Values were added to the spreadsheet provided by Javier (Google Sheet): A.cervicornis_DNA_Extractions(May_2017).xlsx

Qubit output file (Google Sheet): 20170511_qubit_A_cervicornis_DNA

# DNA Concentration – Acropora cervicornis (Staghorn coral) DNA from Javier Casariego (FIU)

Three samples (of the 62 total) that were quantified earlier today had concentrations too low for use in the methylation assay:

• 2h Block 1
• 2h Block 8
• D35 Block 8

These samples were dried to completion in a SpeedVac.

They will be allowed to rehydrate O/N in 10uL of Buffer EB (Qiagen) and will be re-quantified tomorrow morning.

# DNA Quantification – Acropora cervicornis (Staghorn coral) DNA from Javier Casariego (FIU)

DNA samples received yesterday were quantified using the Roberts Lab Qubit 3.0 to improve quantification accuracy (samples provided by Javier were quantified via NanoDrop, which generally overestimates DNA concentration) prior to performing methylation assessment.

Quantification was performed using the dsDNA Broad Range Kit.

Used 1uL of each sample.

Results:

Three samples are too dilute for immediate use in the MethylFlash Methylated DNA Quantification Kit (Colorimetric) – max sample volume is 8uL. Will have to concentrate them (will likely use SpeedVac to prevent sample loss).

Values were added to the spreadsheet provided by Javier (Google Sheet): A.cervicornis_DNA_Extractions(May_2017).xlsx

Qubit output file (Google Sheet): 20170510_qubit_A_cervicornis_DNA

# DNA Quantification – RLO viability DNased RNA

I previously DNased RNA I isolated from water filters that were part of the RLO viability experiment that Lisa and the Capstone students are conducting. I checked for residual gDNA carryover via qPCR, and all of the samples that were intended for dosing the abalone came up positive. This is likely due to the high quantity of algae that was co-filtered with the potential RLOs, over-saturating the RNAzol with DNA and resulting in the gDNA carryover.

In turn, I think the DNase treatment was insufficient for the quantity of carryover DNA.

I am planning on re-DNasing those samples, but want to quantify any residual DNA present to make sure that the samples aren’t still too concentrated for the DNase.

Samples were quantified using the Roberts Lab Qubit 3.0 and the Qubit dsDNA HS (high sensitivity) reagents, using 1uL of sample.

Results:

Residual DNA is still present, but at levels well below the maximum that the DNase treatment can handle (10ug). I will redo the DNase treatment on these samples. The spreadsheet with sample concentrations is linked, and embedded below.

# qPCR – CDFW White Abalone Samples (RLOv DNA helicase)

Standard curve was from 20151106.

All samples were run in duplicate.

Master mix calcs are here; since I ran these with the other samples, the master mix used was part of the other project indicated in the spreadsheet (Google Sheet): 20170420 – qPCR RLOv DNA Helicase

Plate layout, cycling params, etc. can be found in the qPCR Report (see Results).

Baseline threshold was manually set to 580.5, as previously determined.

Results:

qPCR Report (PDF): Sam_2017-04-20 07-50-18_CC009827.pdf
qPCR Data File (CFX): Sam_2017-04-20 07-50-18_CC009827.pcrd

Standard curve looks good and all samples provided come up positive for RLOv DNA helicase.

I’ve compiled the raw data of both the WSN qPCR and this in this Google Sheet: 20170420_CDFW_White_Ab_qPCR_summary

Here’s a summary table of the results (copy numbers are mean copies from qPCR replicates):

| SAMPLE | RLOV DNA HELICASE (COPIES) | WSN1 (COPIES) |
| --- | --- | --- |
| SF16-76_DG-1 | 165318.58 | 169.25 |
| SF16-76_DG-2 | 47839.81 | 20.70 |
| SF16-76_PE-1 | 1036697.17 | 633.75 |
| SF16-76_PE-2 | 46763.60 | 296.83 |
| SF17-17 | 117.29 | 2.16 |

NOTE: The WSN1 copy number for SF17-17 is below the accepted detection limit of the qPCR assay (i.e. < 3 copies).

Will share my notebooks and spreadsheet with Blythe at CDFW.

Amplification Plots

Green = Standard Curve

Blue = Samples

Red = No template control

# DNA Quantification – Black Abalone DNA (Black Ab Exp. 2)

Lisa recently isolated DNA from the following samples:

• 08:13-05 (Black Ab exp 2)
• 08:13-18 (Black Ab exp 2)
• 08:13-24 (Black Ab exp 2)
• 08:13-25 (Black Ab exp 2)

I quantified the samples using the Roberts Lab Qubit 3.0 with the Qubit ds High Sensitivity kit. Used 1uL of each sample.

Samples were stored in designated boxes in -20C in Rm. 240.

Results:

Qubit output (Google Sheet): 20170413_DNA_quantification_qubit

| SAMPLE | CONCENTRATION (ng/uL) |
| --- | --- |
| 08:13-05 | 62.4 |
| 08:13-18 | 0.536 |
| 08:13-24 | 0.454 |
| 08:13-25 | 8.8 |

NOTE: The entirety of sample 08:13-24 will be provided to Stan Langevin for high-throughput sequencing.

# qPCR – CDFW White Abalone Samples (WSN1)

The samples that CDFW sent us earlier are intended for checking for the presence of the RLOv (phage), but I figured it would be prudent to verify that they were positive for the RLO as well. I ran these samples concurrently with some other samples I had to test with the withering syndrome qPCR assay.

Standard curve was p18RK7 from 20161128.

All samples were run in duplicate.

Master mix calcs are here; since I ran these with the other samples, the master mix used was part of the other project indicated in the spreadsheet (Google Sheet): 20170406_qPCR_WSN1_capstone

Plate layout, cycling params, etc. can be found in the qPCR Report (see Results).

Baseline threshold was manually set to 580, as previously determined by Lisa.

Results:

Standard curve looks good and all samples provided come up positive for RLO.

qPCR Report (PDF): Sam_2017-04-06%2011-36-53_CC009827_CDFW_white_ab_WSN1.pdf
qPCR Data File (CFX): Sam_2017-04-06%2011-36-53_CC009827_CDFW_white_ab_WSN1.pcrd

Amplification Plots

Green = Standard Curve

Blue = Samples

Red = No template control

Standard Curve