# Data Management – SRA Submission of Ostrea lurida GBS FASTQ Files

Prepared a Sequence Read Archive (SRA) submission for archiving our Olympia oyster genotyping-by-sequencing (GBS) data in NCBI. This is in preparation for submission of the manuscript we’re putting together.

I followed my outline/guideline for navigating the SRA submission process, as it’s a bit of a pain in the neck. Glad my notes were actually useful!

The following two files are currently being uploaded via FTP; the process will take about 3hrs, as each file is ~18GB in size:

They are being submitted under the following accession numbers (note: a final accession number will be provided once this is publicly available; I will update this post when that happens):

# Manuscript Writing – The “Nuances” of Using Authorea

I’m currently trying to write a manuscript covering our genotyping-by-sequencing data for the Olympia oyster using the Authorea.com platform and am encountering some frustrating issues. Here’s what’s happening (and the ways I’ve managed to get around the problems).

PROBLEM: Authorea spits out a browser-crashing “unresponsive script” message (actually, lots and lots of them; clicking “Stop script” or “Continue” just results in additional messages) in Firefox (haven’t tried any other browsers). This renders the browser inoperable and I have to force quit. It doesn’t happen all of the time, so it’s hard to pinpoint what triggers this.

SOLUTION: Edit documents in Git/GitHub. I have my Authorea manuscript linked to a GitHub repo, which allows me to write without using Authorea.com. This is how I’ll be doing my writing the majority of the time anyway, but I would like to use Authorea.com to insert and manage citations…

PROBLEM: Authorea remains in a perpetual “saving…” state after inserting a citation. It also renders the page strangely, with HTML <br></br> tags (see the “Methods” section in the screen cap below).

SOLUTION: Type additional text somewhere, anywhere. This is an OK solution, but is particularly annoying if I just want to go through and add citations and have no intentions of doing any writing.

PROBLEM: Multi-author citations don’t get formatted with “et al.” By default, Authorea inserts all citations using the following LaTeX format:

\cite{Elshire_2011}

Result: (Elshire 2011).

This is a problem because this reference has multiple authors and should be written as: (Elshire et al., 2011).

SOLUTION: Change citation format to:

\citep{Elshire_2011}

Other citation formatting options can be found here (including multiple citations within one set of parentheses, and referring in-text author name with only publication year in parentheses):

How to add and manage citations and references in Authorea

PROBLEM: When a citation no longer exists in the manuscript, it still persists in the bibliography.

SOLUTION: This is a known bug with no current fix. For now, I have to delete stale entries from the bibliography by hand (or maybe figure out a way to do it programmatically)…
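A rough sketch of what the programmatic route could look like, working on the GitHub copy of the manuscript rather than on Authorea itself. It assumes the manuscript text uses \cite-style commands and that the bibliography is a plain BibTeX file; the function names are just illustrative, and the regular expressions are simple approximations rather than a full LaTeX/BibTeX parser.

```python
import re


def cited_keys(tex_text):
    """Return the set of citation keys used in cite-style commands."""
    keys = set()
    # Matches \cite{...}, \citep{...}, \citet{...}, etc., with optional [..] arguments.
    pattern = r"\\cite[a-zA-Z]*\*?(?:\[[^\]]*\])*\{([^}]*)\}"
    for match in re.finditer(pattern, tex_text):
        # A single command may hold several comma-separated keys.
        keys.update(k.strip() for k in match.group(1).split(",") if k.strip())
    return keys


def bib_keys(bib_text):
    """Return the set of entry keys defined in a BibTeX file's text."""
    # Captures the key in entries such as "@article{Elshire_2011,".
    return set(re.findall(r"@\w+\s*\{\s*([^,\s]+)\s*,", bib_text))


def unused_keys(tex_text, bib_text):
    """Return bibliography keys that are never cited, sorted alphabetically."""
    return sorted(bib_keys(bib_text) - cited_keys(tex_text))
```

To use it, read every manuscript section file into one string, read the .bib file into another, and pass both to `unused_keys()`. The returned keys are candidates for deletion and are worth checking by eye before removing anything.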

PROBLEM: Cannot click-and-drag some references from Mendeley (haven’t tested other reference managers) without getting an error. To my knowledge, the BibTeX is valid, as it appears to be the same formatting as other references that can be inserted via the click-and-drag method. There are some references it won’t work for…

SOLUTION: Use the search bar in the citation insertion dialogue box. Not as convenient and slows down the workflow for citation insertion, but it works…

# Data Management – Integrity Check of Final BGI Olympia Oyster & Geoduck Data

After completing the downloads of these files from BGI, I needed to verify that the downloaded copies matched the originals. Below is a Jupyter Notebook detailing how I verified file integrity via MD5 checksums. It also highlights the importance of doing this check when working with large sequencing files (or just large files in general), as a few of them had mismatched MD5 checksums!
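For reference, here is a minimal Python sketch of the same idea. It is not the code from the notebook; the function names are illustrative, and it assumes the provider's checksums have already been collected into a dictionary mapping each file path to its expected MD5 string.

```python
import hashlib


def md5sum(path, chunk_size=1024 * 1024):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Chunked reads keep memory use low for multi-gigabyte files.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_mismatches(expected):
    """Return the paths whose computed MD5 differs from the expected value.

    `expected` maps file path -> expected MD5 hex string.
    """
    return [
        path
        for path, checksum in expected.items()
        if md5sum(path) != checksum.strip().lower()
    ]
```

Any path returned by `find_mismatches()` is a file that should be downloaded again and re-checked.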

Although the notebook is embedded below, it might be easier viewing via the notebook link (hosted on GitHub).

At the end of the day, I had to re-download some files, but all the MD5 checksums match and these data are ready for analysis:

Final Ostrea lurida genome files

Final Panopea generosa genome files

Jupyter Notebook: 20161214_docker_BGI_data_integrity_check.ipynb

# DNA Isolation – Ostrea lurida DNA for PacBio Sequencing

In an attempt to improve upon the partial genome assembly we received from BGI, we will be sending DNA to the UW PacBio core facility for additional sequencing.

Isolated DNA from mantle tissue from the same Ostrea lurida individual used for the BGI sequencing efforts. Tissue was collected by Brent & Steven on 20150812.

Used the E.Z.N.A. Mollusc Kit (Omega) to isolate DNA from two separate 50mg pieces of mantle tissue according to the manufacturer’s protocol, with the following changes:

• Samples were homogenized with plastic, disposable pestle in 350μL of ML1 Buffer
• Incubated homogenate at 60°C for 1.5hrs
• No optional steps were used
• Performed three rounds of 24:1 chloroform:IAA treatment
• Eluted each in 50μL of Elution Buffer and pooled into a single sample

Quantified the DNA using the Qubit dsDNA BR Kit (Invitrogen). Used 1μL of DNA sample.

Concentration = 326ng/μL (Quant data is here [Google Sheet]: 20161214_gDNA_Olurida_qubit_quant)

Yield is good and we have more than enough (~5μg is required for sequencing) to proceed with sequencing.
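A quick back-of-the-envelope check of the yield, as a small Python sketch. It assumes the full 2 x 50μL of eluate was recovered after pooling (the real volume is a bit lower, since some buffer stays on the column and 1μL went to the Qubit), so the total is an upper estimate. The function names are illustrative.

```python
def total_yield_ug(concentration_ng_per_ul, volume_ul):
    """Convert a concentration (ng/uL) and a volume (uL) to total mass in micrograms."""
    return concentration_ng_per_ul * volume_ul / 1000


def volume_for_mass_ul(target_ng, concentration_ng_per_ul):
    """Volume (uL) needed to deliver a target mass (ng) at a given concentration."""
    return target_ng / concentration_ng_per_ul


concentration = 326      # ng/uL, from the Qubit reading
pooled_volume = 2 * 50   # uL; assumes both 50 uL elutions were fully recovered

print(total_yield_ug(concentration, pooled_volume))      # estimated total DNA, in ug
print(round(volume_for_mass_ul(500, concentration), 2))  # uL needed to load ~500 ng
```

Under that assumption the pooled sample holds roughly 32.6μg of DNA, several times the ~5μg needed.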

Evaluated gDNA quality (i.e. integrity) by running ~500ng (1.5μL) of sample on 0.8% agarose, low-TAE gel stained with ethidium bromide.

Used 5μL of O’GeneRuler DNA Ladder Mix (ThermoFisher).

Results:

Overall, the gel looks OK. A fair amount of smearing, but a strong, high molecular weight band is present. The intensity of the smearing is likely due to the fact that the gel is overloaded for this particular well size. If I had used a broader comb and/or loaded less DNA, the band would be more defined and the smearing would be less prominent.

Will submit sample to the UW PacBio facility tomorrow!

We received info to download the final data and genome assembly files for geoduck and Olympia oyster from BGI.

In total, the downloads took a little over three days to complete!

The notebook detailing how the files were downloaded is below, but it should be noted that I had to strip the output cells because the output from the download command made the file too large to upload to GitHub, and the size of the notebook file would constantly crash the browser/computer that it was opened in. So, the notebook below is here for posterity.

# Data Analysis – Initial O.lurida Fst Determination from GBS Data

Finally running some analysis on the output from my PyRad analysis on 20160727.

I’m following Katherine Silliman’s Jupyter notebook (2bRAD Subset Population Structure Analysis.ipynb) as a guide.

The initial analysis (which isn’t much) is in the Jupyter notebook below. The analysis will be continued on a later date.

Jupyter notebook: 20161117_docker_oly_vcf_analysis.ipynb

I’ve embedded the notebook below, but it’s much easier to view the actual file linked above, since the embedded version wraps the many lengthy commands/filenames.

# Data Management – Tracking O.lurida FASTQ File Corruption

UPDATE 20170104 – These two corrupt files have been replaced with non-corrupt files.

Sean identified an issue with one of the original FASTQ files provided to us by BGI. Additionally, Steven had (unknowingly) identified the same corrupt file, as well as a second corrupt file in the set of FASTQ files. The issue is discussed here: https://github.com/sr320/LabDocs/issues/334

Steven noticed the two files when he ran the program FASTQC and two files generated no output (but no error message!).

The two files in question are:

• 151118_I137_FCH3KNJBBXX_L5_wHAXPI023905-96_1.fq.gz
• 151114_I191_FCH3Y35BCXX_L2_wHAMPI023991-66_2.fq.gz
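Since FASTQC failed silently, a direct integrity test of the gzip files is a useful complement. The sketch below is not what was run in the notebook; it is a minimal, assumed approach that decompresses each file to the end and reports whether that succeeds. The function name is illustrative.

```python
import gzip
import zlib


def gzip_is_intact(path, chunk_size=1024 * 1024):
    """Return True if the gzip file decompresses cleanly to the end, else False."""
    try:
        with gzip.open(path, "rb") as handle:
            # Read and discard the data; only the success of decompression matters.
            while handle.read(chunk_size):
                pass
    except (OSError, EOFError, zlib.error):
        # OSError covers bad gzip headers, failed CRC checks and unreadable files;
        # EOFError covers truncated files; zlib.error covers a corrupt compressed stream.
        return False
    return True
```

Running `gzip_is_intact()` over every `.fq.gz` file in the directory would flag truncated or damaged archives. It only tests the compression layer, so a file that decompresses cleanly could still contain malformed FASTQ records.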

This post is an attempt to document where things went wrong, but having glanced through this data a bit already, it won’t provide any answers.

I originally downloaded the data on 20160127 to my home folder on Owl (this is detailed in the Jupyter notebook in that post) and generated/compared MD5 checksum values. The values matched at that time.

So, let’s investigate a bit further…

Launch Docker container

docker run -p 8888:8888 -v /Users/sam/data/:/data -v /Users/sam/owl_home/:/owl_home -v /Users/sam/owl_web/:/owl_web -v /Users/sam/gitrepos/LabDocs/jupyter_nbs/sam/:/jupyter_nbs -it 0ba43904567e

The command allows access to Jupyter Notebook over port 8888 and makes my Jupyter Notebook GitHub repo and my data files accessible to the Docker container.

Once the container was running, I started Jupyter Notebook with the following command inside the Docker container:

jupyter notebook

This command is configured in the Docker container to launch a Jupyter Notebook without a browser on port 8888.

Jupyter notebook file: 20161117_docker_oly_genome_fastq_corruption.ipynb

I’ve embedded the notebook below, but it’s much easier to view the actual file linked above, since the embedded version wraps the many lengthy commands/filenames.

# Computing – Retrieve data from Amazon EC2 Instance

I had an existing instance that still had data on it from my PyRad analysis on 20160727 that I needed to retrieve.

Logged into Amazon AWS via the web interface and started my existing instance (via the Actions > Instance State > Start menu). After the instance started and generated a new public IP address, I SSH’d into the instance:

ssh -i "/full/path/to/bioinformatics.pem" ubuntu@instance.public.ip.address

NOTE: I needed the full path to the PEM file! Tried multiple times using a relative path (e.g. ~/Documents/bioinformatics.pem) and received error messages that the file did not exist and “Permission denied (publickey)”.

Changed to the directory with the PyRAD analysis and created a tarball to speed up eventual download from the EC2 instance to my local computer:

tar -cvzf 20160715_pyrad_analysis.tar.gz /home/ubuntu/data/analysis/

After compression, I used secure copy to copy the file from the EC2 instance to my local computer:

scp -i "/full/path/to/bioinformatics.pem" ubuntu@instance.public.ip.address:/home/ubuntu/data/20160715_pyrad_analysis.tar.gz /Volumes/toaster/sam/

This didn’t work initially because I attempted to transfer the file using Hummingbird (instead of my computer). The SSH connection kept timing out. The reason for this was that I hadn’t previously used Hummingbird to connect to the EC2 instance and Hummingbird’s IP address wasn’t listed in the Security Groups table as being allowed to connect. I made that change using the Amazon AWS web interface:

Once transfer was complete, I terminated the EC2 instance and the corresponding data volume.

# Oyster Sampling – Olympia Oyster OA Populations at Manchester

I helped Katherine Silliman with her oyster sampling today from her ocean acidification experiment with Olympia oysters (Ostrea lurida) at the Kenneth K. Chew Center for Shellfish Research & Restoration, which is housed at the NOAA Northwest Fisheries Science Center at Manchester in a partnership with the Puget Sound Restoration Fund (PSRF). We sampled the following tissues and stored in 1mL RNAlater:

• ctenidia (C)
• mantle (M)

When there was sufficient ctenidia tissue, an additional sample was stored in 75% ethanol for potential microbial analysis.

Tissue was collected from two oysters from each of the following oyster populations:

• British Columbia (BC)
• California (CA)
• Oregon (OR)

Oysters were sampled from each of the following tanks:

• 1A
• 2A
• 3A
• 4A
• 1B
• 2B
• 3B
• 4B

Tubes were labeled in the following fashion:

1. Population & Tank (e.g. OR3B)
2. Tag#
3. Tissue

If no tag was present on the oyster, the oyster was assigned a number (beginning at 150 and increased sequentially) and photographed with a ruler for future measurement. White colored tags were written with the number followed by the letter ‘W’ (e.g. 78W) – no tag color info was recorded for other tag colors.

Additionally, gonad developmental stage was roughly assessed: ripe, kinda ripe, or not ripe.

All info was recorded by Katherine in her notepad. All samples were retained by Katherine (not sure where she stored them).

Utensils were flame sterilized between oysters and gloves/work surfaces were washed with a 10% bleach solution between oysters.

Here are a few pics from the day: