# Data Management – SRA Submission Oly GBS Batch Submission

An earlier attempt at submitting these files failed.

I re-uploaded the failed files (indicated in my previous notebook entry linked above) and tried again.

It failed again, despite having successfully uploaded just minutes before.

I re-uploaded that “missing” file and tried again.

This time, it succeeded (and no end-of-stream error for the 1SN_1A file!)!

Will post here with the SRA accession number once it goes live!

# Computing – Oly BGI GBS Reproducibility; fail?

OK, so things have improved since the last attempt at getting this BGI script to run and demultiplex the raw data.

I played around with the index.lst file format (based on the error I received last time, it seemed like a good possibility that the file formatting was incorrect) and actually got the script to run to completion! Granted, it took over 16hrs (!!), but it completed!

See the Jupyter notebook link below.

Results:

Well, although the script finished and kicked out all the demultiplexed FASTQ files, the contents of the FASTQ files don’t match (the read counts differ between these results and the BGI files) the original set of demultiplexed files. I’m not entirely sure if this is to be expected or not, since the script allows for a single nucleotide mismatch when demultiplexing. Is it possible that the mismatch could be interpreted slightly differently each time this is run? I’m not certain.

Theoretically, you should get the same results every time…

Maybe I’ll re-run this again over the weekend and see how the results compare to this run and the original BGI demultiplexing…

Jupyter notebook (GitHub): 20170314_docker_Oly_BGI_GBS_demultiplexing_reproducibility.ipynb

Jupyter notebook (may be easier to view in GitHub link above):

# Computing – Oly BGI GBS Reproducibility Fail

Since we’re preparing a manuscript that relies on BGI’s manipulation/handling of the genotype-by-sequencing data, I attempted to could reproduce the demultiplexing steps that BGI used in order to perform the SNP/genotyping on these samples.

The key word in the above sentence is “attempted.” Ugh, what a massive waste of time it turned out to be. I’ve contacted BGI to get some help on this.

In the meantime, here’s a brief (actually, not as brief as I’d like) rundown of my struggles.

The demultiplexing software that BGI used is something called “iTools” which is bundled in this GitHub repo: Resqtools

To demutliplex, they ran a script called: split.sh

The script seems fairly straightforward. Here is what it contains:

iTools Fqtools splitpool
-Index index.lst
-Flag enzyme.txt
-MisMatch
-OutDir split


It tells the iTools program to use the Fqtools tool “splitpool” to operate on a pair of gzipped FASTQ files. It also utilizes an index file (index.lst) which contains all the barcodes needed to identify, and separate, the individual samples that were combined prior to sequencing.

The first bump in the road is the -Flag enzyme.txt portion of the code. BGI did not provide me with this file. I recently requested them to send me it (or its contents, since I suspected it was only a single line text file). They sent me the contents of the file:

CAGC
CTGC

The next problem is neither of those two sequences are the recognition site for the enzyme that was (supposedly) used: ApeKI. The recognition site for ApeKI is: GCWGC

Regardless, I decided to see if I could reproduce the demultiplexing using the info they’d provided me.

I cloned the Resqtools repo, changed into the Reseqtools/iTools directory and typed make.

This resulted in an error informing me that it could not find boost/spirit/core.hpp

Tried to run make in the Reseqtools/iTools directory and got the same error. Realized iTools might not be searching the system \$PATH (this turned out to be correct), so I moved the contents of the Boost folder to the iTools, ran make and got the same error. Turns out, the newest version of Boost doesn’t have that core.hpp file any more. Looking at the iTools documentation, iTools was built around Boost 1.44. OMG…

Downloaded Boost 1.44 and went through the same steps as above. This eliminated the missing core.hpp error!

But, of course, led to another error. The error:

"Threading support unavaliable: it has been explicitly disabled with BOOST_DISABLE_THREADS"

That was related to something with newer versions of the GCC compiler (this is, essentially, built into the computer; it’s not worth trying to install/use old versions of GCC) trying to work with old versions of Boost. Found a patch for a config file here: libstdcpp3.hpp.patch

I made the appropriate edits to the file as shown in that link and ran make and it almost worked!

The current error is:

./src/Variants/soapsv-v1.02/include.h:15:16: fatal error: gd.h: No such file or directory

I gave up and contacted BGI to see if they can get me a functional version of iTools…

# FASTQC – Oly BGI GBS Raw Illumina Data Demultiplexed

Last week, I ran the two raw FASTQ files through FastQC. As expected, FastQC detected “errors”. These errors are due to the presence of adapter sequences, barcodes, and the use of a restriction enzyme (ApeKI) in library preparation. In summary, it’s not surprising that FastQC was not please with the data because it’s expecting a “standard” library prep that’s already been trimmed and demultiplexed.

However, just for comparison, I ran the demultiplexed files through FastQC. The Jupyter notebook is linked (GitHub) and embedded below. I recommend viewing the Jupyter notebook on GitHub for easier viewing.

Results:

Pretty much the same, but with slight improvements due to removal of adapter and barcode sequences. The restriction site still leads to FastQC to report errors, which is expected.

Links to all of the FastQC output files are linked at the bottom of the notebook.

Jupyter notebook (GitHub): 20170306_docker_fastqc_demultiplexed_bgi_oly_gbs.ipynb

# FASTQC – Oly BGI GBS Raw Illumina Data

In getting things prepared for the manuscript we’re writing about the Olympia oyster genotype-by-sequencing data from BGI, I felt we needed to provide a FastQC analysis of the raw data (since these two files are what we submitted to the NCBI short read archive) to provide support for the Technical Validation section of the manuscript.

Below, is the Jupyter notebook I used to run the FastQC analysis on the two files. I’ve embedded for quick viewing, but it might be easier to view the notebook via the GitHub link.

Results:

Well, I realized that running FastQC on the raw data might not reveal anything all too helpful. The reason for this is that the adaptor and barcode sequences are still present on all the reads. This will lead to over-representation of these sequences in all of the samples, which, in turn, will skew FastQC’s intepretation of the read qualities. For comparison, I’ll run FastQC on the demultiplexed data provided by BGI and see what the FastQC report looks like on trimmed data.

However, I’ll need to discuss with Steven about whether or not providing the FastQC analysis is worthwhile as part of the “technical validation” aspect of the manuscript. I guess it can’t hurt to provide it, but I’m not entirely sure that the FastQC report provides any real information regarding the quality of the sequencing reads that we received…

Jupyter notebook (GitHub): 20170301_docker_fastqc_nondemultiplexed_bgi_oly_gbs.ipynb

# Data Analysis – Oly GBS Data Using Stacks 1.37

This analysis ran (or, more properly, was attempted) for a couple of weeks and failed a few times. The failures seemed to be linked to the external hard drive I was reading/writing data to. It continually locked up, leading to “Segmentation fault” errors.

We’ve replaced the external with a different one in hopes that it’ll be able to handle the workload. Will be attempting to re-run Stacks with the new external hard drive.

I’m posting the Jupyter notebook here for posterity.

Jupyter notebook: 20160428_Oly_GBS_STACKS.ipynb

# GBS Frustrations

This isn’t really a notebook entry – it’s more of a traditional blog post.

It’s a quick summary of the frustrations and struggles I’ve encountered while trying to analyze the Olympia oyster GBS data. Hopefully it will serve as a place holder for others to find (and avoid) some of the pitfalls I’ve encountered so far. But, mostly, this is just for me to vent…

1. Using the Stacks program (on Hummingbird over the network to our server Owl) takes forever and, more importantly, consistently fails to complete the ustacks and cstacks programs.

2. Using the Stacks program (on Hummingbird via external HDD connected through Firewire) takes forever (combined, process_radtags and ustacks has been running since 20160428; that’s eight days)!!! Granted, this is running on all 96 samples, but, regardless, this type of time frame is not very conducive to productivity.

3. The “raw” non-demultiplexed fastq files supplied by BGI have a ‘N’ in the barcode in the FASTQ header lines. This prevents Stacks (and possibly Tassel – I’ll get to this in a second) from being able to perform the demultiplexing. Here’s a screen shot of what I’m talking about:

1. Cyverse has a program called Tassel that should be able to handle GBS data just like ours. However, it doesn’t produce the expected output to proceed to the second step. Although I haven’t tested it, it’s possible that the problem is related to the ‘N’ in the FASTQ header barcode sequence I mentioned above. I suspect it’s related because the first step in using Tassel is demultiplexing utilizing a supplied barcode keyfile.
2. Cyverse has Stacks installed, but in order to use it, someone has to build a Cyverse “app.” I’ve tried and the process is brutal. It’s not conducive for a program (that is really a suite of programs) like Stacks that has so many command line options and, depending on your input file types (e.g. “non-standard” Illumina filenames for paired-end sequencing), requires looping over filenames to specify corresponding file pairs.

3. Pyrad actually worked relatively well, but the VCF output file (for visualizing in the Integrative Genomics Viewer) has an ill-formed header that IGV won’t accept. Attempts at tweaking the header don’t seem to resolve the issue. Additionally, it’s not apparent in the output files if individuals get grouped, even though there is an option to specify which individuals should be grouped together.

4. And, the most frustrating thing of all???!!!  I just realized how to handle the problematic barcodes in the FASTQ headers!! Instead of trying to alter the FASTQ files (which I’ve been messing around with over the past few days), all I’ve needed to do this entire time is CHANGE THE BARCODE KEY FILE THAT STACKS AND/OR TASSEL USES TO HAVE A ‘N’ AT THE BEGINNING OF EACH BARCODE!

I’m going to go cry now…

Regardless of that last one, it doesn’t change the fact that Stacks is painfully slow and, at times, unreliable.

# Data Management – O.lurida Raw BGI GBS FASTQ Data

BGI had previously supplied us with demultiplexed GBS FASTQ files. However, they had not provided us with the information/data on how those files were created. I contacted them and they’ve given us the two original FASTQ files, as well as the library index file and corresponding script they used for demultiplexing all of the files. The Jupyter (iPython) notebook below updates our checksum and readme files in our server directory that’s hosting the files: http://owl.fish.washington.edu/nightingales/O_lurida/20160223_gbs/

See Jupyter Notebook below for processing details.

Jupyter Notebook: 20160427_Oly_GBS_data_management.ipynb

# Data Analysis – Oly GBS Data from BGI Using Stacks

UPDATE (20160418) : I’m posting this more for posterity, as Stacks continually locked up at both the “ustacks” and “cstacks” stages. These processes would take days to run (on the full 96 samples) and then the processes would become “stuck” (viewed via the top command in OS X).

Have moved on to trying PyRAD in the meantime.

Need to get the GBS from BGI data analyzed.

Below is the Jupyter (iPython) notebook I ran to perform this analysis.

Jupyter (iPython) Notebook: 20160406_Oly_GBS_STACKS.ipynb

Jupyter Notebook Viewer: 20160406_Oly_GBS_STACKS

# Data Management – O. lurida genotype-by-sequencing (GBS) data from BGI

We received a hard drive from BGI on 20160223 (while I was out on paternity leave) containing the Ostrea lurida GBS data.

Briefly, three sets (i.e. populations) of Olympia oyster tissue was collected from oysters raised in Oyster Bay and were sent to BGI for DNA extraction and GBS. A total of 23 individuals from each of the following three populations were sequenced (a grand total of 96 samples):

• 1HL – (Hood Canal, Long Spit)
• 1NF – (North Sound, Fidalgo Bay)
• 1SN – (South Sound, Oyster Bay)

An overview of this project can be viewed on our GitHub Olympia oyster wiki.

Data was copied from the HDD to the following location on Owl (our server): http://owl.fish.washington.edu/nightingales/O_lurida/20160223_gbs/

The data was generated from paired-end Illumina sequencing, so there are two FASTQ files for each individual.

The files were analyzed to create a MD5 checksum, perform read counts, and create a readme (markdown format) file. This was performed in a Jupyter/iPython notebook (see below).

IMPORTANT NOTE: The directory where this data is housed was renamed AFTER the Jupyter notebook was run. As such, the directory listed above will not be seen in the Jupyter notebook.

Jupyter notebook file: 20160314_Olurida_GBS_data_management.ipynb

Notebook Viewer: 20160314_Olurida_GBS_data_management.ipynb