We currently have a version (0.0.2) of the Ostrea lurida genome on CoGe. It consists of the 38 scaffolds longer than 80,000 bp. Below is an effort to map gonad RNA-seq data to that genome.
Two male and two female gonad libraries were mapped to the genome using TopHat in the Cyverse Discovery Environment.
Through the steps…
I moved the data in the Discovery Environment to the coge_data directory.
Will see what Expression Analysis does…
This created two files and corresponding tracks: read depth and BAM alignment.
Will crank out the other three libraries and will soon work on a rough annotation.
This isn’t really a notebook entry – it’s more of a traditional blog post.
It’s a quick summary of the frustrations and struggles I’ve encountered while trying to analyze the Olympia oyster GBS data. Hopefully it will serve as a placeholder for others to find (and avoid) some of the pitfalls I’ve encountered so far. But, mostly, this is just for me to vent…
- Using the Stacks program (on Hummingbird over the network to our server Owl) takes forever and, more importantly, consistently fails to complete the ustacks and cstacks programs.
- Using the Stacks program (on Hummingbird via external HDD connected through Firewire) takes forever (combined, process_radtags and ustacks have been running since 20160428; that’s eight days)!!! Granted, this is running on all 96 samples, but, regardless, this type of time frame is not very conducive to productivity.
- The “raw” non-demultiplexed FASTQ files supplied by BGI have an ‘N’ in the barcode in the FASTQ header lines. This prevents Stacks (and possibly Tassel – I’ll get to this in a second) from being able to perform the demultiplexing. Here’s a screen shot of what I’m talking about:
- Cyverse has a program called Tassel that should be able to handle GBS data just like ours. However, it doesn’t produce the expected output to proceed to the second step. Although I haven’t tested it, it’s possible that the problem is related to the ‘N’ in the FASTQ header barcode sequence I mentioned above. I suspect it’s related because the first step in using Tassel is demultiplexing utilizing a supplied barcode keyfile.
- Cyverse has Stacks installed, but in order to use it, someone has to build a Cyverse “app.” I’ve tried and the process is brutal. It’s poorly suited to a program like Stacks (which is really a suite of programs) that has so many command line options and, depending on your input file types (e.g. “non-standard” Illumina filenames for paired-end sequencing), requires looping over filenames to specify corresponding file pairs.
- Pyrad actually worked relatively well, but the VCF output file (for visualizing in the Integrative Genomics Viewer) has a malformed header that IGV won’t accept. Attempts at tweaking the header don’t seem to resolve the issue. Additionally, it’s not apparent from the output files whether individuals get grouped, even though there is an option to specify which individuals should be grouped together.
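A quick way to confirm the ‘N’ problem described in the list above is to tally the barcodes in the FASTQ headers. The sketch below is only an illustration: it assumes the barcode is whatever follows the last colon on each header line (as in common Illumina-style headers), which may not match the BGI files exactly, and the example records are made up.

```python
from collections import Counter


def header_barcode(header_line):
    """Return the text after the last ':' in a FASTQ header line.

    Assumption: the barcode is the final colon-separated field.
    """
    return header_line.rstrip("\n").rsplit(":", 1)[-1]


def count_barcodes_with_n(fastq_lines):
    """Tally header barcodes that contain an 'N', keyed by barcode."""
    counts = Counter()
    for index, line in enumerate(fastq_lines):
        # The header is the first line of every 4-line FASTQ record.
        if index % 4 != 0:
            continue
        barcode = header_barcode(line)
        if "N" in barcode.upper():
            counts[barcode] += 1
    return counts


# Made-up records for illustration only (not real BGI data).
example = [
    "@INSTR:1:FLOWCELL:1:1101:1000:2000 1:N:0:NACGTAC",
    "ACGT", "+", "IIII",
    "@INSTR:1:FLOWCELL:1:1101:1001:2001 1:N:0:GTTCAGA",
    "TTGA", "+", "IIII",
]
print(count_barcodes_with_n(example))
```

For a real file, the same function can be fed an open file handle instead of the example list, since it only iterates over lines.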
And, the most frustrating thing of all???!!! I just realized how to handle the problematic barcodes in the FASTQ headers!! Instead of trying to alter the FASTQ files (which I’ve been messing around with over the past few days), all I’ve needed to do this entire time is CHANGE THE BARCODE KEY FILE THAT STACKS AND/OR TASSEL USES TO HAVE A ‘N’ AT THE BEGINNING OF EACH BARCODE!
I’m going to go cry now…
Regardless of that last one, it doesn’t change the fact that Stacks is painfully slow and, at times, unreliable.
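The barcode key file fix mentioned above (adding an ‘N’ to the beginning of each barcode) is easy to script. This is a minimal sketch, not the exact procedure used here: it assumes a tab-delimited key file with the barcode in the first column, and the `column`, `sep`, and `has_header` parameters and the sample names are hypothetical placeholders for whatever layout the real key file has.

```python
def prepend_n_to_barcodes(key_lines, column=0, sep="\t", has_header=False):
    """Return key-file lines with 'N' prepended to the barcode column.

    Assumptions: one barcode per line, fields separated by `sep`,
    barcode located in field number `column` (0-based).
    """
    fixed = []
    for index, line in enumerate(key_lines):
        text = line.rstrip("\n")
        # Leave blank lines and an optional header row untouched.
        if not text.strip() or (has_header and index == 0):
            fixed.append(text)
            continue
        fields = text.split(sep)
        fields[column] = "N" + fields[column]
        fixed.append(sep.join(fields))
    return fixed


# Made-up barcodes and sample names for illustration only.
example_key = ["ACGTAC\tsample_01", "GTTCAGA\tsample_02"]
for fixed_line in prepend_n_to_barcodes(example_key):
    print(fixed_line)
```

Writing the result to a new file (rather than overwriting the original key file) keeps the unmodified barcodes around in case the headers turn out to differ between sequencing runs.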
Yesterday I uploaded v0.0.1 of the Geoduck genome to CoGe.
Now I want to start adding tracks. To do this I used CLC to create RNA-seq tracks from our male and female gonad transcriptome data.
As expected, only a small number of reads mapped, because we are limiting the genome to the 22 scaffolds longer than 100,000 bp.
One thing to point out (and it will have to be followed up on) is that many more female reads than male reads mapped back.
I took the reads data and exported it to BAM.
Then I uploaded it to CoGe.
I called this Version 1 and, interestingly, I got some cool options, so I selected them.
This included saving as a Notebook.
This finished in less than 5 minutes!
The SNP view.
Voila – we have it in a Browser.
and you can zoom in
Here we have a Notebook view
It is now public, though I’m not quite sure if there is a URL.
Everything is public so please give it a look / twirl.
We have had the data for a draft genome of Panopea generosa for a bit. Here is a quick look.
All raw data is available @ http://owl.fish.washington.edu/nightingales/P_generosa/
With a first pass assembly here.
There are over 14 million scaffolds at this point, with 22 scaffolds greater than 100,000 bp. We are using those to kick the tires of CoGe and see if it is a good portal for analysis and sharing.
There is not much to see now in the genome browser, but there should hopefully be more soon.
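Pulling out only the long scaffolds from an assembly can be sketched in a few lines of Python. This is an illustration of the idea rather than the tool actually used to build the CoGe subset; the function names and the tiny example sequences are made up, and the threshold is a parameter so the same code covers other cutoffs.

```python
def read_fasta(lines):
    """Yield (name, sequence) pairs from FASTA-formatted lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:], []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def long_scaffolds(lines, min_length=100_000):
    """Return the scaffolds strictly longer than min_length bp."""
    return [(name, seq) for name, seq in read_fasta(lines)
            if len(seq) > min_length]


# Tiny made-up example; a real run would pass an open FASTA file handle.
example = [">scaffold_1", "ACGT", "ACGT", ">scaffold_2", "AC"]
for name, seq in long_scaffolds(example, min_length=4):
    print(name, len(seq))
```

Because `read_fasta` is a generator, only the scaffolds that pass the length filter are held in memory, which matters when the input has millions of short scaffolds.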