Genome Assembly – Olympia Oyster Illumina & PacBio Using PB Jelly w/BGI Scaffold Assembly

Yesterday, I ran PB Jelly using Sean’s Platanus assembly, but that didn’t produce an assembly because PB Jelly was expecting gaps in the Illumina reference assembly (i.e. scaffolds, not contigs).

Re-ran this using the BGI Illumina scaffolds FASTA.

Here’s a brief rundown of how this was run:

See the Jupyter Notebook for full details of run (see Results section below).

Results:

OK! This seems to have worked (and it was quick, like less than an hour!), as it actually produced a FASTA file! Will run QUAST with this and some assemblies to compare assembly stats. Have added this assembly to our Olympia oyster genome assemblies table.

Jupyter Notebook (GitHub): 20171114_emu_pbjelly_BGI_scaffold.ipynb

Genome Assembly – Olympia Oyster Illumina & PacBio Using PB Jelly w/Platanus Assembly

Sean had previously attempted to run PB Jelly, but ran into some issues running on Hyak, so I decided to try this on Emu.

Here’s a brief rundown of how this was run:

See the Jupyter Notebook for full details of run (see Results section below).

Results:

This completed very quickly (like, just a couple of hours). I also didn’t experience the woes of multimillion temp file production that killed Sean’s attempt at running this on Mox (Hyak).

However, it doesn’t seem to have produced an assembly!

Looking through the output, it seems as though it didn’t produce an assembly because there weren’t any gaps to fill in the reference. This makes sense (in regards to the lack of gaps in the reference Illumina assembly) because I used the Platanus contig FASTA file (i.e. not a scaffolds file). I didn’t realize PB Jelly was just designed for gap filling. Guess I’ll give this another go using the BGI scaffold FASTA file and see what we get.

Jupyter Notebook (GitHub): 20171113_emu_pbjelly_22mer_plat.ipynb

Genome Assembly – Olympia oyster PacBio Canu v1.6

I decided to run Canu myself, since documentation for Sean’s Canu run is a bit lacking. Additionally, it looks like he specified a genome size of 500Mbp, which is probably too small. For this assembly, I set the genome size to 1.9Gbp (based on the info in the BGI assembly report, using 17-mers for calculating genome size), which is probably on the large size.

Additionally, I remembered we had an old PacBio run that we had been forgetting about and thought it would be nice to have incorporated into an assembly.

See all the messy details of this in the Jupyter Notebook below, but here’s the core info about this Canu assembly.

PacBio Input files (available on Owl/nightingales/O_lurida:

m170308_163922_42134_c101174252550000001823269408211742_s1_p0_filtered_subreads.fastq.gz                                                               m170308_230815_42134_c101174252550000001823269408211743_s1_p0_filtered_subreads.fastq.gz
m170301_225711_42134_c101174162550000001823269408211763_s1_p0_filtered_subreads.fastq.gz

\$time canu \
useGrid=false \
-p 20171009_oly_pacbio \
-d /home/data/20171018_oly_pacbio_canu/ \
genomeSize=1.9g \
correctedErrorRate=0.075 \
-pacbio-raw m*

Results:

Well, this took a LONG time to run; a bit over two days!

The report file contains some interesting tidbits. For instance:

• Unitgigging calculates only 1.84x coverage
• Trimming removed >5 billion (!!) bases: 867850 reads 5755379456 bases (reads with no overlaps, deleted)
• Unitigging unassembled: unassembled: 479693 sequences, total length 2277137864 bp

I’ll compare this Canu assembly against Sean’s Canu assembly and see how things look.

Report file (text file): http://owl.fish.washington.edu/Athaliana/20171018_oly_pacbio_canu/20171018_oly_pacbio.report

Contigs Assembly (FASTA): http://owl.fish.washington.edu/Athaliana/20171018_oly_pacbio_canu/20171018_oly_pacbio.contigs.fasta

Complete Canu output directory: http://owl.fish.washington.edu/Athaliana/20171018_oly_pacbio_canu/

Jupyter Notebook (GitHub): 20171018_docker_oly_canu.ipynb

Genome Assembly – Olympia oyster Redundans/Canu vs. Redundans/Racon

Decided to compare the Redundans using Canu as reference and Redundans using Racon as reference. Both reference assemblies were just our PacBio data.

Jupyter notebook (GitHub): 20171005_docker_oly_redundans.ipynb

Notebook is also embedded at the end of this post.

Results:

It should be noted that the paired reads for each of the BGI mate-pair Illumina data did not assemble, just like last time I used them:

• 160103_I137_FCH3V5YBBXX_L3_WHOSTibkDCABDLAAPEI-62_2.fq.gz
• 160103_I137_FCH3V5YBBXX_L3_WHOSTibkDCACDTAAPEI-75_2.fq.gz
• 160103_I137_FCH3V5YBBXX_L4_WHOSTibkDCABDLAAPEI-62_2.fq.gz
• 160103_I137_FCH3V5YBBXX_L4_WHOSTibkDCACDTAAPEI-75_2.fq.gz

Redundans with Canu is better, suggesting that the Canu assembly is the better of the two PacBio assemblies (which we had already suspected).

QUAST comparison using default settings:

QUAST comparison using –scaffolds setting:

Genome Assembly – Olympia Oyster Redundans with Illumina + PacBio

Redundans should assemble both Illumina and PacBio data, so let’s do that.

Sean had previously performed this – twice actually:

It wasn’t entirely clear how he had run Redundans the first time and the second time he used his Platinus contig FASTA file as the necessary reference assembly when running Redundans.

Since he had produced a good looking assembly from PacBio data using Canu, I decided to give Redundans a rip using that assembly.

I then compared all three Redundans runs using QUAST.

Jupyter notebook (GitHub): 20171004_docker_oly_redundans.ipynb

Notebook is also embedded at the bottom of this notebook entry (but, it should be easier to view at the link provided above).

Of note, is that Redundans didn’t find any alignments for the paired reads for each of the BGI mate-pair Illumina data:

• 160103_I137_FCH3V5YBBXX_L3_WHOSTibkDCABDLAAPEI-62_2.fq.gz
• 160103_I137_FCH3V5YBBXX_L3_WHOSTibkDCACDTAAPEI-75_2.fq.gz
• 160103_I137_FCH3V5YBBXX_L4_WHOSTibkDCABDLAAPEI-62_2.fq.gz
• 160103_I137_FCH3V5YBBXX_L4_WHOSTibkDCACDTAAPEI-75_2.fq.gz

First, I ran QUAST with the default settings:

Using that Canu assembly with Redundans certainly seems to results in a better assembly.

Decided to run QUAST with the –scaffolds option to see what happened:

The scaffolds with the “Ns” removed from them are appended with “_broken” – meaning the scaffolds were broken apart into contigs. Things are certainly cleaner when using the --scaffolds option, however, as far as I can tell, QUAST doesn’t actually generate a FASTA file with the “_broken” scaffolds!

Assembly Comparisons – Olympia oyster genome assemblies

— UPDATE 20171009 —

Having run through this a bunch of times now, I realized that the analysis below incorrectly identifies the outputs from Sean’s Redundans runs. The correct output from each of those runs should be the “scaffolds.reduced.fa” FAST files. The “contigs.fa” files that I linked to below are actually the assemblies produced by other programs; which are required as an input for Redudans.

I recently completed an assembly of the UW PacBio sequencing data using Racon and wanted some assembly stats, as well as a way to compare this assembly to the assemblies Sean had completed.

Additionally, Steven recently performed an assembly comparison and I noticed he got some odd results. Specifically, of the three assemblies he compared (PacBio x 1, Illumina x 2), both of the Illumina assemblies had a large quantity of “Ns” in the assemblies. This didn’t seem right and the comparison program he used (QUAST) spit out a message indicating that it seemed like scaffolds were used, instead of contigs. So, I thought I’d give it a shot and see if I could track down non-scaffolded assemblies produced by Sean.

Jupyter notebook (GitHub): 20171003_docker_oly_assembly_comparisons.ipynb

First, I compared the following six assemblies (FASTA files) using QUAST:

Sean’s Assemblies:

Sam’s Assembly:

QUAST output directory: http://owl.fish.washington.edu/Athaliana/20171003_quast_oly_genome_assemblies/

Here’s the assembly comparison of all assemblies (click on image for larger view):

Interactive version of that graphic is here: http://owl.fish.washington.edu/Athaliana/20171003_quast_oly_genome_assemblies/report.html

The first thing that jumps out to me is the fact that two of the Illumina assemblies, which used different assemblers(!!) have the EXACT same assembly stats. This occurrence seems extremely unlikely. I’ve double-checked my Jupyter notebook to make sure that I didn’t assign the same file by accident (see Input #6)

Very strange!

I also noticed that the first Redundans assembly of Sean’s has a ton of “Ns”, suggesting that it’s actually a scaffolded assembly. As with Steven’s QUAST run, QUAST spits out the messages suggesting to use the “–scaffold” option for this file.

The other thing I noticed is the two PacBio assemblies (Canu & Racon) have a huge difference in the total number of bp (~13,000,000)! I ran a QUAST assembly comparison between just those two for easier viewing/comparison (http://owl.fish.washington.edu/Athaliana/20171003_quast_oly_pacbio_assemblies/):

Interactive version of that graphic is here: http://owl.fish.washington.edu/Athaliana/20171003_quast_oly_pacbio_assemblies/report.html

The fact that there is such a large discrepancy in the total number of bps between these two assemblies really leaves me to believe that I am missing a FASTQ file from my assembly. I’m going to go back and see if that is indeed the case or if this difference in the assemblies is real.

Here’s an embedded version of my Jupyter notebook:

Genome Assembly – Olympia oyster PacBio minimap/miniasm/racon

In this GitHub Issue, Steven had suggested I try out the minimap/miniasm/racon pipeline for assembling our Olympia oyster PacBio data.

I followed the pipeline described by this paper: http://matzlab.weebly.com/uploads/7/6/2/2/76229469/racon.pdf.

Previously, ran the first part of the pipeline: minimap

This notebook entry just contains the miniasm execution. Will follow with racon.

Jupyter Notebook (GitHub): 20170918_docker_pacbio_oly_miniasm0.2.ipynb

Genome Assembly – Olympia oyster PacBio minimap/miniasm/racon

In this GitHub Issue, Steven had suggested I try out the minimap/miniasm/racon pipeline for assembling our Olympia oyster PacBio data.

I followed the pipeline described by this paper: http://matzlab.weebly.com/uploads/7/6/2/2/76229469/racon.pdf.

This notebook entry just contains the initial minimap execution. Followed up with miniasm and then racon.

Jupyter Notebook (GitHub): 20170907_docker_pacbio_oly_minimap2.ipynb

Data Management – Olympia oyster UW PacBio Data from 20170323

Due to other priorities, getting this PacBio data sorted and prepped for our next gen sequencing data management plan (DMP) was put on the back burner. I finally got around to this, but it wasn’t all successful.

The primary failure is the inability to get the original data file archived as a gzipped tarball. The problem lies in loss of connection to Owl during the operation. This connection issue was recently noticed by Sean in his dealings with Hyak (mox). According to Sean, the Hyak (mox) people or UW IT ran some tests of their own on this connection and their results suggested that the connection issue is related to a network problem in FTR, and is not related to Owl itself. Whatever the case is, we need to have this issue addressed sometime soon…

Anyway, below is the Jupyter notebook that demonstrates the file manipulations I used to find, copy, rename, and verify data integrity of all the FASTQ files from this sequencing run.

Next up is to get these FASTQ files added to the DMP spreadsheets.

Jupyter notebook (GitHub): 20170622_oly_pacbio_data_management.ipynb

I’ve also embedded the notebook below, but it might be easier to view at the GitHub link provided above.

Computing – Oly BGI GBS Reproducibility; fail?

OK, so things have improved since the last attempt at getting this BGI script to run and demultiplex the raw data.

I played around with the index.lst file format (based on the error I received last time, it seemed like a good possibility that the file formatting was incorrect) and actually got the script to run to completion! Granted, it took over 16hrs (!!), but it completed!

See the Jupyter notebook link below.

Results:

Well, although the script finished and kicked out all the demultiplexed FASTQ files, the contents of the FASTQ files don’t match (the read counts differ between these results and the BGI files) the original set of demultiplexed files. I’m not entirely sure if this is to be expected or not, since the script allows for a single nucleotide mismatch when demultiplexing. Is it possible that the mismatch could be interpreted slightly differently each time this is run? I’m not certain.

Theoretically, you should get the same results every time…

Maybe I’ll re-run this again over the weekend and see how the results compare to this run and the original BGI demultiplexing…

Jupyter notebook (GitHub): 20170314_docker_Oly_BGI_GBS_demultiplexing_reproducibility.ipynb

Jupyter notebook (may be easier to view in GitHub link above):