Assembly Comparison – Oly Assemblies Using Quast

I ran Quast to compare all of our current Olympia oyster genome assemblies.

See Jupyter Notebook in Results section for Quast execution.

Results:

Heatmapped table of results: http://owl.fish.washington.edu/Athaliana/quast_results/results_2017_11_14_12_30_25/report.html

Very enlightening!

BEST OF:

Largest Contig: redundans_sjw_02 (322,397bp)
Total Length: soap_bgi_01 & pbjelly_sjw_01 (697,528,655bp)
Total Length (>=50,000bp): redundans_sjw_03 (17,006,058bp)
N50: redundans_sjw_03 (17,679bp)

Interesting tidbit: The pbjelly_sjw_01 assembly is EXACTLY the same as the soap_bgi_01. Looking at the output messages from that PB Jelly assembly, one can see why. The messages indicate that no gaps were filled on the BGI scaffold reference! That means the PB Jelly output is just the BGI scaffold reference assembly!

Jupyter Notebook (GitHub): 20171114_swoose_oly_assembly_comparisons_quast.ipynb

Genome Assembly – Olympia oyster Illumina & PacBio Reads Using Redundans

Had problems with Docker and Jupyter Notebook inexplicably dying and deleting all the files in the working directory of the Jupyter Notebook (which also happened to be the volume mounted in the Docker container).

So, I ran this on my computer, but didn’t have Jupyter installed (yet).

This utilized the Canu contigs file (FASTA) that I generated on 20171018.

Here’s the input command:

sudo python /home/sam/software/redundans/redundans.py -t 24 -l m130619_081336_42134_c100525122550000001823081109281326_s1_p0.fastq.gz m170211_224036_42134_c101073082550000001823236402101737_s1_X0_filtered_subreads.fastq.gz m170301_100013_42134_c101174162550000001823269408211761_s1_p0_filtered_subreads.fastq.gz m170301_162825_42134_c101174162550000001823269408211762_s1_p0_filtered_subreads.fastq.gz m170301_225711_42134_c101174162550000001823269408211763_s1_p0_filtered_subreads.fastq.gz m170308_163922_42134_c101174252550000001823269408211742_s1_p0_filtered_subreads.fastq.gz m170308_230815_42134_c101174252550000001823269408211743_s1_p0_filtered_subreads.fastq.gz m170315_001112_42134_c101169372550000001823273008151717_s1_p0_filtered_subreads.fastq.gz m170315_063041_42134_c101169382550000001823273008151700_s1_p0_filtered_subreads.fastq.gz m170315_124938_42134_c101169382550000001823273008151701_s1_p0_filtered_subreads.fastq.gz m170315_190851_42134_c101169382550000001823273008151702_s1_p0_filtered_subreads.fastq.gz -i 151114_I191_FCH3Y35BCXX_L1_wHAIPI023992-37_1.fq.gz 151114_I191_FCH3Y35BCXX_L1_wHAIPI023992-37_2.fq.gz 151114_I191_FCH3Y35BCXX_L2_wHAMPI023991-66_1.fq.gz 151114_I191_FCH3Y35BCXX_L2_wHAMPI023991-66_2.fq.gz 151118_I137_FCH3KNJBBXX_L5_wHAXPI023905-96_1.fq.gz 151118_I137_FCH3KNJBBXX_L5_wHAXPI023905-96_2.fq.gz 160103_I137_FCH3V5YBBXX_L3_WHOSTibkDCABDLAAPEI-62_1.fq.gz 160103_I137_FCH3V5YBBXX_L3_WHOSTibkDCABDLAAPEI-62_2.fq.gz 160103_I137_FCH3V5YBBXX_L3_WHOSTibkDCACDTAAPEI-75_1.fq.gz 160103_I137_FCH3V5YBBXX_L3_WHOSTibkDCACDTAAPEI-75_2.fq.gz 160103_I137_FCH3V5YBBXX_L4_WHOSTibkDCABDLAAPEI-62_1.fq.gz 160103_I137_FCH3V5YBBXX_L4_WHOSTibkDCABDLAAPEI-62_2.fq.gz 160103_I137_FCH3V5YBBXX_L4_WHOSTibkDCACDTAAPEI-75_1.fq.gz 160103_I137_FCH3V5YBBXX_L4_WHOSTibkDCACDTAAPEI-75_2.fq.gz 160103_I137_FCH3V5YBBXX_L5_WHOSTibkDCAADWAAPEI-74_1.fq.gz 160103_I137_FCH3V5YBBXX_L5_WHOSTibkDCAADWAAPEI-74_2.fq.gz 160103_I137_FCH3V5YBBXX_L6_WHOSTibkDCAADWAAPEI-74_1.fq.gz 160103_I137_FCH3V5YBBXX_L6_WHOSTibkDCAADWAAPEI-74_2.fq.gz -f 20171018_oly_pacbio.contigs.fasta -o /home/data/20171024_docker_oly_redundans_01/

This completed in just over 19hrs.

Copied output files to Owl: http://owl.fish.washington.edu/Athaliana/20171024_docker_oly_redundans_01/

Here’s the desired output file (FASTA): scaffolds.reduced.fa

Will add to our genome assemblies table.

Ran Quast on 20171103 for some assembly stats.

Quast output is here: http://owl.fish.washington.edu/Athaliana/quast_results/results_2017_11_03_22_43_06/

Genome Assembly – Olympia oyster Redundans/Canu vs. Redundans/Racon

Decided to compare the Redundans using Canu as reference and Redundans using Racon as reference. Both reference assemblies were just our PacBio data.

Jupyter notebook (GitHub): 20171005_docker_oly_redundans.ipynb

Notebook is also embedded at the end of this post.

Results:

It should be noted that the paired reads for each of the BGI mate-pair Illumina data did not assemble, just like last time I used them:

• 160103_I137_FCH3V5YBBXX_L3_WHOSTibkDCABDLAAPEI-62_2.fq.gz
• 160103_I137_FCH3V5YBBXX_L3_WHOSTibkDCACDTAAPEI-75_2.fq.gz
• 160103_I137_FCH3V5YBBXX_L4_WHOSTibkDCABDLAAPEI-62_2.fq.gz
• 160103_I137_FCH3V5YBBXX_L4_WHOSTibkDCACDTAAPEI-75_2.fq.gz

Redundans with Canu is better, suggesting that the Canu assembly is the better of the two PacBio assemblies (which we had already suspected).

QUAST comparison using default settings:

QUAST comparison using –scaffolds setting:

Genome Assembly – Olympia Oyster Redundans with Illumina + PacBio

Redundans should assemble both Illumina and PacBio data, so let’s do that.

Sean had previously performed this – twice actually:

It wasn’t entirely clear how he had run Redundans the first time and the second time he used his Platinus contig FASTA file as the necessary reference assembly when running Redundans.

Since he had produced a good looking assembly from PacBio data using Canu, I decided to give Redundans a rip using that assembly.

I then compared all three Redundans runs using QUAST.

Jupyter notebook (GitHub): 20171004_docker_oly_redundans.ipynb

Notebook is also embedded at the bottom of this notebook entry (but, it should be easier to view at the link provided above).

Of note, is that Redundans didn’t find any alignments for the paired reads for each of the BGI mate-pair Illumina data:

• 160103_I137_FCH3V5YBBXX_L3_WHOSTibkDCABDLAAPEI-62_2.fq.gz
• 160103_I137_FCH3V5YBBXX_L3_WHOSTibkDCACDTAAPEI-75_2.fq.gz
• 160103_I137_FCH3V5YBBXX_L4_WHOSTibkDCABDLAAPEI-62_2.fq.gz
• 160103_I137_FCH3V5YBBXX_L4_WHOSTibkDCACDTAAPEI-75_2.fq.gz

First, I ran QUAST with the default settings:

Using that Canu assembly with Redundans certainly seems to results in a better assembly.

Decided to run QUAST with the –scaffolds option to see what happened:

The scaffolds with the “Ns” removed from them are appended with “_broken” – meaning the scaffolds were broken apart into contigs. Things are certainly cleaner when using the --scaffolds option, however, as far as I can tell, QUAST doesn’t actually generate a FASTA file with the “_broken” scaffolds!

Assembly Comparisons – Olympia oyster genome assemblies

— UPDATE 20171009 —

Having run through this a bunch of times now, I realized that the analysis below incorrectly identifies the outputs from Sean’s Redundans runs. The correct output from each of those runs should be the “scaffolds.reduced.fa” FAST files. The “contigs.fa” files that I linked to below are actually the assemblies produced by other programs; which are required as an input for Redudans.

I recently completed an assembly of the UW PacBio sequencing data using Racon and wanted some assembly stats, as well as a way to compare this assembly to the assemblies Sean had completed.

Additionally, Steven recently performed an assembly comparison and I noticed he got some odd results. Specifically, of the three assemblies he compared (PacBio x 1, Illumina x 2), both of the Illumina assemblies had a large quantity of “Ns” in the assemblies. This didn’t seem right and the comparison program he used (QUAST) spit out a message indicating that it seemed like scaffolds were used, instead of contigs. So, I thought I’d give it a shot and see if I could track down non-scaffolded assemblies produced by Sean.

Jupyter notebook (GitHub): 20171003_docker_oly_assembly_comparisons.ipynb

First, I compared the following six assemblies (FASTA files) using QUAST:

Sean’s Assemblies:

Sam’s Assembly:

QUAST output directory: http://owl.fish.washington.edu/Athaliana/20171003_quast_oly_genome_assemblies/

Here’s the assembly comparison of all assemblies (click on image for larger view):

Interactive version of that graphic is here: http://owl.fish.washington.edu/Athaliana/20171003_quast_oly_genome_assemblies/report.html

The first thing that jumps out to me is the fact that two of the Illumina assemblies, which used different assemblers(!!) have the EXACT same assembly stats. This occurrence seems extremely unlikely. I’ve double-checked my Jupyter notebook to make sure that I didn’t assign the same file by accident (see Input #6)

Very strange!

I also noticed that the first Redundans assembly of Sean’s has a ton of “Ns”, suggesting that it’s actually a scaffolded assembly. As with Steven’s QUAST run, QUAST spits out the messages suggesting to use the “–scaffold” option for this file.

The other thing I noticed is the two PacBio assemblies (Canu & Racon) have a huge difference in the total number of bp (~13,000,000)! I ran a QUAST assembly comparison between just those two for easier viewing/comparison (http://owl.fish.washington.edu/Athaliana/20171003_quast_oly_pacbio_assemblies/):

Interactive version of that graphic is here: http://owl.fish.washington.edu/Athaliana/20171003_quast_oly_pacbio_assemblies/report.html

The fact that there is such a large discrepancy in the total number of bps between these two assemblies really leaves me to believe that I am missing a FASTQ file from my assembly. I’m going to go back and see if that is indeed the case or if this difference in the assemblies is real.

Here’s an embedded version of my Jupyter notebook: