# Data Management – Convert Oly PacBio H5 to FASTQ

After working with all of this Olympia oyster genome sequencing data, I remembered that we had an old, single PacBio SMRT cell file (from June 2013). This file didn’t seem to be included in any of Sean’s recent assemblies or mine. This is most likely because we have it in the PacBio H5 format and not in FASTQ.

I installed PacBio’s pbh5tools on my computer (swoose), converted the file, and moved it to owl/nightingales/O_lurida:

python bash5tools.py /mnt/owl/nightingales/O_lurida/m130619_081336_42134_c100525122550000001823081109281326_s1_p0.bas.h5 --outType fastq 

I generated an MD5 checksum and appended it to the checksums.md5 file in /owl/nightingales/O_lurida using the following command:

md5sum m130619_081336_42134_c100525122550000001823081109281326_s1_p0.fastq | awk '{print $2 " = "$1}' >> checksums.md5

The command above pipes the output to awk to format the output to match the existing format of the checksums.md5 file (i.e. filename = hash).
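Since `md5sum -c` expects `hash  filename` lines by default, entries stored as `filename = hash` need their columns swapped back before they can be verified. Here is a minimal sketch of that round trip, assuming file names without spaces. The scratch directory and `demo.fastq` are made up for illustration and are not the real Owl files.

```shell
set -eu
# Hypothetical demo data in a scratch directory (not the real Owl files).
workdir=$(mktemp -d)
cd "$workdir"
printf '@read1\nACGT\n+\nIIII\n' > demo.fastq

# Append an entry in the "filename = hash" layout used by checksums.md5.
md5sum demo.fastq | awk '{print $2 " = " $1}' >> checksums.md5

# Swap the columns back to "hash  filename" so md5sum -c can read them.
# Field 1 is the file name, field 2 the "=" sign, field 3 the hash.
verify_result=$(awk '{print $3 "  " $1}' checksums.md5 | md5sum -c -)
echo "$verify_result"
```

With `set -e`, the script stops if any listed file fails verification, so reaching the final `echo` means every entry matched.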

Will generate updated PacBio assemblies with Canu and/or Racon.

# Genome Assembly – Olympia oyster Redundans/Canu vs. Redundans/Racon

Decided to compare Redundans using the Canu assembly as the reference against Redundans using the Racon assembly as the reference. Both reference assemblies were built from just our PacBio data.

Jupyter notebook (GitHub): 20171005_docker_oly_redundans.ipynb

Notebook is also embedded at the end of this post.

Results:

It should be noted that the paired reads for each of the BGI mate-pair Illumina data sets did not assemble, just like the last time I used them:

• 160103_I137_FCH3V5YBBXX_L3_WHOSTibkDCABDLAAPEI-62_2.fq.gz
• 160103_I137_FCH3V5YBBXX_L3_WHOSTibkDCACDTAAPEI-75_2.fq.gz
• 160103_I137_FCH3V5YBBXX_L4_WHOSTibkDCABDLAAPEI-62_2.fq.gz
• 160103_I137_FCH3V5YBBXX_L4_WHOSTibkDCACDTAAPEI-75_2.fq.gz

Redundans with Canu is better, suggesting that the Canu assembly is the better of the two PacBio assemblies (which we had already suspected).

QUAST comparison using default settings:

QUAST comparison using --scaffolds setting:

# Genome Assembly – Olympia Oyster Redundans with Illumina + PacBio

Redundans should assemble both Illumina and PacBio data, so let’s do that.

Sean had previously performed this – twice actually:

It wasn’t entirely clear how he had run Redundans the first time. The second time, he used his Platanus contig FASTA file as the necessary reference assembly when running Redundans.

Since he had produced a good looking assembly from PacBio data using Canu, I decided to give Redundans a rip using that assembly.

I then compared all three Redundans runs using QUAST.

Jupyter notebook (GitHub): 20171004_docker_oly_redundans.ipynb

Notebook is also embedded at the bottom of this notebook entry (but, it should be easier to view at the link provided above).

Of note, Redundans didn’t find any alignments for the paired reads for each of the BGI mate-pair Illumina data sets:

• 160103_I137_FCH3V5YBBXX_L3_WHOSTibkDCABDLAAPEI-62_2.fq.gz
• 160103_I137_FCH3V5YBBXX_L3_WHOSTibkDCACDTAAPEI-75_2.fq.gz
• 160103_I137_FCH3V5YBBXX_L4_WHOSTibkDCABDLAAPEI-62_2.fq.gz
• 160103_I137_FCH3V5YBBXX_L4_WHOSTibkDCACDTAAPEI-75_2.fq.gz

First, I ran QUAST with the default settings:

Using that Canu assembly with Redundans certainly seems to result in a better assembly.

Decided to run QUAST with the --scaffolds option to see what happened:

The scaffolds with the “Ns” removed from them are appended with “_broken” – meaning the scaffolds were broken apart into contigs. Things are certainly cleaner when using the --scaffolds option, however, as far as I can tell, QUAST doesn’t actually generate a FASTA file with the “_broken” scaffolds!
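If a FASTA file of the broken scaffolds is needed, something similar can be produced by hand. Below is a rough awk sketch that cuts each scaffold at every run of 10 or more uppercase Ns. The 10-N threshold is my understanding of what QUAST uses and should be treated as an assumption. The file names, the sequences and the `_broken_<n>` naming are made up for the demo.

```shell
set -eu
workdir=$(mktemp -d)
# Made-up input: one scaffold with a 12-N gap (wrapped over two lines)
# and one gap-free scaffold.
printf '>scaffold1\nACGTACGTNNNNNN\nNNNNNNGGCCGGCC\n>scaffold2\nTTAATTAA\n' \
    > "$workdir/scaffolds.fa"

# Join wrapped sequence lines per record, then cut at each long run of Ns.
awk '
function emit(    n, i, parts) {
    if (name == "") return
    gsub(/NNNNNNNNNN+/, " ", seq)      # ten Ns with + on the last: 10 or more
    n = split(seq, parts, " ")         # " " also drops leading/trailing gaps
    for (i = 1; i <= n; i++)
        print ">" name "_broken_" i "\n" parts[i]
}
/^>/ { emit(); name = substr($1, 2); seq = ""; next }
     { seq = seq $0 }
END  { emit() }
' "$workdir/scaffolds.fa" > "$workdir/scaffolds_broken.fa"
cat "$workdir/scaffolds_broken.fa"
```

Shorter N runs and lowercase (soft-masked) n characters are left in place, so the result may not match QUAST’s internal split exactly.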

# Genome Assembly – minimap/miniasm/racon Overview

Previously, I used three tools (minimap, miniasm and racon) to do a quick assembly of our Olympia oyster PacBio data.

I’m just posting this quick overview to make it easier to follow what was actually done without having to read through three different notebook entries and corresponding Jupyter notebooks.

When I say “quick assembly”, I mean it. The entire assembly process probably takes about an hour on the computer I used – that seems fast.

Here’s the quick and dirty of what was done:

#### 1 Run minimap:

This uses a pre-built set of defaults (the ava-pb in the code below) for analyzing PacBio data. Minimap only accepts two FASTQ files and you need to map your FASTQ file against itself. So, if you have multiple FASTQ sequencing files, you have to concatenate them into a single file prior to running minimap.
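The concatenation mentioned above can be sketched like this, with a quick read count as a sanity check. The file names and reads are hypothetical stand-ins, and the count assumes plain four-line FASTQ records.

```shell
set -eu
workdir=$(mktemp -d)
# Two tiny stand-in FASTQ files (hypothetical names and reads).
printf '@r1\nACGT\n+\nIIII\n' > "$workdir/cell_a.fastq"
printf '@r2\nGGCC\n+\nIIII\n@r3\nTTAA\n+\nIIII\n' > "$workdir/cell_b.fastq"

# Concatenate every per-cell file into the single file that gets mapped
# against itself. The output name must not match the input glob.
cat "$workdir"/cell_*.fastq > "$workdir/all_reads_cat.fastq"

# Sanity check: with four lines per record, reads = lines / 4.
read_count=$(awk 'END {print NR / 4}' "$workdir/all_reads_cat.fastq")
echo "reads in concatenated file: $read_count"
```

Comparing that count against the sum of the per-file counts is a cheap way to confirm no input file was left out.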

minimap2 -x ava-pb -t 23 \
20170911_oly_pacbio_cat.fastq \
20170911_oly_pacbio_cat.fastq \
> 20170911_minimap2_pacbio_oly.paf

#### 2 Run miniasm:

This uses your concatenated FASTQ file and the PAF file output from the minimap step. The code below is taken from the example provided in the miniasm documentation; there are other options available.

miniasm \
-f \
/home/data/20170911_oly_pacbio_cat.fastq /home/data/20170911_minimap2_pacbio_oly.paf > /home/data/20170918_oly_pacbio_miniasm_reads.gfa

#### 3 Convert miniasm output GFA to FASTA

The FASTA file is needed to re-run minimap in Step 4 below.

awk '$1 ~/S/ {print ">"$2"\n"$3}' 20170918_oly_pacbio_miniasm_reads.gfa > 20170918_oly_pacbio_miniasm_reads.fasta
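To see what this conversion does, here is a small sketch on a made-up GFA file. In GFA, segment lines start with `S` and carry the sequence name in field 2 and the sequence itself in field 3. The unitig names and sequences below are invented, and the sketch uses an exact match (`$1 == "S"`) as a slightly stricter variant of the pattern match.

```shell
set -eu
workdir=$(mktemp -d)
# Tiny made-up GFA: two segment (S) lines plus one link (L) line.
printf 'S\tutg000001l\tACGTACGT\tLN:i:8\n'  > "$workdir/demo.gfa"
printf 'S\tutg000002l\tGGCCTTAA\tLN:i:8\n' >> "$workdir/demo.gfa"
printf 'L\tutg000001l\t+\tutg000002l\t+\t4M\n' >> "$workdir/demo.gfa"

# Keep only segment lines and write them out as FASTA records.
awk '$1 == "S" {print ">" $2 "\n" $3}' "$workdir/demo.gfa" > "$workdir/demo.fasta"
cat "$workdir/demo.fasta"
```

The link line is skipped, so the FASTA ends up with one record per segment.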

#### 4 Run minimap with default settings

Using the default settings maps the FASTQ reads back to the contigs (the FASTA file created in Step 3). These mappings are required for Racon assembly (Step 5).

minimap2 \
-t 23 \
20170918_oly_pacbio_miniasm_reads.fasta 20170905_minimap2_pacibio_oly.paf > 20170918_minimap2_mapping_fasta_oly_pacbio.paf

#### 5 Run racon

The output file is the FASTA file listed below.

racon -t 24 \
20170911_oly_pacbio_cat.fastq \
20170918_oly_pacbio_minimap_mappings.paf \
20170918_oly_pacbio_miniasm_assembly.gfa \
20170918_oly_pacbio_racon1_consensus.fasta

# Assembly Comparisons – Olympia oyster genome assemblies

### — UPDATE 20171009 —

Having run through this a bunch of times now, I realized that the analysis below incorrectly identifies the outputs from Sean’s Redundans runs. The correct output from each of those runs should be the “scaffolds.reduced.fa” FASTA files. The “contigs.fa” files that I linked to below are actually the assemblies produced by other programs, which are required as an input for Redundans.

I recently completed an assembly of the UW PacBio sequencing data using Racon and wanted some assembly stats, as well as a way to compare this assembly to the assemblies Sean had completed.

Additionally, Steven recently performed an assembly comparison and I noticed he got some odd results. Specifically, of the three assemblies he compared (PacBio x 1, Illumina x 2), both of the Illumina assemblies had a large quantity of “Ns” in the assemblies. This didn’t seem right and the comparison program he used (QUAST) spit out a message indicating that it seemed like scaffolds were used, instead of contigs. So, I thought I’d give it a shot and see if I could track down non-scaffolded assemblies produced by Sean.

Jupyter notebook (GitHub): 20171003_docker_oly_assembly_comparisons.ipynb

First, I compared the following six assemblies (FASTA files) using QUAST:

Sean’s Assemblies:

Sam’s Assembly:

QUAST output directory: http://owl.fish.washington.edu/Athaliana/20171003_quast_oly_genome_assemblies/

Here’s the assembly comparison of all assemblies (click on image for larger view):

Interactive version of that graphic is here: http://owl.fish.washington.edu/Athaliana/20171003_quast_oly_genome_assemblies/report.html

The first thing that jumps out to me is the fact that two of the Illumina assemblies, which used different assemblers(!!), have the EXACT same assembly stats. This occurrence seems extremely unlikely. I’ve double-checked my Jupyter notebook to make sure that I didn’t assign the same file by accident (see Input #6).

Very strange!

I also noticed that the first Redundans assembly of Sean’s has a ton of “Ns”, suggesting that it’s actually a scaffolded assembly. As with Steven’s QUAST run, QUAST spits out a message suggesting the “--scaffolds” option for this file.

The other thing I noticed is the two PacBio assemblies (Canu & Racon) have a huge difference in the total number of bp (~13,000,000)! I ran a QUAST assembly comparison between just those two for easier viewing/comparison (http://owl.fish.washington.edu/Athaliana/20171003_quast_oly_pacbio_assemblies/):

Interactive version of that graphic is here: http://owl.fish.washington.edu/Athaliana/20171003_quast_oly_pacbio_assemblies/report.html

The fact that there is such a large discrepancy in the total number of bps between these two assemblies really leads me to believe that I am missing a FASTQ file from my assembly. I’m going to go back and see if that is indeed the case or if this difference in the assemblies is real.
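For a quick cross-check of totals like these outside of QUAST, a few lines of awk will do. This is only a sketch: the two FASTA files are tiny made-up stand-ins, and `fasta_stats` is a helper name invented here. It reports the number of sequences, total bp and the count of N bases for each file.

```shell
set -eu
workdir=$(mktemp -d)
# Two tiny stand-in assemblies (hypothetical names and sequences).
printf '>tig1\nACGTACGT\nACGT\n>tig2\nGGNNCC\n' > "$workdir/assembly_a.fasta"
printf '>ctg1\nACGTAC\n' > "$workdir/assembly_b.fasta"

# Print sequence count, total bp and N count (tab-separated) for one FASTA.
fasta_stats() {
    awk '
    /^>/ { seqs++; next }
         { total += length($0); ns += gsub(/[Nn]/, "") }
    END  { printf "%d\t%d\t%d\n", seqs, total, ns }
    ' "$1"
}

for fasta in "$workdir"/assembly_*.fasta; do
    printf '%s\t%s\n' "$(basename "$fasta")" "$(fasta_stats "$fasta")"
done
```

Running the same function over the concatenated input FASTQ would need a different record parser, so this only covers the assembly side of the comparison.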

Here’s an embedded version of my Jupyter notebook:

# Genome Assembly – Olympia oyster PacBio minimap/miniasm/racon

In this GitHub Issue, Steven had suggested I try out the minimap/miniasm/racon pipeline for assembling our Olympia oyster PacBio data.

I followed the pipeline described by this paper: http://matzlab.weebly.com/uploads/7/6/2/2/76229469/racon.pdf.

Previously, ran the first part of the pipeline: minimap

This notebook entry just contains the miniasm execution. Will follow with racon.

Jupyter Notebook (GitHub): 20170918_docker_pacbio_oly_miniasm0.2.ipynb

# Genome Assembly – Olympia oyster PacBio minimap/miniasm/racon

In this GitHub Issue, Steven had suggested I try out the minimap/miniasm/racon pipeline for assembling our Olympia oyster PacBio data.

I followed the pipeline described by this paper: http://matzlab.weebly.com/uploads/7/6/2/2/76229469/racon.pdf.

This notebook entry just contains the initial minimap execution. Followed up with miniasm and then racon.

Jupyter Notebook (GitHub): 20170907_docker_pacbio_oly_minimap2.ipynb

# Project Progress – Olympia Oyster Genome Assemblies by Sean Bennett

Here’s a brief overview of what Sean has done on the Oly genome assembly front.

Metassembler

• Assemble his BGI assembly and Platanus assembly? Confusing terms here; not sure what he means.
• Failed due to 32-bit vs. 64-bit installation of MUMmer. He didn’t have the chance to re-compile MUMmer as 64-bit. However, a recent MUMmer announcement suggests that MUMmer can now handle genomes of unlimited size.
• I believe he was planning on using (or was using?) GARM, which relies upon MUMmer and may also include a version of MUMmer (outdated version that led to Sean’s error message?).
• Notebook entry

Canu

Redundans

Platanus

# Data Management – SRA Submission Olympia Oyster UW PacBio Data from 20170323

Submitted the FASTQ files from the UW PacBio Data from 20170323 to the NCBI sequence read archive (SRA).

FTP’d the data to NCBI’s servers, following their instructions. Briefly,

Change to the directory where the FASTQ files are (Owl/web/nightingales/O_lurida) and then initiate an FTP session:

ftp -i ftp-private.ncbi.nlm.nih.gov

Enter the provided username/password, change to my designated uploads directory, and create a new folder dedicated to this particular upload. Then, upload all the files using the mput command:

mput *filtered_subreads*

SRA deets are below (assigned FASTQ files to existing BioProject and created a new BioSample). Will update post with SRA number when processing is complete on the NCBI end.

SRA: SRS2339870
Study: SRR5809355
BioProject: PRJNA316624
BioSample: SAMN07326085

# Data Management – Tarball of Olympia oyster UW PacBio Data from 20170323

I’d previously attempted to archive this data set on multiple occasions, across multiple days, but network dropouts kept killing my connection to the server (Owl) and, in turn, interrupting the tarball operation.

Today, I came in to a successful creation of the tarball of this PacBio data set (it only took 10hrs)! And, it’s a big file: 162GB!! Remember, that’s the compressed size!
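For future archiving runs, one way to keep a dropped connection from killing the job is to detach tar from the terminal. Here is a minimal sketch, assuming the command is started in a shell session on the server itself. The directory and file names are stand-ins, not the real data set.

```shell
set -eu
workdir=$(mktemp -d)
# Stand-in data directory (hypothetical; the real data set lives on Owl).
mkdir "$workdir/pacbio_data"
printf '@r1\nACGT\n+\nIIII\n' > "$workdir/pacbio_data/cell_a.fastq"

# nohup makes tar ignore the hangup signal sent when a session drops;
# the trailing & runs it in the background.
nohup tar -czf "$workdir/pacbio_data.tar.gz" -C "$workdir" pacbio_data \
    > "$workdir/tar.log" 2>&1 &
tar_pid=$!

# In real use you could log out here; the demo just waits for tar to finish.
wait "$tar_pid"

# Listing the archive contents is a cheap integrity check.
archive_listing=$(tar -tzf "$workdir/pacbio_data.tar.gz")
echo "$archive_listing"
```

A terminal multiplexer such as screen or tmux would serve the same purpose, with the added option of reattaching to watch progress.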

Now, we’ll have to decide where we want to keep the tarball. I guess this’ll be part of our next data management plan discussions.