# NovaSeq Assembly – The Struggle is Real – Real Annoying!

Well, I continue to struggle to makek progress on assembling the geoduck Illumina NovaSeq data. Granted, there is a ton of data (374GB!!!!), but it’s still frustrating that we can’t get an assembly anywhere…

Here are some of the struggles so far:

SOAPdenovo2

JR-Assembler

• Can’t install one of the dependencies (SOAP error correction)
• Actually, I need to try the binary version of this, instead of the source version (the source version fails at the make step)

So, next up will trying the following two assemblers:

• JR-Assembler: Will see if SOAPec binary will work, and then run an assembly.
• AllPaths-LG: I was able to install this successfully on Mox.

Additionally, we’ve ordered some additional hard drives and will be converting the old head/master node on the Apple Xserve cluster to Linux. The old master node is a little better equipped than the other Apple Xserve “birds”, so will try to re-run Meraculous on it once we get it converted.

# Assembly – Geoduck Illumina NovaSeq SOAPdenovo2 on Mox (FAIL)

Trying to get the NovaSeq data assembled using SOAPdenovo2 on the Mox HPC node we have and it will not work.

Tried a couple of times and it hasn’t run successfully. Here are links to the files used on Mox (including the batch script and slurm output files). I made slight changes to the formatting of the batch script because I thought there was something wrong. Specifically, the slurm output file in the 20180215 runs does not accurately reflect the command I issued (i.e. 1> ass.log is command, but slurm shows > ass.log).

NOTE: In the 20180218 run, I have excluded transferring the core dump file due to its crazy size:

Here’s the error log generated by SOAPdenovo2 in the 20180218 run (the last line is all you really need to see, though):

Version 2.04: released on July 13th, 2012
Compile May 10 2017 12:50:52

********************
Pregraph
********************

Parameters: pregraph -s /gscratch/scrubbed/samwhite/20180218_soapdenovo2_novaseq_geoduck/soap_config -K 117 -p 24 -o /gscratch/scrubbed/samwhite/20180218_soapdenovo2_novaseq_geoduck/

In /gscratch/scrubbed/samwhite/20180218_soapdenovo2_novaseq_geoduck/soap_config, 1 lib(s), maximum read length 150, maximum name length 256.

/gscratch/scrubbed/samwhite/20180129_trimmed_again/NR005_S4_L001_R1_001_val_1_val_1.fq.gz
/gscratch/scrubbed/samwhite/20180129_trimmed_again/NR005_S4_L001_R2_001_val_2_val_2.fq.gz
/gscratch/scrubbed/samwhite/20180129_trimmed_again/NR005_S4_L002_R1_001_val_1_val_1.fq.gz
/gscratch/scrubbed/samwhite/20180129_trimmed_again/NR005_S4_L002_R2_001_val_2_val_2.fq.gz
/gscratch/scrubbed/samwhite/20180129_trimmed_again/NR006_S3_L001_R1_001_val_1_val_1.fq.gz
/gscratch/scrubbed/samwhite/20180129_trimmed_again/NR006_S3_L001_R2_001_val_2_val_2.fq.gz
/gscratch/scrubbed/samwhite/20180129_trimmed_again/NR006_S3_L002_R1_001_val_1_val_1.fq.gz
/gscratch/scrubbed/samwhite/20180129_trimmed_again/NR006_S3_L002_R2_001_val_2_val_2.fq.gz
/gscratch/scrubbed/samwhite/20180129_trimmed_again/NR012_S1_L001_R1_001_val_1_val_1.fq.gz
/gscratch/scrubbed/samwhite/20180129_trimmed_again/NR012_S1_L001_R2_001_val_2_val_2.fq.gz
/gscratch/scrubbed/samwhite/20180129_trimmed_again/NR012_S1_L002_R1_001_val_1_val_1.fq.gz
/gscratch/scrubbed/samwhite/20180129_trimmed_again/NR012_S1_L002_R2_001_val_2_val_2.fq.gz
-- Out of memory --



I guess I’ll explore some other options for assembling these? I’m having a difficult time accepting that 500GB of RAM is insufficient, but that seems to be the case. Ouch.

# NovaSeq Assembly – Trimmed Geoduck NovaSeq with Meraculous

Attempted to use Meraculous to assemble the trimmed geoduck NovaSeq data.

Here’s the Meraculous manual (PDF).

After a bunch of various issues (running out of hard drive space – multiple times, config file issues, typos), I’ve finally given up on running meraculous. It failed, again, saying it couldn’t find a file in a directory that meraculous created! I’ve emailed the authors and if they have an easy fix, I’ll implement it and see what happens.

Anyway, it’s all documented in the Jupyter Notebook below.

One good thing came out of all of it is that I had to run kmergenie to identify an appopriate kmer size to use for assembly, as well as estimated genome size (this info is needed for both meraculous and SOAPdeNovo (which I’ll be trying next)):

kmergenie output folder: http://owl.fish.washington.edu/Athaliana/20180125_geoduck_novaseq/20180206_kmergenie/
kmergenie HTML report (doesn’t display histograms for some reason): 20180206_kmergenie/histograms_report.html
kmer size: 117
Est. genome size: 2.17Gbp

# Adapter Trimming and FASTQC – Illumina Geoduck Novaseq Data

We would like to get an assembly of the geoduck NovaSeq data that Illumina provided us with.

Steven previously ran the raw data through FASTQC and there was a significant amount of adapter contamination (up to 44% in some libraries) present (see his FASTQC report here).

So, I trimmed them using TrimGalore and re-ran FASTQC on them.

This required two rounds of trimming using the “auto-detect” feature of Trim Galore.

• Round 1: remove NovaSeq adapters
• Round 2: remove standard Illumina adapters

See Jupyter notebook below for the gritty details.

##### Results:

All data for this NovaSeq assembly project can be found here: http://owl.fish.washington.edu/Athaliana/20180125_geoduck_novaseq/.

Round 1 Trim Galore reports: [20180125_trim_galore_reports/](http://owl.fish.washington.edu/Athaliana/20180125_geoduck_novaseq/20180125_trim_galore_reports/]
Round 1 FASTQC: 20180129_trimmed_multiqc_fastqc_01
Round 1 FASTQC MultiQC overview: 20180129_trimmed_multiqc_fastqc_01/multiqc_report.html

Round 2 Trim Galore reports: 20180125_geoduck_novaseq/20180205_trim_galore_reports/
Round 2 FASTQC: 20180205_trimmed_fastqc_02/
Round 2 FASTQC MultiQC overview: 20180205_trimmed_multiqc_fastqc_02/multiqc_report.html

For the astute observer, you might notice the “Per Base Sequence Content” generates a “Fail” warning for all samples. Per the FASTQC help, this is likely expected (due to the fact that NovaSeq libraries are prepared using transposases) and doesn’t have any downstream impacts on analyses.

# Data Management – Illumina Geoduck HiSeq & MiSeq Data

The HDD we received from Illumina last week only had data (i.e. fastq files) from the NovaSeq runs they performed – nothing from either the MiSeq, nor the HiSeq runs.

We contacted them about the missing data, they confirmed it was missing, and uploaded the remaining data to BaseSpace.

Files will be temporarily stored in these locations:

/volume1/web/nightingales/Geoduck_MiSeq/170317_M03814_0172_000000000-B2K79/Data/GeoDuckRNAMiSeq-35978947

/volume1/web/nightingales/Geoduck_HiSeq/170228_ST-K00104_0382_BHHGTLBBXX/Data/Ironman-35682656

/volume1/web/nightingales/Geoduck_HiSeq/170228_ST-K00104_0381_AHHHWNBBXX/Data/Ironman-35682656

# Data Received – Geoduck Genome Sequencing by Illumina

We previously sent some geoduck samples to Illumina, as part of a pilot project for them to test out a new sequencing platform. The data has finally arrived!

It was sent on a 4TB Seagate external hard drive.

Due to weird connection issues we’ve recently encountered with our server, Owl (Synology DS1812+), I connected the HDD directly to Owl via USB (instead of connecting to a computer and transferring). I transferred the data using the Synology web interface to avoid any computer/NAS connection issues that might interrupt the transfer.

We have a meeting with the Illumina people tomorrow afternoon to review the data they’ve provided (looks like it’s going to take awhile, though). Once that meeting takes place, we’ll figure out how to document this project in our data management plan.

Jay received notice from UC Berkeley that the sequencing data from his coral RADseq was ready. In addition, the sequencing contains some epiRADseq data from samples provided by Hollie Putnam. See his notebook for multiple links that describe library preparation (indexing and barcodes), sample pooling, and species breakdown.

I’ve downloaded both the demultiplexed and non-demultiplexed data, verified data integrity by generating and comparing MD5 checksums, copied the files to each of the three species folders on owl/nightingales that were sequenced (Panopea generosa, Anthopleura elegantissima, Porites astreoides), generated and compared MD5 checksums for the files in their directories on owl/nightingales, and created/updated the readme files in each respective folder.

Data management is detailed in the Jupyter notebook below. The notebook is embedded in this post, but it may be easier to view on GitHub (linked below).

Readme files were updated outside of the notebook.

Jupyter notebook (GitHub): 20170227_docker_jay_ngs_data_retrieval.ipynb