Tag Archives: NovaSeq

Data Management – Illumina NovaSeq Geoduck Genome Sequencing

As part of the Illumina collaborative geoduck genome sequencing project, their end goal has always been to sequence the genome in a single run.

They’ve finally attempted this by running 10x Genomics, Hi-C, Nextera, and TruSeq libraries in a single run of the NovaSeq.

I downloaded the data using the BaseSpace downloader using Chrome on a Windows 7 computer (this is not available on Ubuntu and the command line tools that are available from Illumina are too confusing for me to bother spending the time on to figure out how to use them just to download the data).

Data was saved here:

Generated MD5 checksums (using md5sum on Ubuntu) and appended to the checksums file:

Illumina was unable to provide MD5 checksums on their end, so I was unable to confirm data integrity post-download.

Illumina sample info is here:

Will add info to:

List of files received:

10x-Genomics-Libraries-Geo10x5-A3-MultipleA_S10_L001_R1_001.fastq.gz
10x-Genomics-Libraries-Geo10x5-A3-MultipleA_S10_L001_R2_001.fastq.gz
10x-Genomics-Libraries-Geo10x5-A3-MultipleA_S10_L002_R1_001.fastq.gz
10x-Genomics-Libraries-Geo10x5-A3-MultipleA_S10_L002_R2_001.fastq.gz
10x-Genomics-Libraries-Geo10x5-A3-MultipleB_S11_L001_R1_001.fastq.gz
10x-Genomics-Libraries-Geo10x5-A3-MultipleB_S11_L001_R2_001.fastq.gz
10x-Genomics-Libraries-Geo10x5-A3-MultipleB_S11_L002_R1_001.fastq.gz
10x-Genomics-Libraries-Geo10x5-A3-MultipleB_S11_L002_R2_001.fastq.gz
10x-Genomics-Libraries-Geo10x5-A3-MultipleC_S12_L001_R1_001.fastq.gz
10x-Genomics-Libraries-Geo10x5-A3-MultipleC_S12_L001_R2_001.fastq.gz
10x-Genomics-Libraries-Geo10x5-A3-MultipleC_S12_L002_R1_001.fastq.gz
10x-Genomics-Libraries-Geo10x5-A3-MultipleC_S12_L002_R2_001.fastq.gz
10x-Genomics-Libraries-Geo10x5-A3-MultipleD_S13_L001_R1_001.fastq.gz
10x-Genomics-Libraries-Geo10x5-A3-MultipleD_S13_L001_R2_001.fastq.gz
10x-Genomics-Libraries-Geo10x5-A3-MultipleD_S13_L002_R1_001.fastq.gz
10x-Genomics-Libraries-Geo10x5-A3-MultipleD_S13_L002_R2_001.fastq.gz
10x-Genomics-Libraries-Geo10x6-B3-MultipleA_S14_L001_R1_001.fastq.gz
10x-Genomics-Libraries-Geo10x6-B3-MultipleA_S14_L001_R2_001.fastq.gz
10x-Genomics-Libraries-Geo10x6-B3-MultipleA_S14_L002_R1_001.fastq.gz
10x-Genomics-Libraries-Geo10x6-B3-MultipleA_S14_L002_R2_001.fastq.gz
10x-Genomics-Libraries-Geo10x6-B3-MultipleB_S15_L001_R1_001.fastq.gz
10x-Genomics-Libraries-Geo10x6-B3-MultipleB_S15_L001_R2_001.fastq.gz
10x-Genomics-Libraries-Geo10x6-B3-MultipleB_S15_L002_R1_001.fastq.gz
10x-Genomics-Libraries-Geo10x6-B3-MultipleB_S15_L002_R2_001.fastq.gz
10x-Genomics-Libraries-Geo10x6-B3-MultipleC_S16_L001_R1_001.fastq.gz
10x-Genomics-Libraries-Geo10x6-B3-MultipleC_S16_L001_R2_001.fastq.gz
10x-Genomics-Libraries-Geo10x6-B3-MultipleC_S16_L002_R1_001.fastq.gz
10x-Genomics-Libraries-Geo10x6-B3-MultipleC_S16_L002_R2_001.fastq.gz
10x-Genomics-Libraries-Geo10x6-B3-MultipleD_S17_L001_R1_001.fastq.gz
10x-Genomics-Libraries-Geo10x6-B3-MultipleD_S17_L001_R2_001.fastq.gz
10x-Genomics-Libraries-Geo10x6-B3-MultipleD_S17_L002_R1_001.fastq.gz
10x-Genomics-Libraries-Geo10x6-B3-MultipleD_S17_L002_R2_001.fastq.gz
HiC-Libraries-GeoHiC-C3-N701_S18_L001_R1_001.fastq.gz
HiC-Libraries-GeoHiC-C3-N701_S18_L001_R2_001.fastq.gz
HiC-Libraries-GeoHiC-C3-N701_S18_L002_R1_001.fastq.gz
HiC-Libraries-GeoHiC-C3-N701_S18_L002_R2_001.fastq.gz
Nextera-Mate-Pair-Library-GeoNMP10-B2-AD013_S7_L001_R1_001.fastq.gz
Nextera-Mate-Pair-Library-GeoNMP10-B2-AD013_S7_L001_R2_001.fastq.gz
Nextera-Mate-Pair-Library-GeoNMP10-B2-AD013_S7_L002_R1_001.fastq.gz
Nextera-Mate-Pair-Library-GeoNMP10-B2-AD013_S7_L002_R2_001.fastq.gz
Nextera-Mate-Pair-Library-GeoNMP11-C2-AD014_S8_L001_R1_001.fastq.gz
Nextera-Mate-Pair-Library-GeoNMP11-C2-AD014_S8_L001_R2_001.fastq.gz
Nextera-Mate-Pair-Library-GeoNMP11-C2-AD014_S8_L002_R1_001.fastq.gz
Nextera-Mate-Pair-Library-GeoNMP11-C2-AD014_S8_L002_R2_001.fastq.gz
Nextera-Mate-Pair-Library-GeoNMP12-D2-AD015_S9_L001_R1_001.fastq.gz
Nextera-Mate-Pair-Library-GeoNMP12-D2-AD015_S9_L001_R2_001.fastq.gz
Nextera-Mate-Pair-Library-GeoNMP12-D2-AD015_S9_L002_R1_001.fastq.gz
Nextera-Mate-Pair-Library-GeoNMP12-D2-AD015_S9_L002_R2_001.fastq.gz
Nextera-Mate-Pair-Library-GeoNMP9-A2-AD002_S6_L001_R1_001.fastq.gz
Nextera-Mate-Pair-Library-GeoNMP9-A2-AD002_S6_L001_R2_001.fastq.gz
Nextera-Mate-Pair-Library-GeoNMP9-A2-AD002_S6_L002_R1_001.fastq.gz
Nextera-Mate-Pair-Library-GeoNMP9-A2-AD002_S6_L002_R2_001.fastq.gz
Trueseq-stranded-mRNA-libraries-GeoRNA1-A1-NR006_S1_L001_R1_001.fastq.gz
Trueseq-stranded-mRNA-libraries-GeoRNA1-A1-NR006_S1_L001_R2_001.fastq.gz
Trueseq-stranded-mRNA-libraries-GeoRNA1-A1-NR006_S1_L002_R1_001.fastq.gz
Trueseq-stranded-mRNA-libraries-GeoRNA1-A1-NR006_S1_L002_R2_001.fastq.gz
Trueseq-stranded-mRNA-libraries-GeoRNA3-C1-NR012_S2_L001_R1_001.fastq.gz
Trueseq-stranded-mRNA-libraries-GeoRNA3-C1-NR012_S2_L001_R2_001.fastq.gz
Trueseq-stranded-mRNA-libraries-GeoRNA3-C1-NR012_S2_L002_R1_001.fastq.gz
Trueseq-stranded-mRNA-libraries-GeoRNA3-C1-NR012_S2_L002_R2_001.fastq.gz
Trueseq-stranded-mRNA-libraries-GeoRNA5-E1-NR005_S3_L001_R1_001.fastq.gz
Trueseq-stranded-mRNA-libraries-GeoRNA5-E1-NR005_S3_L001_R2_001.fastq.gz
Trueseq-stranded-mRNA-libraries-GeoRNA5-E1-NR005_S3_L002_R1_001.fastq.gz
Trueseq-stranded-mRNA-libraries-GeoRNA5-E1-NR005_S3_L002_R2_001.fastq.gz
Trueseq-stranded-mRNA-libraries-GeoRNA7-G1-NR019_S4_L001_R1_001.fastq.gz
Trueseq-stranded-mRNA-libraries-GeoRNA7-G1-NR019_S4_L001_R2_001.fastq.gz
Trueseq-stranded-mRNA-libraries-GeoRNA7-G1-NR019_S4_L002_R1_001.fastq.gz
Trueseq-stranded-mRNA-libraries-GeoRNA7-G1-NR019_S4_L002_R2_001.fastq.gz
Trueseq-stranded-mRNA-libraries-GeoRNA8-H1-NR021_S5_L001_R1_001.fastq.gz
Trueseq-stranded-mRNA-libraries-GeoRNA8-H1-NR021_S5_L001_R2_001.fastq.gz
Trueseq-stranded-mRNA-libraries-GeoRNA8-H1-NR021_S5_L002_R1_001.fastq.gz
Trueseq-stranded-mRNA-libraries-GeoRNA8-H1-NR021_S5_L002_R2_001.fastq.gz
Share

Genome Assembly – SparseAssembler Geoduck Genomic Data, kmer=101

UPDATE 20180413

Assembly complete. See end of post for data locations.


UPDATE 20180410

Received a status update email:

SLURM Job_id=156637 Name=20180405_sparse_assembler_kmer101_geo Ended, Run time 4-20:17:08, CANCELLED, ExitCode 0

After talking to Steven, it turns out Mox was taken offline for maintenance, which killed all jobs (and access). Ugh.

Will restart tonight once Mox is back online.


OK, here we go! Initiatied an assembly run using SparseAssembler on our Mox HPC node on all of our geoduck genomic sequencing data:

Kmer size set to 101.

This is 118 files of sequencing data!! Fingers crossed…

Slurm script: 20180405_sparse_assembler_kmer101_geoduck_slurm.sh


#!/bin/bash
## Job Name
#SBATCH --job-name=20180405_sparse_assembler_kmer101_geo
## Allocation Definition
#SBATCH --account=srlab
#SBATCH --partition=srlab
## Resources
## Nodes (We only get 1, so this is fixed)
#SBATCH --nodes=1
## Walltime (days-hours:minutes:seconds format)
#SBATCH --time=30-00:00:00
## Memory per node
#SBATCH --mem=500G
##turn on e-mail notification
#SBATCH --mail-type=ALL
#SBATCH --mail-user=samwhite@uw.edu
## Specify the working directory for this job
#SBATCH --workdir=/gscratch/scrubbed/samwhite/20180405_sparseassembler_kmer101_geoduck

/gscratch/srlab/programs/SparseAssembler/SparseAssembler 
LD 0 
NodeCovTh 1 
EdgeCovTh 0 
k 101 
g 15 
PathCovTh 100 
GS 2200000000 
i1 /gscratch/scrubbed/samwhite/20180129_trimgalore_geoduck_novaseq/AD002_S9_L001_R1_001_val_1_val_1.fastq 
i2 /gscratch/scrubbed/samwhite/20180129_trimgalore_geoduck_novaseq/AD002_S9_L001_R2_001_val_2_val_2.fastq 
i1 /gscratch/scrubbed/samwhite/20180129_trimgalore_geoduck_novaseq/AD002_S9_L002_R1_001_val_1_val_1.fastq 
i2 /gscratch/scrubbed/samwhite/20180129_trimgalore_geoduck_novaseq/AD002_S9_L002_R2_001_val_2_val_2.fastq 
i1 /gscratch/scrubbed/samwhite/20180129_trimgalore_geoduck_novaseq/NR005_S4_L001_R1_001_val_1_val_1.fastq 
i2 /gscratch/scrubbed/samwhite/20180129_trimgalore_geoduck_novaseq/NR005_S4_L001_R2_001_val_2_val_2.fastq 
i1 /gscratch/scrubbed/samwhite/20180129_trimgalore_geoduck_novaseq/NR005_S4_L002_R1_001_val_1_val_1.fastq 
i2 /gscratch/scrubbed/samwhite/20180129_trimgalore_geoduck_novaseq/NR005_S4_L002_R2_001_val_2_val_2.fastq 
i1 /gscratch/scrubbed/samwhite/20180129_trimgalore_geoduck_novaseq/NR006_S3_L001_R1_001_val_1_val_1.fastq 
i2 /gscratch/scrubbed/samwhite/20180129_trimgalore_geoduck_novaseq/NR006_S3_L001_R2_001_val_2_val_2.fastq 
i1 /gscratch/scrubbed/samwhite/20180129_trimgalore_geoduck_novaseq/NR006_S3_L002_R1_001_val_1_val_1.fastq 
i2 /gscratch/scrubbed/samwhite/20180129_trimgalore_geoduck_novaseq/NR006_S3_L002_R2_001_val_2_val_2.fastq 
i1 /gscratch/scrubbed/samwhite/20180129_trimgalore_geoduck_novaseq/NR012_S1_L001_R1_001_val_1_val_1.fastq 
i2 /gscratch/scrubbed/samwhite/20180129_trimgalore_geoduck_novaseq/NR012_S1_L001_R2_001_val_2_val_2.fastq 
i1 /gscratch/scrubbed/samwhite/20180129_trimgalore_geoduck_novaseq/NR012_S1_L002_R1_001_val_1_val_1.fastq 
i2 /gscratch/scrubbed/samwhite/20180129_trimgalore_geoduck_novaseq/NR012_S1_L002_R2_001_val_2_val_2.fastq 
i1 /gscratch/scrubbed/samwhite/20180129_trimgalore_geoduck_novaseq/NR013_AD013_S2_L001_R1_001_val_1_val_1.fastq 
i2 /gscratch/scrubbed/samwhite/20180129_trimgalore_geoduck_novaseq/NR013_AD013_S2_L001_R2_001_val_2_val_2.fastq 
i1 /gscratch/scrubbed/samwhite/20180129_trimgalore_geoduck_novaseq/NR013_AD013_S2_L002_R1_001_val_1_val_1.fastq 
i2 /gscratch/scrubbed/samwhite/20180129_trimgalore_geoduck_novaseq/NR013_AD013_S2_L002_R2_001_val_2_val_2.fastq 
i1 /gscratch/scrubbed/samwhite/20180129_trimgalore_geoduck_novaseq/NR014_AD014_S5_L001_R1_001_val_1_val_1.fastq 
i2 /gscratch/scrubbed/samwhite/20180129_trimgalore_geoduck_novaseq/NR014_AD014_S5_L001_R2_001_val_2_val_2.fastq 
i1 /gscratch/scrubbed/samwhite/20180129_trimgalore_geoduck_novaseq/NR014_AD014_S5_L002_R1_001_val_1_val_1.fastq 
i2 /gscratch/scrubbed/samwhite/20180129_trimgalore_geoduck_novaseq/NR014_AD014_S5_L002_R2_001_val_2_val_2.fastq 
i1 /gscratch/scrubbed/samwhite/20180129_trimgalore_geoduck_novaseq/NR015_AD015_S6_L001_R1_001_val_1_val_1.fastq 
i2 /gscratch/scrubbed/samwhite/20180129_trimgalore_geoduck_novaseq/NR015_AD015_S6_L001_R2_001_val_2_val_2.fastq 
i1 /gscratch/scrubbed/samwhite/20180129_trimgalore_geoduck_novaseq/NR015_AD015_S6_L002_R1_001_val_1_val_1.fastq 
i2 /gscratch/scrubbed/samwhite/20180129_trimgalore_geoduck_novaseq/NR015_AD015_S6_L002_R2_001_val_2_val_2.fastq 
i1 /gscratch/scrubbed/samwhite/20180129_trimgalore_geoduck_novaseq/NR019_S7_L001_R1_001_val_1_val_1.fastq 
i2 /gscratch/scrubbed/samwhite/20180129_trimgalore_geoduck_novaseq/NR019_S7_L001_R2_001_val_2_val_2.fastq 
i1 /gscratch/scrubbed/samwhite/20180129_trimgalore_geoduck_novaseq/NR019_S7_L002_R1_001_val_1_val_1.fastq 
i2 /gscratch/scrubbed/samwhite/20180129_trimgalore_geoduck_novaseq/NR019_S7_L002_R2_001_val_2_val_2.fastq 
i1 /gscratch/scrubbed/samwhite/20180129_trimgalore_geoduck_novaseq/NR021_S8_L001_R1_001_val_1_val_1.fastq 
i2 /gscratch/scrubbed/samwhite/20180129_trimgalore_geoduck_novaseq/NR021_S8_L001_R2_001_val_2_val_2.fastq 
i1 /gscratch/scrubbed/samwhite/20180129_trimgalore_geoduck_novaseq/NR021_S8_L002_R1_001_val_1_val_1.fastq 
i2 /gscratch/scrubbed/samwhite/20180129_trimgalore_geoduck_novaseq/NR021_S8_L002_R2_001_val_2_val_2.fastq 
i1 /gscratch/scrubbed/samwhite/bgi_geoduck/151114_I191_FCH3Y35BCXX_L1_wHAIPI023989-79_1.fastq 
i2 /gscratch/scrubbed/samwhite/bgi_geoduck/151114_I191_FCH3Y35BCXX_L1_wHAIPI023989-79_2.fastq 
i1 /gscratch/scrubbed/samwhite/bgi_geoduck/151114_I191_FCH3Y35BCXX_L2_wHAMPI023988-81_1.fastq 
i2 /gscratch/scrubbed/samwhite/bgi_geoduck/151114_I191_FCH3Y35BCXX_L2_wHAMPI023988-81_2.fastq 
i1 /gscratch/scrubbed/samwhite/bgi_geoduck/151122_I136_FCH3L2FBBXX_L7_wHAXPI023990-97_1.fastq 
i2 /gscratch/scrubbed/samwhite/bgi_geoduck/151122_I136_FCH3L2FBBXX_L7_wHAXPI023990-97_2.fastq 
i1 /gscratch/scrubbed/samwhite/bgi_geoduck/160103_I137_FCH3V5YBBXX_L3_WHPANwalDDAADWAAPEI-101_1.fastq 
i2 /gscratch/scrubbed/samwhite/bgi_geoduck/160103_I137_FCH3V5YBBXX_L3_WHPANwalDDAADWAAPEI-101_2.fastq 
i1 /gscratch/scrubbed/samwhite/bgi_geoduck/160103_I137_FCH3V5YBBXX_L4_WHPANwalDDAADWAAPEI-101_1.fastq 
i2 /gscratch/scrubbed/samwhite/bgi_geoduck/160103_I137_FCH3V5YBBXX_L4_WHPANwalDDAADWAAPEI-101_2.fastq 
i1 /gscratch/scrubbed/samwhite/bgi_geoduck/160103_I137_FCH3V5YBBXX_L5_WHPANwalDDABDLAAPEI-100_1.fastq 
i2 /gscratch/scrubbed/samwhite/bgi_geoduck/160103_I137_FCH3V5YBBXX_L5_WHPANwalDDABDLAAPEI-100_2.fastq 
i1 /gscratch/scrubbed/samwhite/bgi_geoduck/160103_I137_FCH3V5YBBXX_L5_WHPANwalDDACDTAAPEI-102_1.fastq 
i2 /gscratch/scrubbed/samwhite/bgi_geoduck/160103_I137_FCH3V5YBBXX_L5_WHPANwalDDACDTAAPEI-102_2.fastq 
i1 /gscratch/scrubbed/samwhite/bgi_geoduck/160103_I137_FCH3V5YBBXX_L6_WHPANwalDDABDLAAPEI-100_1.fastq 
i2 /gscratch/scrubbed/samwhite/bgi_geoduck/160103_I137_FCH3V5YBBXX_L6_WHPANwalDDABDLAAPEI-100_2.fastq 
i1 /gscratch/scrubbed/samwhite/bgi_geoduck/160103_I137_FCH3V5YBBXX_L6_WHPANwalDDACDTAAPEI-102_1.fastq 
i2 /gscratch/scrubbed/samwhite/bgi_geoduck/160103_I137_FCH3V5YBBXX_L6_WHPANwalDDACDTAAPEI-102_2.fastq 
i1 /gscratch/scrubbed/samwhite/illumina_geoduck_hiseq/20180328_trim_galore_illumina_hiseq_geoduck/Geoduck-NMP-gDNA-1_S1_L001_R1_001_val_1.fastq 
i2 /gscratch/scrubbed/samwhite/illumina_geoduck_hiseq/20180328_trim_galore_illumina_hiseq_geoduck/Geoduck-NMP-gDNA-1_S1_L001_R2_001_val_2.fastq 
i1 /gscratch/scrubbed/samwhite/illumina_geoduck_hiseq/20180328_trim_galore_illumina_hiseq_geoduck/Geoduck-NMP-gDNA-2_S5_L002_R1_001_val_1.fastq 
i2 /gscratch/scrubbed/samwhite/illumina_geoduck_hiseq/20180328_trim_galore_illumina_hiseq_geoduck/Geoduck-NMP-gDNA-2_S5_L002_R2_001_val_2.fastq 
i1 /gscratch/scrubbed/samwhite/illumina_geoduck_hiseq/20180328_trim_galore_illumina_hiseq_geoduck/Geoduck-NMP-gDNA-2to4kb-1_S2_L001_R1_001_val_1.fastq 
i2 /gscratch/scrubbed/samwhite/illumina_geoduck_hiseq/20180328_trim_galore_illumina_hiseq_geoduck/Geoduck-NMP-gDNA-2to4kb-1_S2_L001_R2_001_val_2.fastq 
i1 /gscratch/scrubbed/samwhite/illumina_geoduck_hiseq/20180328_trim_galore_illumina_hiseq_geoduck/Geoduck-NMP-gDNA-2to4kb-2_S6_L002_R1_001_val_1.fastq 
i2 /gscratch/scrubbed/samwhite/illumina_geoduck_hiseq/20180328_trim_galore_illumina_hiseq_geoduck/Geoduck-NMP-gDNA-2to4kb-2_S6_L002_R2_001_val_2.fastq 
i1 /gscratch/scrubbed/samwhite/illumina_geoduck_hiseq/20180328_trim_galore_illumina_hiseq_geoduck/Geoduck-NMP-gDNA-2to4kb-3_S10_L003_R1_001_val_1.fastq 
i2 /gscratch/scrubbed/samwhite/illumina_geoduck_hiseq/20180328_trim_galore_illumina_hiseq_geoduck/Geoduck-NMP-gDNA-2to4kb-3_S10_L003_R2_001_val_2.fastq 
i1 /gscratch/scrubbed/samwhite/illumina_geoduck_hiseq/20180328_trim_galore_illumina_hiseq_geoduck/Geoduck-NMP-gDNA-2to4kb-4_S14_L004_R1_001_val_1.fastq 
i2 /gscratch/scrubbed/samwhite/illumina_geoduck_hiseq/20180328_trim_galore_illumina_hiseq_geoduck/Geoduck-NMP-gDNA-2to4kb-4_S14_L004_R2_001_val_2.fastq 
i1 /gscratch/scrubbed/samwhite/illumina_geoduck_hiseq/20180328_trim_galore_illumina_hiseq_geoduck/Geoduck-NMP-gDNA-2to4kb-5_S18_L005_R1_001_val_1.fastq 
i2 /gscratch/scrubbed/samwhite/illumina_geoduck_hiseq/20180328_trim_galore_illumina_hiseq_geoduck/Geoduck-NMP-gDNA-2to4kb-5_S18_L005_R2_001_val_2.fastq 
i1 /gscratch/scrubbed/samwhite/illumina_geoduck_hiseq/20180328_trim_galore_illumina_hiseq_geoduck/Geoduck-NMP-gDNA-2to4kb-6_S22_L006_R1_001_val_1.fastq 
i2 /gscratch/scrubbed/samwhite/illumina_geoduck_hiseq/20180328_trim_galore_illumina_hiseq_geoduck/Geoduck-NMP-gDNA-2to4kb-6_S22_L006_R2_001_val_2.fastq 
i1 /gscratch/scrubbed/samwhite/illumina_geoduck_hiseq/20180328_trim_galore_illumina_hiseq_geoduck/Geoduck-NMP-gDNA-2to4kb-7_S26_L007_R1_001_val_1.fastq 
i2 /gscratch/scrubbed/samwhite/illumina_geoduck_hiseq/20180328_trim_galore_illumina_hiseq_geoduck/Geoduck-NMP-gDNA-2to4kb-7_S26_L007_R2_001_val_2.fastq 
i1 /gscratch/scrubbed/samwhite/illumina_geoduck_hiseq/20180328_trim_galore_illumina_hiseq_geoduck/Geoduck-NMP-gDNA-2to4kb-8_S30_L008_R1_001_val_1.fastq 
i2 /gscratch/scrubbed/samwhite/illumina_geoduck_hiseq/20180328_trim_galore_illumina_hiseq_geoduck/Geoduck-NMP-gDNA-2to4kb-8_S30_L008_R2_001_val_2.fastq 
i1 /gscratch/scrubbed/samwhite/illumina_geoduck_hiseq/20180328_trim_galore_illumina_hiseq_geoduck/Geoduck-NMP-gDNA-3_S9_L003_R1_001_val_1.fastq 
i2 /gscratch/scrubbed/samwhite/illumina_geoduck_hiseq/20180328_trim_galore_illumina_hiseq_geoduck/Geoduck-NMP-gDNA-3_S9_L003_R2_001_val_2.fastq 
i1 /gscratch/scrubbed/samwhite/illumina_geoduck_hiseq/20180328_trim_galore_illumina_hiseq_geoduck/Geoduck-NMP-gDNA-4_S13_L004_R1_001_val_1.fastq 
i2 /gscratch/scrubbed/samwhite/illumina_geoduck_hiseq/20180328_trim_galore_illumina_hiseq_geoduck/Geoduck-NMP-gDNA-4_S13_L004_R2_001_val_2.fastq 
i1 /gscratch/scrubbed/samwhite/illumina_geoduck_hiseq/20180328_trim_galore_illumina_hiseq_geoduck/Geoduck-NMP-gDNA-5_S17_L005_R1_001_val_1.fastq 
i2 /gscratch/scrubbed/samwhite/illumina_geoduck_hiseq/20180328_trim_galore_illumina_hiseq_geoduck/Geoduck-NMP-gDNA-5_S17_L005_R2_001_val_2.fastq 
i1 /gscratch/scrubbed/samwhite/illumina_geoduck_hiseq/20180328_trim_galore_illumina_hiseq_geoduck/Geoduck-NMP-gDNA-5to7kb-1_S3_L001_R1_001_val_1.fastq 
i2 /gscratch/scrubbed/samwhite/illumina_geoduck_hiseq/20180328_trim_galore_illumina_hiseq_geoduck/Geoduck-NMP-gDNA-5to7kb-1_S3_L001_R2_001_val_2.fastq 
i1 /gscratch/scrubbed/samwhite/illumina_geoduck_hiseq/20180328_trim_galore_illumina_hiseq_geoduck/Geoduck-NMP-gDNA-5to7kb-2_S7_L002_R1_001_val_1.fastq 
i2 /gscratch/scrubbed/samwhite/illumina_geoduck_hiseq/20180328_trim_galore_illumina_hiseq_geoduck/Geoduck-NMP-gDNA-5to7kb-2_S7_L002_R2_001_val_2.fastq 
i1 /gscratch/scrubbed/samwhite/illumina_geoduck_hiseq/20180328_trim_galore_illumina_hiseq_geoduck/Geoduck-NMP-gDNA-5to7kb-3_S11_L003_R1_001_val_1.fastq 
i2 /gscratch/scrubbed/samwhite/illumina_geoduck_hiseq/20180328_trim_galore_illumina_hiseq_geoduck/Geoduck-NMP-gDNA-5to7kb-3_S11_L003_R2_001_val_2.fastq 
i1 /gscratch/scrubbed/samwhite/illumina_geoduck_hiseq/20180328_trim_galore_illumina_hiseq_geoduck/Geoduck-NMP-gDNA-5to7kb-4_S15_L004_R1_001_val_1.fastq 
i2 /gscratch/scrubbed/samwhite/illumina_geoduck_hiseq/20180328_trim_galore_illumina_hiseq_geoduck/Geoduck-NMP-gDNA-5to7kb-4_S15_L004_R2_001_val_2.fastq 
i1 /gscratch/scrubbed/samwhite/illumina_geoduck_hiseq/20180328_trim_galore_illumina_hiseq_geoduck/Geoduck-NMP-gDNA-5to7kb-5_S19_L005_R1_001_val_1.fastq 
i2 /gscratch/scrubbed/samwhite/illumina_geoduck_hiseq/20180328_trim_galore_illumina_hiseq_geoduck/Geoduck-NMP-gDNA-5to7kb-5_S19_L005_R2_001_val_2.fastq 
i1 /gscratch/scrubbed/samwhite/illumina_geoduck_hiseq/20180328_trim_galore_illumina_hiseq_geoduck/Geoduck-NMP-gDNA-5to7kb-6_S23_L006_R1_001_val_1.fastq 
i2 /gscratch/scrubbed/samwhite/illumina_geoduck_hiseq/20180328_trim_galore_illumina_hiseq_geoduck/Geoduck-NMP-gDNA-5to7kb-6_S23_L006_R2_001_val_2.fastq 
i1 /gscratch/scrubbed/samwhite/illumina_geoduck_hiseq/20180328_trim_galore_illumina_hiseq_geoduck/Geoduck-NMP-gDNA-5to7kb-7_S27_L007_R1_001_val_1.fastq 
i2 /gscratch/scrubbed/samwhite/illumina_geoduck_hiseq/20180328_trim_galore_illumina_hiseq_geoduck/Geoduck-NMP-gDNA-5to7kb-7_S27_L007_R2_001_val_2.fastq 
i1 /gscratch/scrubbed/samwhite/illumina_geoduck_hiseq/20180328_trim_galore_illumina_hiseq_geoduck/Geoduck-NMP-gDNA-5to7kb-8_S31_L008_R1_001_val_1.fastq 
i2 /gscratch/scrubbed/samwhite/illumina_geoduck_hiseq/20180328_trim_galore_illumina_hiseq_geoduck/Geoduck-NMP-gDNA-5to7kb-8_S31_L008_R2_001_val_2.fastq 
i1 /gscratch/scrubbed/samwhite/illumina_geoduck_hiseq/20180328_trim_galore_illumina_hiseq_geoduck/Geoduck-NMP-gDNA-6_S21_L006_R1_001_val_1.fastq 
i2 /gscratch/scrubbed/samwhite/illumina_geoduck_hiseq/20180328_trim_galore_illumina_hiseq_geoduck/Geoduck-NMP-gDNA-6_S21_L006_R2_001_val_2.fastq 
i1 /gscratch/scrubbed/samwhite/illumina_geoduck_hiseq/20180328_trim_galore_illumina_hiseq_geoduck/Geoduck-NMP-gDNA-7_S25_L007_R1_001_val_1.fastq 
i2 /gscratch/scrubbed/samwhite/illumina_geoduck_hiseq/20180328_trim_galore_illumina_hiseq_geoduck/Geoduck-NMP-gDNA-7_S25_L007_R2_001_val_2.fastq 
i1 /gscratch/scrubbed/samwhite/illumina_geoduck_hiseq/20180328_trim_galore_illumina_hiseq_geoduck/Geoduck-NMP-gDNA-8_S29_L008_R1_001_val_1.fastq 
i2 /gscratch/scrubbed/samwhite/illumina_geoduck_hiseq/20180328_trim_galore_illumina_hiseq_geoduck/Geoduck-NMP-gDNA-8_S29_L008_R2_001_val_2.fastq 
i1 /gscratch/scrubbed/samwhite/illumina_geoduck_hiseq/20180328_trim_galore_illumina_hiseq_geoduck/Geoduck-NMP-gDNA-8to10kb-1_S4_L001_R1_001_val_1.fastq 
i2 /gscratch/scrubbed/samwhite/illumina_geoduck_hiseq/20180328_trim_galore_illumina_hiseq_geoduck/Geoduck-NMP-gDNA-8to10kb-1_S4_L001_R2_001_val_2.fastq 
i1 /gscratch/scrubbed/samwhite/illumina_geoduck_hiseq/20180328_trim_galore_illumina_hiseq_geoduck/Geoduck-NMP-gDNA-8to10kb-2_S8_L002_R1_001_val_1.fastq 
i2 /gscratch/scrubbed/samwhite/illumina_geoduck_hiseq/20180328_trim_galore_illumina_hiseq_geoduck/Geoduck-NMP-gDNA-8to10kb-2_S8_L002_R2_001_val_2.fastq 
i1 /gscratch/scrubbed/samwhite/illumina_geoduck_hiseq/20180328_trim_galore_illumina_hiseq_geoduck/Geoduck-NMP-gDNA-8to10kb-3_S12_L003_R1_001_val_1.fastq 
i2 /gscratch/scrubbed/samwhite/illumina_geoduck_hiseq/20180328_trim_galore_illumina_hiseq_geoduck/Geoduck-NMP-gDNA-8to10kb-3_S12_L003_R2_001_val_2.fastq 
i1 /gscratch/scrubbed/samwhite/illumina_geoduck_hiseq/20180328_trim_galore_illumina_hiseq_geoduck/Geoduck-NMP-gDNA-8to10kb-4_S16_L004_R1_001_val_1.fastq 
i2 /gscratch/scrubbed/samwhite/illumina_geoduck_hiseq/20180328_trim_galore_illumina_hiseq_geoduck/Geoduck-NMP-gDNA-8to10kb-4_S16_L004_R2_001_val_2.fastq 
i1 /gscratch/scrubbed/samwhite/illumina_geoduck_hiseq/20180328_trim_galore_illumina_hiseq_geoduck/Geoduck-NMP-gDNA-8to10kb-5_S20_L005_R1_001_val_1.fastq 
i2 /gscratch/scrubbed/samwhite/illumina_geoduck_hiseq/20180328_trim_galore_illumina_hiseq_geoduck/Geoduck-NMP-gDNA-8to10kb-5_S20_L005_R2_001_val_2.fastq 
i1 /gscratch/scrubbed/samwhite/illumina_geoduck_hiseq/20180328_trim_galore_illumina_hiseq_geoduck/Geoduck-NMP-gDNA-8to10kb-6_S24_L006_R1_001_val_1.fastq 
i2 /gscratch/scrubbed/samwhite/illumina_geoduck_hiseq/20180328_trim_galore_illumina_hiseq_geoduck/Geoduck-NMP-gDNA-8to10kb-6_S24_L006_R2_001_val_2.fastq 
i1 /gscratch/scrubbed/samwhite/illumina_geoduck_hiseq/20180328_trim_galore_illumina_hiseq_geoduck/Geoduck-NMP-gDNA-8to10kb-7_S28_L007_R1_001_val_1.fastq 
i2 /gscratch/scrubbed/samwhite/illumina_geoduck_hiseq/20180328_trim_galore_illumina_hiseq_geoduck/Geoduck-NMP-gDNA-8to10kb-7_S28_L007_R2_001_val_2.fastq 
i1 /gscratch/scrubbed/samwhite/illumina_geoduck_hiseq/20180328_trim_galore_illumina_hiseq_geoduck/Geoduck-NMP-gDNA-8to10kb-8_S32_L008_R1_001_val_1.fastq 
i2 /gscratch/scrubbed/samwhite/illumina_geoduck_hiseq/20180328_trim_galore_illumina_hiseq_geoduck/Geoduck-NMP-gDNA-8to10kb-8_S32_L008_R2_001_val_2.fastq
Results:

Output folder: 20180405_sparseassembler_kmer101_geoduck/

Slurm output files:

SparseAssembler Assembly (FASTA): Contigs.txt

Added this to the GitHub wiki for our geoduck genome assemblies.

Share

Assembly – Geoduck NovaSeq using SparseAssembler kmer = 101

The prior run used a kmer size of 61, and the resulting assembly was rather poor (small N50).

For this run, I arbitrarily increased the kmer size to 101, in hopes that this will improve the assembly.

The job was run on our Mox node.

Here’s the batch script to initiate the job:

20180322_SparseAssembler_novaseq_geoduck_slurm.sh


#SBATCH --job-name=20180322_sparse_assembler_geo_novaseq
## Allocation Definition 
#SBATCH --account=srlab
#SBATCH --partition=srlab
## Resources
## Nodes (We only get 1, so this is fixed)
#SBATCH --nodes=1   
## Walltime (days-hours:minutes:seconds format)
#SBATCH --time=30-00:00:00
## Memory per node
#SBATCH --mem=500G
##turn on e-mail notification
#SBATCH --mail-type=ALL
#SBATCH --mail-user=samwhite@uw.edu
## Specify the working directory for this job
#SBATCH --workdir=/gscratch/scrubbed/samwhite/20180322_SparseAssembler_novaseq_geoduck

/gscratch/srlab/programs/SparseAssembler/SparseAssembler 
LD 0 
NodeCovTh 1 
EdgeCovTh 0 
k 101 
g 15 
PathCovTh 100 
GS 2200000000 
i1 /gscratch/scrubbed/samwhite/20180129_trimmed_again/AD002_S9_L001_R1_001_val_1_val_1.fastq 
i2 /gscratch/scrubbed/samwhite/20180129_trimmed_again/AD002_S9_L001_R2_001_val_2_val_2.fastq 
i1 /gscratch/scrubbed/samwhite/20180129_trimmed_again/AD002_S9_L002_R1_001_val_1_val_1.fastq 
i2 /gscratch/scrubbed/samwhite/20180129_trimmed_again/AD002_S9_L002_R2_001_val_2_val_2.fastq 
i1 /gscratch/scrubbed/samwhite/20180129_trimmed_again/NR005_S4_L001_R1_001_val_1_val_1.fastq 
i2 /gscratch/scrubbed/samwhite/20180129_trimmed_again/NR005_S4_L001_R2_001_val_2_val_2.fastq 
i1 /gscratch/scrubbed/samwhite/20180129_trimmed_again/NR005_S4_L002_R1_001_val_1_val_1.fastq 
i2 /gscratch/scrubbed/samwhite/20180129_trimmed_again/NR005_S4_L002_R2_001_val_2_val_2.fastq 
i1 /gscratch/scrubbed/samwhite/20180129_trimmed_again/NR006_S3_L001_R1_001_val_1_val_1.fastq 
i2 /gscratch/scrubbed/samwhite/20180129_trimmed_again/NR006_S3_L001_R2_001_val_2_val_2.fastq 
i1 /gscratch/scrubbed/samwhite/20180129_trimmed_again/NR006_S3_L002_R1_001_val_1_val_1.fastq 
i2 /gscratch/scrubbed/samwhite/20180129_trimmed_again/NR006_S3_L002_R2_001_val_2_val_2.fastq 
i1 /gscratch/scrubbed/samwhite/20180129_trimmed_again/NR012_S1_L001_R1_001_val_1_val_1.fastq 
i2 /gscratch/scrubbed/samwhite/20180129_trimmed_again/NR012_S1_L001_R2_001_val_2_val_2.fastq 
i1 /gscratch/scrubbed/samwhite/20180129_trimmed_again/NR012_S1_L002_R1_001_val_1_val_1.fastq 
i2 /gscratch/scrubbed/samwhite/20180129_trimmed_again/NR012_S1_L002_R2_001_val_2_val_2.fastq 
i1 /gscratch/scrubbed/samwhite/20180129_trimmed_again/NR013_AD013_S2_L001_R1_001_val_1_val_1.fastq 
i2 /gscratch/scrubbed/samwhite/20180129_trimmed_again/NR013_AD013_S2_L001_R2_001_val_2_val_2.fastq 
i1 /gscratch/scrubbed/samwhite/20180129_trimmed_again/NR013_AD013_S2_L002_R1_001_val_1_val_1.fastq 
i2 /gscratch/scrubbed/samwhite/20180129_trimmed_again/NR013_AD013_S2_L002_R2_001_val_2_val_2.fastq 
i1 /gscratch/scrubbed/samwhite/20180129_trimmed_again/NR014_AD014_S5_L001_R1_001_val_1_val_1.fastq 
i2 /gscratch/scrubbed/samwhite/20180129_trimmed_again/NR014_AD014_S5_L001_R2_001_val_2_val_2.fastq 
i1 /gscratch/scrubbed/samwhite/20180129_trimmed_again/NR014_AD014_S5_L002_R1_001_val_1_val_1.fastq 
i2 /gscratch/scrubbed/samwhite/20180129_trimmed_again/NR014_AD014_S5_L002_R2_001_val_2_val_2.fastq 
i1 /gscratch/scrubbed/samwhite/20180129_trimmed_again/NR015_AD015_S6_L001_R1_001_val_1_val_1.fastq 
i2 /gscratch/scrubbed/samwhite/20180129_trimmed_again/NR015_AD015_S6_L001_R2_001_val_2_val_2.fastq 
i1 /gscratch/scrubbed/samwhite/20180129_trimmed_again/NR015_AD015_S6_L002_R1_001_val_1_val_1.fastq 
i2 /gscratch/scrubbed/samwhite/20180129_trimmed_again/NR015_AD015_S6_L002_R2_001_val_2_val_2.fastq 
i1 /gscratch/scrubbed/samwhite/20180129_trimmed_again/NR019_S7_L001_R1_001_val_1_val_1.fastq 
i2 /gscratch/scrubbed/samwhite/20180129_trimmed_again/NR019_S7_L001_R2_001_val_2_val_2.fastq 
i1 /gscratch/scrubbed/samwhite/20180129_trimmed_again/NR019_S7_L002_R1_001_val_1_val_1.fastq 
i2 /gscratch/scrubbed/samwhite/20180129_trimmed_again/NR019_S7_L002_R2_001_val_2_val_2.fastq 
i1 /gscratch/scrubbed/samwhite/20180129_trimmed_again/NR021_S8_L001_R1_001_val_1_val_1.fastq 
i2 /gscratch/scrubbed/samwhite/20180129_trimmed_again/NR021_S8_L001_R2_001_val_2_val_2.fastq 
i1 /gscratch/scrubbed/samwhite/20180129_trimmed_again/NR021_S8_L002_R1_001_val_1_val_1.fastq 
i2 /gscratch/scrubbed/samwhite/20180129_trimmed_again/NR021_S8_L002_R2_001_val_2_val_2.fastq
Results

Output folder: 20180322_SparseAssembler_novaseq_geoduck/

This completed much more quickly than the previous run (kmer = 61). The previous assembly took ~10 days, while this assembly completed in ~4 days!

The primary output file of interest is this FASTA file:

In order to get a rough idea of how this assembly looks, I ran it through Quast Version: 4.5, 15ca3b9:

python software/quast-4.5/quast.py
-t 16
/mnt/owl/Athaliana/20180322_SparseAssembler_novaseq_geoduck/Contigs.txt

Quast output folder: results_2018_03_27_08_25_52/

Here’re the stats on the assembly:

Quast output (text): results_2018_03_27_08_25_52/report.txt

Quast output (HTML):results_2018_03_27_08_25_52/report.html

This is definitely a better assembly than the kmer = 61 assembly.

N50 = 1149

Also, there’s a single, large contig of 56,361bp, and 54 contigs > 25,000bp. This is good.

Admittedly, I’m a little surprised (and, disappointed) the N50 is as small as it is. However, we have a pretty decent assembly on our hands!

Since SparseAssembler seems to actually run (and, relatively quickly), I’m very tempted to just throw ALL of our geoduck data at it and see how it turns out…

Share

Assembly – Geoduck NovaSeq using SparseAssembler (TL;DR – it worked!)

The prior attempt using SparseAssembler failed due to a kmer size that was deemed too large.

For this run, I arbitrarily reduced the kmer size by ~half (k 61) in hopes that this will just get through an assembly. We can potentially explore the effects of kmer size on assemblies if/when this runs and depending no how the assembly looks.

The job was run on our Mox node.

Here’s the batch script to initiate the job:

#!/bin/bash
## Job Name
#SBATCH --job-name=20180313_sparse_assembler_geo_novaseq
## Allocation Definition
#SBATCH --account=srlab
#SBATCH --partition=srlab
## Resources
## Nodes (We only get 1, so this is fixed)
#SBATCH --nodes=1
## Walltime (days-hours:minutes:seconds format)
#SBATCH --time=30-00:00:00
## Memory per node
#SBATCH --mem=500G
##turn on e-mail notification
#SBATCH --mail-type=ALL
#SBATCH --mail-user=samwhite@uw.edu
## Specify the working directory for this job
#SBATCH --workdir=/gscratch/scrubbed/samwhite/20180312_SparseAssembler_novaseq_geoduck

/gscratch/srlab/programs/SparseAssembler/SparseAssembler 
LD 0 
NodeCovTh 1 
EdgeCovTh 0 
k 61 
g 15 
PathCovTh 100 
GS 2200000000 
i1 /gscratch/scrubbed/samwhite/20180129_trimmed_again/AD002_S9_L001_R1_001_val_1_val_1.fastq 
i2 /gscratch/scrubbed/samwhite/20180129_trimmed_again/AD002_S9_L001_R2_001_val_2_val_2.fastq 
i1 /gscratch/scrubbed/samwhite/20180129_trimmed_again/AD002_S9_L002_R1_001_val_1_val_1.fastq 
i2 /gscratch/scrubbed/samwhite/20180129_trimmed_again/AD002_S9_L002_R2_001_val_2_val_2.fastq 
i1 /gscratch/scrubbed/samwhite/20180129_trimmed_again/NR005_S4_L001_R1_001_val_1_val_1.fastq 
i2 /gscratch/scrubbed/samwhite/20180129_trimmed_again/NR005_S4_L001_R2_001_val_2_val_2.fastq 
i1 /gscratch/scrubbed/samwhite/20180129_trimmed_again/NR005_S4_L002_R1_001_val_1_val_1.fastq 
i2 /gscratch/scrubbed/samwhite/20180129_trimmed_again/NR005_S4_L002_R2_001_val_2_val_2.fastq 
i1 /gscratch/scrubbed/samwhite/20180129_trimmed_again/NR006_S3_L001_R1_001_val_1_val_1.fastq 
i2 /gscratch/scrubbed/samwhite/20180129_trimmed_again/NR006_S3_L001_R2_001_val_2_val_2.fastq 
i1 /gscratch/scrubbed/samwhite/20180129_trimmed_again/NR006_S3_L002_R1_001_val_1_val_1.fastq 
i2 /gscratch/scrubbed/samwhite/20180129_trimmed_again/NR006_S3_L002_R2_001_val_2_val_2.fastq 
i1 /gscratch/scrubbed/samwhite/20180129_trimmed_again/NR012_S1_L001_R1_001_val_1_val_1.fastq 
i2 /gscratch/scrubbed/samwhite/20180129_trimmed_again/NR012_S1_L001_R2_001_val_2_val_2.fastq 
i1 /gscratch/scrubbed/samwhite/20180129_trimmed_again/NR012_S1_L002_R1_001_val_1_val_1.fastq 
i2 /gscratch/scrubbed/samwhite/20180129_trimmed_again/NR012_S1_L002_R2_001_val_2_val_2.fastq 
i1 /gscratch/scrubbed/samwhite/20180129_trimmed_again/NR013_AD013_S2_L001_R1_001_val_1_val_1.fastq 
i2 /gscratch/scrubbed/samwhite/20180129_trimmed_again/NR013_AD013_S2_L001_R2_001_val_2_val_2.fastq 
i1 /gscratch/scrubbed/samwhite/20180129_trimmed_again/NR013_AD013_S2_L002_R1_001_val_1_val_1.fastq 
i2 /gscratch/scrubbed/samwhite/20180129_trimmed_again/NR013_AD013_S2_L002_R2_001_val_2_val_2.fastq 
i1 /gscratch/scrubbed/samwhite/20180129_trimmed_again/NR014_AD014_S5_L001_R1_001_val_1_val_1.fastq 
i2 /gscratch/scrubbed/samwhite/20180129_trimmed_again/NR014_AD014_S5_L001_R2_001_val_2_val_2.fastq 
i1 /gscratch/scrubbed/samwhite/20180129_trimmed_again/NR014_AD014_S5_L002_R1_001_val_1_val_1.fastq 
i2 /gscratch/scrubbed/samwhite/20180129_trimmed_again/NR014_AD014_S5_L002_R2_001_val_2_val_2.fastq 
i1 /gscratch/scrubbed/samwhite/20180129_trimmed_again/NR015_AD015_S6_L001_R1_001_val_1_val_1.fastq 
i2 /gscratch/scrubbed/samwhite/20180129_trimmed_again/NR015_AD015_S6_L001_R2_001_val_2_val_2.fastq 
i1 /gscratch/scrubbed/samwhite/20180129_trimmed_again/NR015_AD015_S6_L002_R1_001_val_1_val_1.fastq 
i2 /gscratch/scrubbed/samwhite/20180129_trimmed_again/NR015_AD015_S6_L002_R2_001_val_2_val_2.fastq 
i1 /gscratch/scrubbed/samwhite/20180129_trimmed_again/NR019_S7_L001_R1_001_val_1_val_1.fastq 
i2 /gscratch/scrubbed/samwhite/20180129_trimmed_again/NR019_S7_L001_R2_001_val_2_val_2.fastq 
i1 /gscratch/scrubbed/samwhite/20180129_trimmed_again/NR019_S7_L002_R1_001_val_1_val_1.fastq 
i2 /gscratch/scrubbed/samwhite/20180129_trimmed_again/NR019_S7_L002_R2_001_val_2_val_2.fastq 
i1 /gscratch/scrubbed/samwhite/20180129_trimmed_again/NR021_S8_L001_R1_001_val_1_val_1.fastq 
i2 /gscratch/scrubbed/samwhite/20180129_trimmed_again/NR021_S8_L001_R2_001_val_2_val_2.fastq 
i1 /gscratch/scrubbed/samwhite/20180129_trimmed_again/NR021_S8_L002_R1_001_val_1_val_1.fastq 
i2 /gscratch/scrubbed/samwhite/20180129_trimmed_again/NR021_S8_L002_R2_001_val_2_val_2.fastq
Results

Output folder: 20180312_SparseAssembler_novaseq_geoduck

IT WORKED!!! At last; we have an assembly of the geoduck NovaSeq data!! It took ~10days to complete.

The primary output file of interest is this FASTA file:

In order to get a rough idea of how this assembly looks, I ran it through Quast Version: 4.5, 15ca3b9:

python software/quast-4.5/quast.py
-t 16
/mnt/owl/Athaliana/20180312_SparseAssembler_novaseq_geoduck/Contigs.txt

Quast output folder: results_2018_03_22_08_12_12

Here’re the stats on the assembly:

Quast output (text): results_2018_03_22_08_12_12/report.txt

Quast output (HTML):results_2018_03_22_08_12_12/report.html

Overall, the assembly doesn’t look great. The N50 = 645 is really, really low. One would hope for a much large number for a quality assembly. As it stands, this assembly is comprised of many small contigs.

Looks like we’ll have to fiddle with the kmer size used for SparseAssembler and see if we can improve upon this.

Despite that, it’s an accomplishment to finally get any sort of assembler to run to completion for this data set!

Share

Assembly – Geoduck NovaSeq using SparseAssembler (failed)

Steven came across a 2012 paper in BMC Bioinformatics (“Exploiting sparseness in de novo genome assembly”) that utilized an assembly program we hadn’t previously encountered: SparseAssembler

This software is intended to greatly reduce the required amount of RAM necessary to process very large assembly data sets. As I previously learned, RAM is a limiting factor for assembly programs, and the install (if you can even call it that) was simply upacking a zip file (program installations on Mox are not trivialso this seems like it has promise!

The job was run on our Mox node.

Here’s the batch script to initiate the job:

20180308_soap_novaseq_geoduck_slurm.sh

#!/bin/bash
## Job Name
#SBATCH --job-name=20180308_sparse_assembler_geo_novaseq
## Allocation Definition 
#SBATCH --account=srlab
#SBATCH --partition=srlab
## Resources
## Nodes (We only get 1, so this is fixed)
#SBATCH --nodes=1   
## Walltime (days-hours:minutes:seconds format)
#SBATCH --time=30-00:00:00
## Memory per node
#SBATCH --mem=500G
##turn on e-mail notification
#SBATCH --mail-type=ALL
#SBATCH --mail-user=samwhite@uw.edu
## Specify the working directory for this job
#SBATCH --workdir=/gscratch/scrubbed/samwhite/20180308_SparseAssembler_novaseq_geoduck

/gscratch/srlab/programs/SparseAssembler/SparseAssembler 
LD 0 
NodeCovTh 1 
EdgeCovTh 0 
k 117 
g 15 
PathCovTh 100 
GS 2200000000 
i1 /gscratch/scrubbed/samwhite/20180129_trimmed_again/AD002_S9_L001_R1_001_val_1_val_1.fastq 
i2 /gscratch/scrubbed/samwhite/20180129_trimmed_again/AD002_S9_L001_R2_001_val_2_val_2.fastq 
i1 /gscratch/scrubbed/samwhite/20180129_trimmed_again/AD002_S9_L002_R1_001_val_1_val_1.fastq 
i2 /gscratch/scrubbed/samwhite/20180129_trimmed_again/AD002_S9_L002_R2_001_val_2_val_2.fastq 
i1 /gscratch/scrubbed/samwhite/20180129_trimmed_again/NR005_S4_L001_R1_001_val_1_val_1.fastq 
i2 /gscratch/scrubbed/samwhite/20180129_trimmed_again/NR005_S4_L001_R2_001_val_2_val_2.fastq 
i1 /gscratch/scrubbed/samwhite/20180129_trimmed_again/NR005_S4_L002_R1_001_val_1_val_1.fastq 
i2 /gscratch/scrubbed/samwhite/20180129_trimmed_again/NR005_S4_L002_R2_001_val_2_val_2.fastq 
i1 /gscratch/scrubbed/samwhite/20180129_trimmed_again/NR006_S3_L001_R1_001_val_1_val_1.fastq 
i2 /gscratch/scrubbed/samwhite/20180129_trimmed_again/NR006_S3_L001_R2_001_val_2_val_2.fastq 
i1 /gscratch/scrubbed/samwhite/20180129_trimmed_again/NR006_S3_L002_R1_001_val_1_val_1.fastq 
i2 /gscratch/scrubbed/samwhite/20180129_trimmed_again/NR006_S3_L002_R2_001_val_2_val_2.fastq 
i1 /gscratch/scrubbed/samwhite/20180129_trimmed_again/NR012_S1_L001_R1_001_val_1_val_1.fastq 
i2 /gscratch/scrubbed/samwhite/20180129_trimmed_again/NR012_S1_L001_R2_001_val_2_val_2.fastq 
i1 /gscratch/scrubbed/samwhite/20180129_trimmed_again/NR012_S1_L002_R1_001_val_1_val_1.fastq 
i2 /gscratch/scrubbed/samwhite/20180129_trimmed_again/NR012_S1_L002_R2_001_val_2_val_2.fastq 
i1 /gscratch/scrubbed/samwhite/20180129_trimmed_again/NR013_AD013_S2_L001_R1_001_val_1_val_1.fastq 
i2 /gscratch/scrubbed/samwhite/20180129_trimmed_again/NR013_AD013_S2_L001_R2_001_val_2_val_2.fastq 
i1 /gscratch/scrubbed/samwhite/20180129_trimmed_again/NR013_AD013_S2_L002_R1_001_val_1_val_1.fastq 
i2 /gscratch/scrubbed/samwhite/20180129_trimmed_again/NR013_AD013_S2_L002_R2_001_val_2_val_2.fastq 
i1 /gscratch/scrubbed/samwhite/20180129_trimmed_again/NR014_AD014_S5_L001_R1_001_val_1_val_1.fastq 
i2 /gscratch/scrubbed/samwhite/20180129_trimmed_again/NR014_AD014_S5_L001_R2_001_val_2_val_2.fastq 
i1 /gscratch/scrubbed/samwhite/20180129_trimmed_again/NR014_AD014_S5_L002_R1_001_val_1_val_1.fastq 
i2 /gscratch/scrubbed/samwhite/20180129_trimmed_again/NR014_AD014_S5_L002_R2_001_val_2_val_2.fastq 
i1 /gscratch/scrubbed/samwhite/20180129_trimmed_again/NR015_AD015_S6_L001_R1_001_val_1_val_1.fastq 
i2 /gscratch/scrubbed/samwhite/20180129_trimmed_again/NR015_AD015_S6_L001_R2_001_val_2_val_2.fastq 
i1 /gscratch/scrubbed/samwhite/20180129_trimmed_again/NR015_AD015_S6_L002_R1_001_val_1_val_1.fastq 
i2 /gscratch/scrubbed/samwhite/20180129_trimmed_again/NR015_AD015_S6_L002_R2_001_val_2_val_2.fastq 
i1 /gscratch/scrubbed/samwhite/20180129_trimmed_again/NR019_S7_L001_R1_001_val_1_val_1.fastq 
i2 /gscratch/scrubbed/samwhite/20180129_trimmed_again/NR019_S7_L001_R2_001_val_2_val_2.fastq 
i1 /gscratch/scrubbed/samwhite/20180129_trimmed_again/NR019_S7_L002_R1_001_val_1_val_1.fastq 
i2 /gscratch/scrubbed/samwhite/20180129_trimmed_again/NR019_S7_L002_R2_001_val_2_val_2.fastq 
i1 /gscratch/scrubbed/samwhite/20180129_trimmed_again/NR021_S8_L001_R1_001_val_1_val_1.fastq 
i2 /gscratch/scrubbed/samwhite/20180129_trimmed_again/NR021_S8_L001_R2_001_val_2_val_2.fastq 
i1 /gscratch/scrubbed/samwhite/20180129_trimmed_again/NR021_S8_L002_R1_001_val_1_val_1.fastq 
i2 /gscratch/scrubbed/samwhite/20180129_trimmed_again/NR021_S8_L002_R2_001_val_2_val_2.fastq
Results

Output folder: 20180308_SparseAssembler_novaseq_geoduck/

Well, this failed, but not because of memory issues (which is a good start)!

Instead, it failed because the kmer size was too large??!!

See the slurm output log file:

Kmergenie had indicated a kmer size of 117bp.

Will reduce kmer size and try again. Fingers crossed…

Share

NovaSeq Assembly – The Struggle is Real – Real Annoying!

Well, I continue to struggle to makek progress on assembling the geoduck Illumina NovaSeq data. Granted, there is a ton of data (374GB!!!!), but it’s still frustrating that we can’t get an assembly anywhere…

Here are some of the struggles so far:

Meraculous:

SOAPdenovo2

JR-Assembler

  • Can’t install one of the dependencies (SOAP error correction)
  • Actually, I need to try the binary version of this, instead of the source version (the source version fails at the make step)

So, next up will trying the following two assemblers:

  • JR-Assembler: Will see if SOAPec binary will work, and then run an assembly.
  • AllPaths-LG: I was able to install this successfully on Mox.

Additionally, we’ve ordered some additional hard drives and will be converting the old head/master node on the Apple Xserve cluster to Linux. The old master node is a little better equipped than the other Apple Xserve “birds”, so will try to re-run Meraculous on it once we get it converted.

Share

Assembly – Geoduck Illumina NovaSeq SOAPdenovo2 on Mox (FAIL)

Trying to get the NovaSeq data assembled using SOAPdenovo2 on the Mox HPC node we have and it will not work.

Tried a couple of times and it hasn’t run successfully. Here are links to the files used on Mox (including the batch script and slurm output files). I made slight changes to the formatting of the batch script because I thought there was something wrong. Specifically, the slurm output file in the 20180215 runs does not accurately reflect the command I issued (i.e. 1> ass.log is command, but slurm shows > ass.log).

NOTE: In the 20180218 run, I have excluded transferring the core dump file due to its crazy size:

Here’s the error log generated by SOAPdenovo2 in the 20180218 run (the last line is all you really need to see, though):

Version 2.04: released on July 13th, 2012
Compile May 10 2017 12:50:52

********************
Pregraph
********************

Parameters: pregraph -s /gscratch/scrubbed/samwhite/20180218_soapdenovo2_novaseq_geoduck/soap_config -K 117 -p 24 -o /gscratch/scrubbed/samwhite/20180218_soapdenovo2_novaseq_geoduck/ 

In /gscratch/scrubbed/samwhite/20180218_soapdenovo2_novaseq_geoduck/soap_config, 1 lib(s), maximum read length 150, maximum name length 256.

24 thread(s) initialized.
Import reads from file:
 /gscratch/scrubbed/samwhite/20180129_trimmed_again/AD002_S9_L001_R1_001_val_1_val_1.fq.gz
Import reads from file:
 /gscratch/scrubbed/samwhite/20180129_trimmed_again/AD002_S9_L001_R2_001_val_2_val_2.fq.gz
--- 100000000th reads.
--- 200000000th reads.
--- 300000000th reads.
Import reads from file:
 /gscratch/scrubbed/samwhite/20180129_trimmed_again/AD002_S9_L002_R1_001_val_1_val_1.fq.gz
Import reads from file:
 /gscratch/scrubbed/samwhite/20180129_trimmed_again/AD002_S9_L002_R2_001_val_2_val_2.fq.gz
--- 400000000th reads.
--- 500000000th reads.
--- 600000000th reads.
--- 700000000th reads.
Import reads from file:
 /gscratch/scrubbed/samwhite/20180129_trimmed_again/NR005_S4_L001_R1_001_val_1_val_1.fq.gz
Import reads from file:
 /gscratch/scrubbed/samwhite/20180129_trimmed_again/NR005_S4_L001_R2_001_val_2_val_2.fq.gz
Import reads from file:
 /gscratch/scrubbed/samwhite/20180129_trimmed_again/NR005_S4_L002_R1_001_val_1_val_1.fq.gz
Import reads from file:
 /gscratch/scrubbed/samwhite/20180129_trimmed_again/NR005_S4_L002_R2_001_val_2_val_2.fq.gz
Import reads from file:
 /gscratch/scrubbed/samwhite/20180129_trimmed_again/NR006_S3_L001_R1_001_val_1_val_1.fq.gz
Import reads from file:
 /gscratch/scrubbed/samwhite/20180129_trimmed_again/NR006_S3_L001_R2_001_val_2_val_2.fq.gz
Import reads from file:
 /gscratch/scrubbed/samwhite/20180129_trimmed_again/NR006_S3_L002_R1_001_val_1_val_1.fq.gz
Import reads from file:
 /gscratch/scrubbed/samwhite/20180129_trimmed_again/NR006_S3_L002_R2_001_val_2_val_2.fq.gz
Import reads from file:
 /gscratch/scrubbed/samwhite/20180129_trimmed_again/NR012_S1_L001_R1_001_val_1_val_1.fq.gz
Import reads from file:
 /gscratch/scrubbed/samwhite/20180129_trimmed_again/NR012_S1_L001_R2_001_val_2_val_2.fq.gz
Import reads from file:
 /gscratch/scrubbed/samwhite/20180129_trimmed_again/NR012_S1_L002_R1_001_val_1_val_1.fq.gz
Import reads from file:
 /gscratch/scrubbed/samwhite/20180129_trimmed_again/NR012_S1_L002_R2_001_val_2_val_2.fq.gz
Import reads from file:
 /gscratch/scrubbed/samwhite/20180129_trimmed_again/NR013_AD013_S2_L001_R1_001_val_1_val_1.fq.gz
Import reads from file:
 /gscratch/scrubbed/samwhite/20180129_trimmed_again/NR013_AD013_S2_L001_R2_001_val_2_val_2.fq.gz
--- 800000000th reads.
--- 900000000th reads.
-- Out of memory --

I guess I’ll explore some other options for assembling these? I’m having a difficult time accepting that 500GB of RAM is insufficient, but that seems to be the case. Ouch.

Share

NovaSeq Assembly – Trimmed Geoduck NovaSeq with Meraculous

Attempted to use Meraculous to assemble the trimmed geoduck NovaSeq data.

Here’s the Meraculous manual (PDF).

After a bunch of various issues (running out of hard drive space – multiple times, config file issues, typos), I’ve finally given up on running meraculous. It failed, again, saying it couldn’t find a file in a directory that meraculous created! I’ve emailed the authors and if they have an easy fix, I’ll implement it and see what happens.

Anyway, it’s all documented in the Jupyter Notebook below.

One good thing came out of all of it is that I had to run kmergenie to identify an appopriate kmer size to use for assembly, as well as estimated genome size (this info is needed for both meraculous and SOAPdeNovo (which I’ll be trying next)):

kmergenie output folder: http://owl.fish.washington.edu/Athaliana/20180125_geoduck_novaseq/20180206_kmergenie/
kmergenie HTML report (doesn’t display histograms for some reason): 20180206_kmergenie/histograms_report.html
kmer size: 117
Est. genome size: 2.17Gbp

Jupyter Notebook (GitHub): 20180205_roadrunner_meraculous_geoduck_novaseq.ipynb

Share

Adapter Trimming and FASTQC – Illumina Geoduck Novaseq Data

We would like to get an assembly of the geoduck NovaSeq data that Illumina provided us with.

Steven previously ran the raw data through FASTQC and there was a significant amount of adapter contamination (up to 44% in some libraries) present (see his FASTQC report here).

So, I trimmed them using TrimGalore and re-ran FASTQC on them.

This required two rounds of trimming using the “auto-detect” feature of Trim Galore.

  • Round 1: remove NovaSeq adapters
  • Round 2: remove standard Illumina adapters

See Jupyter notebook below for the gritty details.

Results:

All data for this NovaSeq assembly project can be found here: http://owl.fish.washington.edu/Athaliana/20180125_geoduck_novaseq/.

Round 1 Trim Galore reports: [20180125_trim_galore_reports/](http://owl.fish.washington.edu/Athaliana/20180125_geoduck_novaseq/20180125_trim_galore_reports/]
Round 1 FASTQC: 20180129_trimmed_multiqc_fastqc_01
Round 1 FASTQC MultiQC overview: 20180129_trimmed_multiqc_fastqc_01/multiqc_report.html


 
 
 
 
 
 

Round 2 Trim Galore reports: 20180125_geoduck_novaseq/20180205_trim_galore_reports/
Round 2 FASTQC: 20180205_trimmed_fastqc_02/
Round 2 FASTQC MultiQC overview: 20180205_trimmed_multiqc_fastqc_02/multiqc_report.html

 
 
 
 
 
 

For the astute observer, you might notice the “Per Base Sequence Content” generates a “Fail” warning for all samples. Per the FASTQC help, this is likely expected (due to the fact that NovaSeq libraries are prepared using transposases) and doesn’t have any downstream impacts on analyses.

 
 
 
 
 
 

Jupyter Notebook (GitHub): 20180125_roadrunner_trimming_geoduck_novaseq.ipynb

Share