# Genome Assembly – SparseAssembler Geoduck Genomic Data, kmer=101

##### UPDATE 20180413

Assembly complete. See end of post for data locations.

##### UPDATE 20180410

SLURM Job_id=156637 Name=20180405_sparse_assembler_kmer101_geo Ended, Run time 4-20:17:08, CANCELLED, ExitCode 0

After talking to Steven, it turns out Mox was taken offline for maintenance, which killed all jobs (and access). Ugh.

Will restart tonight once Mox is back online.

OK, here we go! Initiated an assembly run using SparseAssembler on our Mox HPC node on all of our geoduck genomic sequencing data:

Kmer size set to 101.

This is 118 files of sequencing data!! Fingers crossed…

Slurm script: 20180405_sparse_assembler_kmer101_geoduck_slurm.sh


#!/bin/bash
## Job Name
#SBATCH --job-name=20180405_sparse_assembler_kmer101_geo
## Allocation Definition
#SBATCH --account=srlab
#SBATCH --partition=srlab
## Resources
## Nodes (We only get 1, so this is fixed)
#SBATCH --nodes=1
## Walltime (days-hours:minutes:seconds format)
#SBATCH --time=30-00:00:00
## Memory per node
#SBATCH --mem=500G
#SBATCH --mail-type=ALL
#SBATCH --mail-user=samwhite@uw.edu
## Specify the working directory for this job
#SBATCH --workdir=/gscratch/scrubbed/samwhite/20180405_sparseassembler_kmer101_geoduck

/gscratch/srlab/programs/SparseAssembler/SparseAssembler \
LD 0 \
NodeCovTh 1 \
EdgeCovTh 0 \
k 101 \
g 15 \
PathCovTh 100 \
GS 2200000000 \
i1 /gscratch/scrubbed/samwhite/20180129_trimgalore_geoduck_novaseq/NR005_S4_L001_R1_001_val_1_val_1.fastq \
i2 /gscratch/scrubbed/samwhite/20180129_trimgalore_geoduck_novaseq/NR005_S4_L001_R2_001_val_2_val_2.fastq \
i1 /gscratch/scrubbed/samwhite/20180129_trimgalore_geoduck_novaseq/NR005_S4_L002_R1_001_val_1_val_1.fastq \
i2 /gscratch/scrubbed/samwhite/20180129_trimgalore_geoduck_novaseq/NR005_S4_L002_R2_001_val_2_val_2.fastq \
i1 /gscratch/scrubbed/samwhite/20180129_trimgalore_geoduck_novaseq/NR006_S3_L001_R1_001_val_1_val_1.fastq \
i2 /gscratch/scrubbed/samwhite/20180129_trimgalore_geoduck_novaseq/NR006_S3_L001_R2_001_val_2_val_2.fastq \
i1 /gscratch/scrubbed/samwhite/20180129_trimgalore_geoduck_novaseq/NR006_S3_L002_R1_001_val_1_val_1.fastq \
i2 /gscratch/scrubbed/samwhite/20180129_trimgalore_geoduck_novaseq/NR006_S3_L002_R2_001_val_2_val_2.fastq \
i1 /gscratch/scrubbed/samwhite/20180129_trimgalore_geoduck_novaseq/NR012_S1_L001_R1_001_val_1_val_1.fastq \
i2 /gscratch/scrubbed/samwhite/20180129_trimgalore_geoduck_novaseq/NR012_S1_L001_R2_001_val_2_val_2.fastq \
i1 /gscratch/scrubbed/samwhite/20180129_trimgalore_geoduck_novaseq/NR012_S1_L002_R1_001_val_1_val_1.fastq \
i2 /gscratch/scrubbed/samwhite/20180129_trimgalore_geoduck_novaseq/NR012_S1_L002_R2_001_val_2_val_2.fastq \
i1 /gscratch/scrubbed/samwhite/20180129_trimgalore_geoduck_novaseq/NR019_S7_L001_R1_001_val_1_val_1.fastq \
i2 /gscratch/scrubbed/samwhite/20180129_trimgalore_geoduck_novaseq/NR019_S7_L001_R2_001_val_2_val_2.fastq \
i1 /gscratch/scrubbed/samwhite/20180129_trimgalore_geoduck_novaseq/NR019_S7_L002_R1_001_val_1_val_1.fastq \
i2 /gscratch/scrubbed/samwhite/20180129_trimgalore_geoduck_novaseq/NR019_S7_L002_R2_001_val_2_val_2.fastq \
i1 /gscratch/scrubbed/samwhite/20180129_trimgalore_geoduck_novaseq/NR021_S8_L001_R1_001_val_1_val_1.fastq \
i2 /gscratch/scrubbed/samwhite/20180129_trimgalore_geoduck_novaseq/NR021_S8_L001_R2_001_val_2_val_2.fastq \
i1 /gscratch/scrubbed/samwhite/20180129_trimgalore_geoduck_novaseq/NR021_S8_L002_R1_001_val_1_val_1.fastq \
i2 /gscratch/scrubbed/samwhite/20180129_trimgalore_geoduck_novaseq/NR021_S8_L002_R2_001_val_2_val_2.fastq \
i1 /gscratch/scrubbed/samwhite/bgi_geoduck/151114_I191_FCH3Y35BCXX_L1_wHAIPI023989-79_1.fastq \
i2 /gscratch/scrubbed/samwhite/bgi_geoduck/151114_I191_FCH3Y35BCXX_L1_wHAIPI023989-79_2.fastq \
i1 /gscratch/scrubbed/samwhite/bgi_geoduck/151114_I191_FCH3Y35BCXX_L2_wHAMPI023988-81_1.fastq \
i2 /gscratch/scrubbed/samwhite/bgi_geoduck/151114_I191_FCH3Y35BCXX_L2_wHAMPI023988-81_2.fastq \
i1 /gscratch/scrubbed/samwhite/bgi_geoduck/151122_I136_FCH3L2FBBXX_L7_wHAXPI023990-97_1.fastq \
i2 /gscratch/scrubbed/samwhite/bgi_geoduck/151122_I136_FCH3L2FBBXX_L7_wHAXPI023990-97_2.fastq \
i1 /gscratch/scrubbed/samwhite/bgi_geoduck/160103_I137_FCH3V5YBBXX_L5_WHPANwalDDABDLAAPEI-100_1.fastq \
i2 /gscratch/scrubbed/samwhite/bgi_geoduck/160103_I137_FCH3V5YBBXX_L5_WHPANwalDDABDLAAPEI-100_2.fastq \
i1 /gscratch/scrubbed/samwhite/bgi_geoduck/160103_I137_FCH3V5YBBXX_L5_WHPANwalDDACDTAAPEI-102_1.fastq \
i2 /gscratch/scrubbed/samwhite/bgi_geoduck/160103_I137_FCH3V5YBBXX_L5_WHPANwalDDACDTAAPEI-102_2.fastq \
i1 /gscratch/scrubbed/samwhite/bgi_geoduck/160103_I137_FCH3V5YBBXX_L6_WHPANwalDDABDLAAPEI-100_1.fastq \
i2 /gscratch/scrubbed/samwhite/bgi_geoduck/160103_I137_FCH3V5YBBXX_L6_WHPANwalDDABDLAAPEI-100_2.fastq \
i1 /gscratch/scrubbed/samwhite/bgi_geoduck/160103_I137_FCH3V5YBBXX_L6_WHPANwalDDACDTAAPEI-102_1.fastq \
i2 /gscratch/scrubbed/samwhite/bgi_geoduck/160103_I137_FCH3V5YBBXX_L6_WHPANwalDDACDTAAPEI-102_2.fastq \
i1 /gscratch/scrubbed/samwhite/illumina_geoduck_hiseq/20180328_trim_galore_illumina_hiseq_geoduck/Geoduck-NMP-gDNA-1_S1_L001_R1_001_val_1.fastq \
i2 /gscratch/scrubbed/samwhite/illumina_geoduck_hiseq/20180328_trim_galore_illumina_hiseq_geoduck/Geoduck-NMP-gDNA-1_S1_L001_R2_001_val_2.fastq \
i1 /gscratch/scrubbed/samwhite/illumina_geoduck_hiseq/20180328_trim_galore_illumina_hiseq_geoduck/Geoduck-NMP-gDNA-2_S5_L002_R1_001_val_1.fastq \
i2 /gscratch/scrubbed/samwhite/illumina_geoduck_hiseq/20180328_trim_galore_illumina_hiseq_geoduck/Geoduck-NMP-gDNA-2_S5_L002_R2_001_val_2.fastq \
i1 /gscratch/scrubbed/samwhite/illumina_geoduck_hiseq/20180328_trim_galore_illumina_hiseq_geoduck/Geoduck-NMP-gDNA-2to4kb-1_S2_L001_R1_001_val_1.fastq \
i2 /gscratch/scrubbed/samwhite/illumina_geoduck_hiseq/20180328_trim_galore_illumina_hiseq_geoduck/Geoduck-NMP-gDNA-2to4kb-1_S2_L001_R2_001_val_2.fastq \
i1 /gscratch/scrubbed/samwhite/illumina_geoduck_hiseq/20180328_trim_galore_illumina_hiseq_geoduck/Geoduck-NMP-gDNA-2to4kb-2_S6_L002_R1_001_val_1.fastq \
i2 /gscratch/scrubbed/samwhite/illumina_geoduck_hiseq/20180328_trim_galore_illumina_hiseq_geoduck/Geoduck-NMP-gDNA-2to4kb-2_S6_L002_R2_001_val_2.fastq \
i1 /gscratch/scrubbed/samwhite/illumina_geoduck_hiseq/20180328_trim_galore_illumina_hiseq_geoduck/Geoduck-NMP-gDNA-2to4kb-3_S10_L003_R1_001_val_1.fastq \
i2 /gscratch/scrubbed/samwhite/illumina_geoduck_hiseq/20180328_trim_galore_illumina_hiseq_geoduck/Geoduck-NMP-gDNA-2to4kb-3_S10_L003_R2_001_val_2.fastq \
i1 /gscratch/scrubbed/samwhite/illumina_geoduck_hiseq/20180328_trim_galore_illumina_hiseq_geoduck/Geoduck-NMP-gDNA-2to4kb-4_S14_L004_R1_001_val_1.fastq \
i2 /gscratch/scrubbed/samwhite/illumina_geoduck_hiseq/20180328_trim_galore_illumina_hiseq_geoduck/Geoduck-NMP-gDNA-2to4kb-4_S14_L004_R2_001_val_2.fastq \
i1 /gscratch/scrubbed/samwhite/illumina_geoduck_hiseq/20180328_trim_galore_illumina_hiseq_geoduck/Geoduck-NMP-gDNA-2to4kb-5_S18_L005_R1_001_val_1.fastq \
i2 /gscratch/scrubbed/samwhite/illumina_geoduck_hiseq/20180328_trim_galore_illumina_hiseq_geoduck/Geoduck-NMP-gDNA-2to4kb-5_S18_L005_R2_001_val_2.fastq \
i1 /gscratch/scrubbed/samwhite/illumina_geoduck_hiseq/20180328_trim_galore_illumina_hiseq_geoduck/Geoduck-NMP-gDNA-2to4kb-6_S22_L006_R1_001_val_1.fastq \
i2 /gscratch/scrubbed/samwhite/illumina_geoduck_hiseq/20180328_trim_galore_illumina_hiseq_geoduck/Geoduck-NMP-gDNA-2to4kb-6_S22_L006_R2_001_val_2.fastq \
i1 /gscratch/scrubbed/samwhite/illumina_geoduck_hiseq/20180328_trim_galore_illumina_hiseq_geoduck/Geoduck-NMP-gDNA-2to4kb-7_S26_L007_R1_001_val_1.fastq \
i2 /gscratch/scrubbed/samwhite/illumina_geoduck_hiseq/20180328_trim_galore_illumina_hiseq_geoduck/Geoduck-NMP-gDNA-2to4kb-7_S26_L007_R2_001_val_2.fastq \
i1 /gscratch/scrubbed/samwhite/illumina_geoduck_hiseq/20180328_trim_galore_illumina_hiseq_geoduck/Geoduck-NMP-gDNA-2to4kb-8_S30_L008_R1_001_val_1.fastq \
i2 /gscratch/scrubbed/samwhite/illumina_geoduck_hiseq/20180328_trim_galore_illumina_hiseq_geoduck/Geoduck-NMP-gDNA-2to4kb-8_S30_L008_R2_001_val_2.fastq \
i1 /gscratch/scrubbed/samwhite/illumina_geoduck_hiseq/20180328_trim_galore_illumina_hiseq_geoduck/Geoduck-NMP-gDNA-3_S9_L003_R1_001_val_1.fastq \
i2 /gscratch/scrubbed/samwhite/illumina_geoduck_hiseq/20180328_trim_galore_illumina_hiseq_geoduck/Geoduck-NMP-gDNA-3_S9_L003_R2_001_val_2.fastq \
i1 /gscratch/scrubbed/samwhite/illumina_geoduck_hiseq/20180328_trim_galore_illumina_hiseq_geoduck/Geoduck-NMP-gDNA-4_S13_L004_R1_001_val_1.fastq \
i2 /gscratch/scrubbed/samwhite/illumina_geoduck_hiseq/20180328_trim_galore_illumina_hiseq_geoduck/Geoduck-NMP-gDNA-4_S13_L004_R2_001_val_2.fastq \
i1 /gscratch/scrubbed/samwhite/illumina_geoduck_hiseq/20180328_trim_galore_illumina_hiseq_geoduck/Geoduck-NMP-gDNA-5_S17_L005_R1_001_val_1.fastq \
i2 /gscratch/scrubbed/samwhite/illumina_geoduck_hiseq/20180328_trim_galore_illumina_hiseq_geoduck/Geoduck-NMP-gDNA-5_S17_L005_R2_001_val_2.fastq \
i1 /gscratch/scrubbed/samwhite/illumina_geoduck_hiseq/20180328_trim_galore_illumina_hiseq_geoduck/Geoduck-NMP-gDNA-5to7kb-1_S3_L001_R1_001_val_1.fastq \
i2 /gscratch/scrubbed/samwhite/illumina_geoduck_hiseq/20180328_trim_galore_illumina_hiseq_geoduck/Geoduck-NMP-gDNA-5to7kb-1_S3_L001_R2_001_val_2.fastq \
i1 /gscratch/scrubbed/samwhite/illumina_geoduck_hiseq/20180328_trim_galore_illumina_hiseq_geoduck/Geoduck-NMP-gDNA-5to7kb-2_S7_L002_R1_001_val_1.fastq \
i2 /gscratch/scrubbed/samwhite/illumina_geoduck_hiseq/20180328_trim_galore_illumina_hiseq_geoduck/Geoduck-NMP-gDNA-5to7kb-2_S7_L002_R2_001_val_2.fastq \
i1 /gscratch/scrubbed/samwhite/illumina_geoduck_hiseq/20180328_trim_galore_illumina_hiseq_geoduck/Geoduck-NMP-gDNA-5to7kb-3_S11_L003_R1_001_val_1.fastq \
i2 /gscratch/scrubbed/samwhite/illumina_geoduck_hiseq/20180328_trim_galore_illumina_hiseq_geoduck/Geoduck-NMP-gDNA-5to7kb-3_S11_L003_R2_001_val_2.fastq \
i1 /gscratch/scrubbed/samwhite/illumina_geoduck_hiseq/20180328_trim_galore_illumina_hiseq_geoduck/Geoduck-NMP-gDNA-5to7kb-4_S15_L004_R1_001_val_1.fastq \
i2 /gscratch/scrubbed/samwhite/illumina_geoduck_hiseq/20180328_trim_galore_illumina_hiseq_geoduck/Geoduck-NMP-gDNA-5to7kb-4_S15_L004_R2_001_val_2.fastq \
i1 /gscratch/scrubbed/samwhite/illumina_geoduck_hiseq/20180328_trim_galore_illumina_hiseq_geoduck/Geoduck-NMP-gDNA-5to7kb-5_S19_L005_R1_001_val_1.fastq \
i2 /gscratch/scrubbed/samwhite/illumina_geoduck_hiseq/20180328_trim_galore_illumina_hiseq_geoduck/Geoduck-NMP-gDNA-5to7kb-5_S19_L005_R2_001_val_2.fastq \
i1 /gscratch/scrubbed/samwhite/illumina_geoduck_hiseq/20180328_trim_galore_illumina_hiseq_geoduck/Geoduck-NMP-gDNA-5to7kb-6_S23_L006_R1_001_val_1.fastq \
i2 /gscratch/scrubbed/samwhite/illumina_geoduck_hiseq/20180328_trim_galore_illumina_hiseq_geoduck/Geoduck-NMP-gDNA-5to7kb-6_S23_L006_R2_001_val_2.fastq \
i1 /gscratch/scrubbed/samwhite/illumina_geoduck_hiseq/20180328_trim_galore_illumina_hiseq_geoduck/Geoduck-NMP-gDNA-5to7kb-7_S27_L007_R1_001_val_1.fastq \
i2 /gscratch/scrubbed/samwhite/illumina_geoduck_hiseq/20180328_trim_galore_illumina_hiseq_geoduck/Geoduck-NMP-gDNA-5to7kb-7_S27_L007_R2_001_val_2.fastq \
i1 /gscratch/scrubbed/samwhite/illumina_geoduck_hiseq/20180328_trim_galore_illumina_hiseq_geoduck/Geoduck-NMP-gDNA-5to7kb-8_S31_L008_R1_001_val_1.fastq \
i2 /gscratch/scrubbed/samwhite/illumina_geoduck_hiseq/20180328_trim_galore_illumina_hiseq_geoduck/Geoduck-NMP-gDNA-5to7kb-8_S31_L008_R2_001_val_2.fastq \
i1 /gscratch/scrubbed/samwhite/illumina_geoduck_hiseq/20180328_trim_galore_illumina_hiseq_geoduck/Geoduck-NMP-gDNA-6_S21_L006_R1_001_val_1.fastq \
i2 /gscratch/scrubbed/samwhite/illumina_geoduck_hiseq/20180328_trim_galore_illumina_hiseq_geoduck/Geoduck-NMP-gDNA-6_S21_L006_R2_001_val_2.fastq \
i1 /gscratch/scrubbed/samwhite/illumina_geoduck_hiseq/20180328_trim_galore_illumina_hiseq_geoduck/Geoduck-NMP-gDNA-7_S25_L007_R1_001_val_1.fastq \
i2 /gscratch/scrubbed/samwhite/illumina_geoduck_hiseq/20180328_trim_galore_illumina_hiseq_geoduck/Geoduck-NMP-gDNA-7_S25_L007_R2_001_val_2.fastq \
i1 /gscratch/scrubbed/samwhite/illumina_geoduck_hiseq/20180328_trim_galore_illumina_hiseq_geoduck/Geoduck-NMP-gDNA-8_S29_L008_R1_001_val_1.fastq \
i2 /gscratch/scrubbed/samwhite/illumina_geoduck_hiseq/20180328_trim_galore_illumina_hiseq_geoduck/Geoduck-NMP-gDNA-8_S29_L008_R2_001_val_2.fastq \
i1 /gscratch/scrubbed/samwhite/illumina_geoduck_hiseq/20180328_trim_galore_illumina_hiseq_geoduck/Geoduck-NMP-gDNA-8to10kb-1_S4_L001_R1_001_val_1.fastq \
i2 /gscratch/scrubbed/samwhite/illumina_geoduck_hiseq/20180328_trim_galore_illumina_hiseq_geoduck/Geoduck-NMP-gDNA-8to10kb-1_S4_L001_R2_001_val_2.fastq \
i1 /gscratch/scrubbed/samwhite/illumina_geoduck_hiseq/20180328_trim_galore_illumina_hiseq_geoduck/Geoduck-NMP-gDNA-8to10kb-2_S8_L002_R1_001_val_1.fastq \
i2 /gscratch/scrubbed/samwhite/illumina_geoduck_hiseq/20180328_trim_galore_illumina_hiseq_geoduck/Geoduck-NMP-gDNA-8to10kb-2_S8_L002_R2_001_val_2.fastq \
i1 /gscratch/scrubbed/samwhite/illumina_geoduck_hiseq/20180328_trim_galore_illumina_hiseq_geoduck/Geoduck-NMP-gDNA-8to10kb-3_S12_L003_R1_001_val_1.fastq \
i2 /gscratch/scrubbed/samwhite/illumina_geoduck_hiseq/20180328_trim_galore_illumina_hiseq_geoduck/Geoduck-NMP-gDNA-8to10kb-3_S12_L003_R2_001_val_2.fastq \
i1 /gscratch/scrubbed/samwhite/illumina_geoduck_hiseq/20180328_trim_galore_illumina_hiseq_geoduck/Geoduck-NMP-gDNA-8to10kb-4_S16_L004_R1_001_val_1.fastq \
i2 /gscratch/scrubbed/samwhite/illumina_geoduck_hiseq/20180328_trim_galore_illumina_hiseq_geoduck/Geoduck-NMP-gDNA-8to10kb-4_S16_L004_R2_001_val_2.fastq \
i1 /gscratch/scrubbed/samwhite/illumina_geoduck_hiseq/20180328_trim_galore_illumina_hiseq_geoduck/Geoduck-NMP-gDNA-8to10kb-5_S20_L005_R1_001_val_1.fastq \
i2 /gscratch/scrubbed/samwhite/illumina_geoduck_hiseq/20180328_trim_galore_illumina_hiseq_geoduck/Geoduck-NMP-gDNA-8to10kb-5_S20_L005_R2_001_val_2.fastq \
i1 /gscratch/scrubbed/samwhite/illumina_geoduck_hiseq/20180328_trim_galore_illumina_hiseq_geoduck/Geoduck-NMP-gDNA-8to10kb-6_S24_L006_R1_001_val_1.fastq \
i2 /gscratch/scrubbed/samwhite/illumina_geoduck_hiseq/20180328_trim_galore_illumina_hiseq_geoduck/Geoduck-NMP-gDNA-8to10kb-6_S24_L006_R2_001_val_2.fastq \
i1 /gscratch/scrubbed/samwhite/illumina_geoduck_hiseq/20180328_trim_galore_illumina_hiseq_geoduck/Geoduck-NMP-gDNA-8to10kb-7_S28_L007_R1_001_val_1.fastq \
i2 /gscratch/scrubbed/samwhite/illumina_geoduck_hiseq/20180328_trim_galore_illumina_hiseq_geoduck/Geoduck-NMP-gDNA-8to10kb-7_S28_L007_R2_001_val_2.fastq \
i1 /gscratch/scrubbed/samwhite/illumina_geoduck_hiseq/20180328_trim_galore_illumina_hiseq_geoduck/Geoduck-NMP-gDNA-8to10kb-8_S32_L008_R1_001_val_1.fastq \
i2 /gscratch/scrubbed/samwhite/illumina_geoduck_hiseq/20180328_trim_galore_illumina_hiseq_geoduck/Geoduck-NMP-gDNA-8to10kb-8_S32_L008_R2_001_val_2.fastq
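
Typing out every `i1`/`i2` pair by hand is easy to get wrong with this many files. Here is a minimal sketch of building the same argument list from a directory listing. `paired_args` is a hypothetical helper (not part of SparseAssembler), and it assumes each read-1 file has a mate that differs only in its suffix, as the file names above do.

```shell
#!/bin/bash
# Hypothetical helper: print "i1 <R1> i2 <R2> " for every read pair in a directory.
# Usage: paired_args <dir> <read1_suffix> <read2_suffix>
paired_args() {
  local dir="$1" r1_suffix="$2" r2_suffix="$3" r1 r2
  for r1 in "$dir"/*"$r1_suffix"; do
    [ -e "$r1" ] || continue              # glob matched nothing
    r2="${r1%"$r1_suffix"}$r2_suffix"     # swap the read-1 suffix for the read-2 suffix
    if [ ! -e "$r2" ]; then
      echo "missing mate for $r1" >&2
      return 1
    fi
    printf 'i1 %s i2 %s ' "$r1" "$r2"
  done
}

# Example (paths contain no spaces, so unquoted expansion is acceptable here):
# args="$(paired_args /gscratch/scrubbed/samwhite/bgi_geoduck _1.fastq _2.fastq)"
# /gscratch/srlab/programs/SparseAssembler/SparseAssembler LD 0 k 101 $args
```

The function stops with a non-zero exit status if any read-1 file lacks a mate, which should catch an incomplete transfer before a multi-day job starts.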

##### Results:

Output folder: 20180405_sparseassembler_kmer101_geoduck/

Slurm output files:

SparseAssembler Assembly (FASTA): Contigs.txt

Added this to the GitHub wiki for our geoduck genome assemblies.

# Gunzip – BGI HiSeq Geoduck Genome Sequencing Data

In preparation to run SparseAssembler, I needed to gunzip the BGI gzipped FASTQ files from 20180327.

Ran the following slurm script on our Mox node:


#!/bin/bash
## Job Name
#SBATCH --job-name=20180405_geoduck_bgi_gunzip
## Allocation Definition
#SBATCH --account=srlab
#SBATCH --partition=srlab
## Resources
## Nodes (We only get 1, so this is fixed)
#SBATCH --nodes=1
## Walltime (days-hours:minutes:seconds format)
#SBATCH --time=30-00:00:00
## Memory per node
#SBATCH --mem=500G
#SBATCH --mail-type=ALL
#SBATCH --mail-user=samwhite@uw.edu
## Specify the working directory for this job
#SBATCH --workdir=/gscratch/scrubbed/samwhite/bgi_geoduck

for i in /gscratch/scrubbed/samwhite/bgi_geoduck/*.gz; do
  filename="${i##*/}"
  no_ext="${filename%%.*}"
  gunzip < "$i" > "$no_ext".fastq
done
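
Since one of these files had been corrupted in an earlier transfer, a variant of the loop that tests each archive first might look like the sketch below. `gunzip_checked` is a hypothetical name. Unlike the script above, it strips only the trailing `.gz`, so a `.fq.gz` input would come out as `.fq` rather than `.fastq`.

```shell
#!/bin/bash
# Hypothetical helper: decompress every .gz file in a directory,
# skipping any archive that fails gzip's built-in integrity test.
gunzip_checked() {
  local dir="$1" gz out
  for gz in "$dir"/*.gz; do
    [ -e "$gz" ] || continue          # glob matched nothing
    if ! gzip -t "$gz" 2>/dev/null; then
      echo "corrupt, skipped: $gz" >&2
      continue
    fi
    out="${gz%.gz}"                   # strip only the trailing .gz
    gunzip -c "$gz" > "$out"          # keep the original archive in place
  done
}

# Example:
# gunzip_checked /gscratch/scrubbed/samwhite/bgi_geoduck
```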

##### Results:

Completed in ~45mins. Will proceed with massive geoduck genome assembly!

# FastQC/MultiQC – BGI Geoduck Genome Sequencing Data

Since running SparseAssembler seems to be working and actually able to produce assemblies, I’ve decided I’ll try to beef up the geoduck genome assembly with the rest of our existing genomic sequencing data.

I transferred our BGI geoduck FASTQ files to our Mox node (/gscratch/scrubbed/samwhite/bgi_geoduck/).

I ran FASTQC on them to actually check them out and see if they needed any trimming, as I don’t believe this has been done!

FASTQC slurm script: 20180327_fastqc_bgi_geoduck_slurm.sh

Side note: Initial FASTQC failed on one file. Turns out, it got corrupted during transfer! Serves as a good reminder about the importance of verifying MD5 checksums after file transfer, prior to attempting to work with files!
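
A minimal sketch of that check, assuming GNU `md5sum` is available (macOS ships `md5` instead). The helper names `write_md5` and `check_md5` and the `checksums.md5` file name are made up for illustration.

```shell
#!/bin/bash
# Hypothetical helpers: record checksums at the source, verify them at the destination.

# write_md5 <dir> <glob>: write checksums.md5 inside <dir> for files matching <glob>
write_md5() {
  ( cd "$1" && md5sum -- $2 > checksums.md5 )   # $2 is left unquoted so the glob expands
}

# check_md5 <dir>: exit non-zero if any listed file is missing or differs
check_md5() {
  ( cd "$1" && md5sum --quiet -c checksums.md5 )
}

# Example:
# write_md5 /path/on/source '*.fq.gz'      # before the transfer
# check_md5 /path/on/destination           # after copying the files and checksums.md5
```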

This was followed up with MultiQC (run locally from my computer on the files hosted on Owl). This was performed the following day (20180328).

##### Results:

FASTQC output: 20180327_bgi_fastqc

MultiQC output: 20180328_bgi_multiqc

MultiQC HTML report: 20180328_bgi_multiqc/multiqc_report.html

Everything looks nice and clean! Waiting on transfer and FASTQC of Illumina NMP data before proceeding to next assembly attempt.

# Genome Assembly – Olympia Oyster Illumina & PacBio Using PB Jelly w/BGI Scaffold Assembly

Yesterday, I ran PB Jelly using Sean’s Platanus assembly, but that didn’t produce an assembly because PB Jelly was expecting gaps in the Illumina reference assembly (i.e. scaffolds, not contigs).

Re-ran this using the BGI Illumina scaffolds FASTA.
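
A quick way to tell whether a reference FASTA contains gaps at all (scaffolds) or none (contigs) is to count runs of `N`. This is a sketch only: `count_gaps` is a hypothetical helper and is not part of PB Jelly.

```shell
#!/bin/bash
# Hypothetical helper: count runs of N/n (gaps) across all records in a FASTA file.
count_gaps() {
  awk '
    # On each header, tally the gaps in the record just finished, then reset.
    /^>/ { if (seq != "") n += gsub(/[Nn]+/, "", seq); seq = ""; next }
         { seq = seq $0 }                       # join wrapped sequence lines
    END  { if (seq != "") n += gsub(/[Nn]+/, "", seq); print n + 0 }
  ' "$1"
}

# Example (file name is a placeholder):
# count_gaps reference.fasta    # 0 means there are no gaps to fill
```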

Here’s a brief rundown of how this was run:

See the Jupyter Notebook for full details of run (see Results section below).

#### Results:

OK! This seems to have worked (and it was quick, like less than an hour!), as it actually produced a FASTA file! Will run QUAST with this and some assemblies to compare assembly stats. Have added this assembly to our Olympia oyster genome assemblies table.

Jupyter Notebook (GitHub): 20171114_emu_pbjelly_BGI_scaffold.ipynb

# Project Progress – Olympia Oyster Genome Assemblies by Sean Bennett

Here’s a brief overview of what Sean has done on the Oly genome assembly front.

Metassembler

• Assemble his BGI assembly and Platanus assembly? Confusing terms here; not sure what he means.
• Failed due to 32-bit vs. 64-bit installation of MUMmer. He didn’t have the chance to re-compile MUMmer as 64-bit. However, a recent MUMmer announcement suggests that MUMmer can now handle genomes of unlimited size.
• I believe he was planning on using (or was using?) GARM, which relies upon MUMmer and may also include a version of MUMmer (outdated version that led to Sean’s error message?).
• Notebook entry

Canu

Redundans

Platanus

# Data Management – SRA Submission Oly GBS Batch Submission

An earlier attempt at submitting these files failed.

I re-uploaded the failed files (indicated in my previous notebook entry linked above) and tried again.

It failed again, despite having successfully uploaded just minutes before.

I re-uploaded that “missing” file and tried again.

This time, it succeeded (and no end-of-stream error for the 1SN_1A file!)!

Will post here with the SRA accession number once it goes live!

# Computing – Oly BGI GBS Reproducibility; fail?

OK, so things have improved since the last attempt at getting this BGI script to run and demultiplex the raw data.

I played around with the index.lst file format (based on the error I received last time, it seemed like a good possibility that the file formatting was incorrect) and actually got the script to run to completion! Granted, it took over 16hrs (!!), but it completed!

See the Jupyter notebook link below.

Results:

Well, although the script finished and kicked out all the demultiplexed FASTQ files, the contents of the FASTQ files don’t match the original set of demultiplexed files: the read counts differ between these results and the BGI files. I’m not entirely sure if this is to be expected or not, since the script allows for a single nucleotide mismatch when demultiplexing. Is it possible that the mismatch could be interpreted slightly differently each time this is run? I’m not certain.

Theoretically, you should get the same results every time…

Maybe I’ll re-run this again over the weekend and see how the results compare to this run and the original BGI demultiplexing…
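
For comparing runs, read counts per file are a cheap first check. A minimal sketch, assuming standard four-line FASTQ records; `count_reads` is a hypothetical helper.

```shell
#!/bin/bash
# Hypothetical helper: count reads in a FASTQ file (4 lines per record), gzipped or not.
count_reads() {
  local lines
  case "$1" in
    *.gz) lines="$(gzip -dc "$1" | wc -l)" ;;
    *)    lines="$(wc -l < "$1")" ;;
  esac
  echo $(( lines / 4 ))
}

# Example: print counts side by side for two demultiplexing runs
# (directory names are placeholders).
# for f in run1/*.fq.gz; do
#   printf '%s\t%s\t%s\n' "${f##*/}" "$(count_reads "$f")" "$(count_reads "run2/${f##*/}")"
# done
```

Matching counts would not prove the files are identical, but differing counts pinpoint which samples to look at first.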

Jupyter notebook (GitHub): 20170314_docker_Oly_BGI_GBS_demultiplexing_reproducibility.ipynb

# Computing – Oly BGI GBS Reproducibility Fail (but, less so than last time)…

Well, my previous attempt at reproducing the demultiplexing that BGI performed was an exercise in futility. BGI got back to me with the following message:

Hi Sam,

We downloaded it and it seems fine when compiling. You can compile it with the below command under Linux system.

tar -zxvf ReSeqTools_XXX.tar.gz ; cd iTools_Code; chmod 775 iTools ; ./ iTools -h

I gave that a whirl and got the following message:

Error opening terminal: xterm

Some internet searching got me sucked into a useless black hole about 64 bit systems running 32 bit programs and enabling the 64 bit kernel on Mac OS X 10.7.5 (Lion) since it’s not enabled by default and on and on. In the end, I can’t seem to enable the 64 bit kernel on my Mac Pro, likely due to hardware limitations related to the graphics card and/or displays that are connected.

Anyway, I decided to try getting this program installed again, using a Docker container (instead of trying to install locally on my Mac).

Results:

It didn’t work again, but for a different reason! Despite the instructions in the readme file provided with iTools, you don’t actually need to run make! All that has to be done is unzipping the tarball!! However, despite figuring this out, the program fails with the following error message: “Warming : sample double in this INDEX Files. Sample ID: OYSzenG1AAD96FAAPEI-109; please renamed it diff” (note: this is copied/pasted – the spelling errors are not mine). So, I think there’s something wrong with the formatting of the index file that BGI provided me with.
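
One way to check whether the index file really does list a sample twice is sketched below. It assumes the sample ID is the first whitespace-delimited column of index.lst, which is a guess about the file layout. `dup_ids` is a hypothetical helper.

```shell
#!/bin/bash
# Hypothetical helper: print any value that appears more than once in column 1.
# Assumption: column 1 of the index file holds the sample ID.
dup_ids() {
  awk '{ print $1 }' "$1" | sort | uniq -d
}

# Example:
# dup_ids index.lst    # no output means every ID in column 1 is unique
```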

See the Jupyter notebook linked below to see what I tried.

Jupyter notebook (GitHub): 20170314_docker_Oly_BGI_GBS_demultiplexing_reproducibility.ipynb

# Computing – Oly BGI GBS Reproducibility Fail

Since we’re preparing a manuscript that relies on BGI’s manipulation/handling of the genotype-by-sequencing data, I attempted to reproduce the demultiplexing steps that BGI used in order to perform the SNP/genotyping on these samples.

The key word in the above sentence is “attempted.” Ugh, what a massive waste of time it turned out to be. I’ve contacted BGI to get some help on this.

In the meantime, here’s a brief (actually, not as brief as I’d like) rundown of my struggles.

The demultiplexing software that BGI used is something called “iTools” which is bundled in this GitHub repo: Reseqtools

To demultiplex, they ran a script called: split.sh

The script seems fairly straightforward. Here is what it contains:

iTools Fqtools splitpool
-Index index.lst
-Flag enzyme.txt
-MisMatch
-OutDir split


It tells the iTools program to use the Fqtools tool “splitpool” to operate on a pair of gzipped FASTQ files. It also utilizes an index file (index.lst) which contains all the barcodes needed to identify, and separate, the individual samples that were combined prior to sequencing.

The first bump in the road is the -Flag enzyme.txt portion of the code. BGI did not provide me with this file. I recently asked them to send it to me (or its contents, since I suspected it was only a single-line text file). They sent me the contents of the file:

CAGC
CTGC

The next problem is that neither of those two sequences is the recognition site for the enzyme that was (supposedly) used: ApeKI. The recognition site for ApeKI is: GCWGC
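
For what it’s worth, the two sequences do line up with that site. W is the IUPAC code for A or T, so GCWGC expands to GCAGC and GCTGC, and dropping the leading G gives CAGC and CTGC. ApeKI cuts after the first G (G^CWGC), so this would be consistent with reads starting at the cut site, although that reading is an assumption here and not something BGI confirmed. A small sketch, with `expand_w` as a hypothetical helper:

```shell
#!/bin/bash
# Hypothetical helper: expand the first IUPAC "W" (A or T) in a site into both concrete sequences.
expand_w() {
  echo "$1" | sed 's/W/A/'
  echo "$1" | sed 's/W/T/'
}

# Strip the leading G from each expanded site to see what remains after the cut.
for site in $(expand_w GCWGC); do
  echo "${site#G}"      # prints CAGC, then CTGC
done
```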

Regardless, I decided to see if I could reproduce the demultiplexing using the info they’d provided me.

I cloned the Reseqtools repo, changed into the Reseqtools/iTools directory and typed make.

This resulted in an error informing me that it could not find boost/spirit/core.hpp

Tried to run make in the Reseqtools/iTools directory and got the same error. Realized iTools might not be searching the system $PATH (this turned out to be correct), so I moved the contents of the Boost folder to the iTools folder, ran make, and got the same error. Turns out, the newest version of Boost doesn’t have that core.hpp file any more. Looking at the iTools documentation, iTools was built around Boost 1.44. OMG…

Downloaded Boost 1.44 and went through the same steps as above. This eliminated the missing core.hpp error!

But, of course, led to another error. The error:

"Threading support unavaliable: it has been explicitly disabled with BOOST_DISABLE_THREADS"

That was related to something with newer versions of the GCC compiler (this is, essentially, built into the computer; it’s not worth trying to install/use old versions of GCC) trying to work with old versions of Boost. Found a patch for a config file here: libstdcpp3.hpp.patch

I made the appropriate edits to the file as shown in that link and ran make and it almost worked!

The current error is:

./src/Variants/soapsv-v1.02/include.h:15:16: fatal error: gd.h: No such file or directory

I gave up and contacted BGI to see if they can get me a functional version of iTools…
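
For anyone retracing these steps, both the boost/spirit/core.hpp and gd.h errors come down to a header the compiler cannot find, so it can save time to search for the header before running make. A minimal sketch: `find_header` is a hypothetical helper, and the include directories in the example are common defaults, not paths taken from the iTools build. gd.h belongs to the GD graphics library, so its development files would need to be installed for that last error to go away.

```shell
#!/bin/bash
# Hypothetical helper: look for a header file under one or more include roots.
# Usage: find_header <header> <include_root>...
find_header() {
  local header="$1"
  shift
  find "$@" -type f -path "*/$header" 2>/dev/null
}

# Examples (include roots are assumptions; adjust for the system at hand):
# find_header gd.h /usr/include /usr/local/include
# find_header boost/spirit/core.hpp /usr/include /usr/local/include
```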

# FASTQC – Oly BGI GBS Raw Illumina Data Demultiplexed

Last week, I ran the two raw FASTQ files through FastQC. As expected, FastQC detected “errors”. These errors are due to the presence of adapter sequences, barcodes, and the use of a restriction enzyme (ApeKI) in library preparation. In summary, it’s not surprising that FastQC was not pleased with the data because it’s expecting a “standard” library prep that’s already been trimmed and demultiplexed.

However, just for comparison, I ran the demultiplexed files through FastQC. The Jupyter notebook is linked (GitHub) and embedded below. I recommend viewing the Jupyter notebook on GitHub for easier viewing.

Results:

Pretty much the same, but with slight improvements due to removal of adapter and barcode sequences. The restriction site still leads FastQC to report errors, which is expected.

Links to all of the FastQC output files are at the bottom of the notebook.

Jupyter notebook (GitHub): 20170306_docker_fastqc_demultiplexed_bgi_oly_gbs.ipynb