Tag Archives: mox

Mox – Olympia oyster genome annotation progress (using Maker 2.31.10)

TL;DR – It appears to be continuing where it left off!

I decided to spend some time to figure out what was actually happening, as it’s clear that the annotation process is going to need some additional time to run and may span an additional monthly maintenance shutdown.

This is great, because, otherwise, this will take an eternity to actually complete (particularly because we’d have to move the job to run on one of our lab’s computers – which pale in comparison to the specs of our Mox nodes).

However, it’s a bit shocking that this is taking this long, even on a Mox node!

I started annotating the Olympia oyster genome on 20180529. Since then, the job has been interrupted twice by monthly Mox maintenance (which happens on the 2nd Tuesday of each month). Additionally, when this happens, the SLURM output file is overwritten, making it difficult to assess whether Maker continues where it left off or starts over from scratch.
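(Note for future runs: an sbatch tweak along these lines should keep a requeued job from clobbering its output file. I didn't use it here, so treat it as a sketch; the filename pattern and option names are standard sbatch features.)

## Hypothetical addition to the SLURM header (not part of the script I actually ran):
## %j expands to the job ID, so each submission writes to its own file...
#SBATCH --output=20180529_oly_maker_%j.out
## ...and if the scheduler requeues the same job after an outage, append to the log instead of truncating it.
#SBATCH --open-mode=append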

Anyway, here’s how I deduced that the program is continuing where it left off.

  1. I figured out that it produces a generic feature format (GFF) file for each contig.

  2. Decided to search for the first contig GFF and look at its last modified date. This would tell me if it was newly generated (i.e. on the date that the job was restarted after the maintenance shutdown) or if it was old. Additionally, if there was more than one of these files, then I'd also know that Maker was just starting at the beginning and writing data to a different location.

    This shows:

    1. Only one copy of Contig0.gff exists.

    2. Last modified date is 20180530.

  3. Check the slurm output file for info.

    This reveals an important piece of info:

    MAKER WARNING: The file 20180529_oly_annotation_01.maker.output/20180529_oly_annotation_01_datastore/AC/68/Contig215522//theVoid.Contig215522/0/Contig215522.0.all.rb.out
    did not finish on the last run

All of these taken together lead me to confidently conclude that Maker is not restarting from the beginning and is, indeed, continuing where it left off. WHEW!

Just for kicks, I also ran a count of GFF files to see where this stands so far:

Wow! 622,010 GFFs!!!
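For the record, the checks themselves were nothing fancy. Run from the directory holding the Maker output, something along these lines does it (a sketch; the datastore directory name comes from the warning above):

# Locate the GFF for the first contig and show its modification time;
# a single, old copy means Maker is re-using the existing datastore.
find 20180529_oly_annotation_01.maker.output -name "Contig0.gff" -exec ls -lh {} \;

# Running count of per-contig GFF files written so far.
find 20180529_oly_annotation_01.maker.output -name "*.gff" | wc -l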

Finally, for posterity, here’s the SLURM script I used to submit this job, back in May! I’ll have all of the corresponding genome files, proteome files, transcriptome files, etc. on one of our servers once the job completes.


#!/bin/bash
## Job Name
#SBATCH --job-name=20180529_oly_maker_genome_annotation
## Allocation Definition
#SBATCH --account=srlab
#SBATCH --partition=srlab
## Resources
## Nodes
#SBATCH --nodes=1
## Walltime (days-hours:minutes:seconds format)
#SBATCH --time=30-00:00:00
## Memory per node
#SBATCH --mem=500G
##turn on e-mail notification
#SBATCH --mail-type=ALL
#SBATCH --mail-user=samwhite@uw.edu
## Specify the working directory for this job
#SBATCH --workdir=/gscratch/srlab/sam/outputs/20180529_oly_maker_genome_annotation

## Establish variables for more readable code

### Path to Maker executable
maker=/gscratch/srlab/programs/maker-2.31.10/bin/maker

### Path to Olympia oyster genome FastA file
oly_genome=/gscratch/srlab/sam/data/O_lurida/oly_genome_assemblies/jelly.out.fasta

### Path to Olympia oyster transcriptome FastA file
oly_transcriptome=/gscratch/srlab/sam/data/O_lurida/oly_transcriptome_assemblies/Olurida_transcriptome_v3.fasta

### Path to Crassotrea gigas NCBI protein FastA
gigas_proteome=/gscratch/srlab/sam/data/C_gigas/gigas_ncbi_protein/GCA_000297895.1_oyster_v9_protein.faa

### Path to Crassostrea virginica NCBI protein FastA
virginica_proteome=/gscratch/srlab/sam/data/C_virginica/virginica_ncbi_protein/GCF_002022765.2_C_virginica-3.0_protein.faa

## Create Maker control files needed for running Maker
$maker -CTL

## Store path to options control file
maker_opts_file=./maker_opts.ctl

## Create combined proteome FastA file
touch gigas_virginica_ncbi_proteomes.fasta
cat "$gigas_proteome" >> gigas_virginica_ncbi_proteomes.fasta
cat "$virginica_proteome" >> gigas_virginica_ncbi_proteomes.fasta

## Edit options file

### Set paths to O.lurida genome and transcriptome.
### Set path to combined C. gigas and C.virginica proteomes.
## The use of the % symbol sets the delimiter sed uses for arguments.
## Normally, the delimiter that most examples use is a slash "/".
## But, we need to expand the variables into a full path with slashes, which screws up sed.
## Thus, the use of % symbol instead (it could be any character that is NOT present in the expanded variable; doesn't have to be "%").
sed -i "/^genome=/ s% %$oly_genome %" "$maker_opts_file"
sed -i "/^est=/ s% %$oly_transcriptome %" "$maker_opts_file"
### Point the protein option at the combined proteome FastA created above
### ($(pwd) expands to this job's working directory, giving sed a full path).
sed -i "/^protein=/ s% %$(pwd)/gigas_virginica_ncbi_proteomes.fasta %" "$maker_opts_file"

## Run Maker
### Set basename of files and specify number of CPUs to use
$maker \
-base 20180529_oly_annotation_01 \
-cpus 24
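For anyone puzzled by those sed lines: each one targets the control file line beginning with genome=, est=, or protein= and swaps the first space for the desired path plus a space, which fills in the value while leaving Maker's trailing comment alone. Roughly like this (the exact comment text in a real maker_opts.ctl may differ):

Before: genome= #genome sequence (fasta file or fasta embedded in GFF3 file)
After:  genome=/gscratch/srlab/sam/data/O_lurida/oly_genome_assemblies/jelly.out.fasta #genome sequence (fasta file or fasta embedded in GFF3 file)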

Assembly & Stats – SparseAssembler (k95) on Geoduck Sequence Data > Quast for Stats

Had a successful assembly with SparseAssembler k101, but figured I’d just tweak the kmer setting and throw it in the queue and see how it compares; minimal effort/time needed.

Initiated an assembly run using SparseAssembler on our Mox HPC node on all of our geoduck genomic sequencing data:

Kmer size set to 95.

Slurm script: 20180423_sparse_assembler_kmer95_geoduck_slurm.sh

After the run finished, I copied the files to our server (Owl) and then ran Quast on my computer to gather some assembly stats, using the following command:


/home/sam/software/quast-4.5/quast.py \
-t 24 \
--labels 20180423_sparse_k95 \
/mnt/owl/Athaliana/20180423_sparseassembler_kmer95_geoduck/Contigs.txt

Results:

SparseAssembler output folder: 20180423_sparseassembler_kmer95_geoduck/

SparseAssembler assembly (FastA; 15GB): 20180423_sparseassembler_kmer95_geoduck/Contigs.txt

Quast output folder: quast_results/results_2018_05_10_15_04_07

Quast report (HTML): quast_results/results_2018_05_10_15_04_07/report.html

I’ve embedded the Quast HTML report below, but it may be easier to view by using the link above.

Well, it's remarkable how different this is from the previous SparseAssembler run with the k101 setting!

This assembly doesn't have a single contig >50,000bp, while the previous one had four contigs over that threshold!

Definitely shows what a large impact the kmer setting in assembly software can have on the final assembly!


Assembly – SparseAssembler (k 111) on Geoduck Sequence Data

Continuing to try to find the best kmer setting to use with SparseAssembler after the last attempt failed due to a kmer size that was too large (k 131, which happens to be outside the max kmer size [127] for SparseAssembler), I re-ran SparseAssembler with an arbitrarily selected kmer size < 131 (picked k 111).

The job was run on our Mox HPC node.

Results:

Output folder:

Slurm output file:

This failed with the following error message:

Error! K-mer size too large!

Well, this is disappointing. Not entirely sure why this is the case, as it’s below the max kmer setting for SparseAssembler. However, I’m not terribly surprised, as this happened previously (only using NovaSeq data) with a kmer setting of 117.

I’ve posted an issue on the kmergenie GitHub page; we’ll see what happens.


Assembly – SparseAssembler (k 131) on Geoduck Sequence Data

After some runs with kmergenie, I’ve decided to try re-running SparseAssembler using a kmer setting of 131.

The job was run on our Mox HPC node.

Results:

Output folder:

Slurm output file:

This failed with the following error message:

Error! K-mer size too large!

Looking into this, it's because the maximum kmer size for SparseAssembler is 127! Doh!

It'd be nice if the program checked that setting first before processing all the data files…

A bit disappointing, but I’ll give this a go with a lower kmer setting and see how it goes.


Kmer Estimation – Kmergenie (k 301) on Geoduck Sequence Data

Continuing the quest for the ideal kmer size to use for our geoduck assembly.

The previous two runs with kmergenie using the diploid setting were no good.

So, this time, I simply increased the maximum kmer size to 301 and left all other settings as default. I’m hoping this is large enough to produce a smooth curve, with a maximal value that can be determined from the output graph.
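For reference, the kmergenie call boiled down to something like this (a sketch; -k sets the largest k-mer size to consider, and the input is the same FASTQ list file used in the previous runs):

kmergenie geoduck_fastq_list.txt -k 301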

The job was run on our Mox HPC node.

Results:

Output folder:

Slurm output file:

Kmer histogram (HTML) reports:

Well, the graph is closer to what we'd expect, in that it appears to reach a peak, but after that peak there's a sharp dropoff, as opposed to a gradual dropoff that mirrors the left half. Not entirely sure what the implications of this are, but I'll go ahead and run SparseAssembler using a kmer size of 131 and see how it goes.


Kmer Estimation – Kmergenie Tweaks on Geoduck Sequence Data

Earlier today, I ran kmergenie on all of our geoduck DNA sequencing data to see what it would spit out for an ideal kmer setting, which I would then use in another assembly attempt using SparseAssembler; just to see how the assembly might change.

The output from that kmergenie run suggested that the ideal kmer size exceeded the default maximum (k = 121), so I decided to run kmergenie a few more times, with some slight changes.
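Roughly, the two variants looked like this (a sketch with the SBATCH wrapper and program paths omitted; --diploid and -k are kmergenie's diploid-model and maximum-kmer options):

## Run 1: default settings plus the diploid model
kmergenie geoduck_fastq_list.txt --diploid

## Run 2: diploid model plus a larger maximum kmer size
kmergenie geoduck_fastq_list.txt --diploid -k 301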

All jobs were run on our Mox HPC node.

Run 1
Run 2
Results:

Output folders:

Slurm output files:

Kmer histogram (HTML) reports:

Diploid

Diploid, k 301

Okay, well, these graphs clearly show that the diploid setting is no good.

We should be getting a nice, smooth, concave curve.

Will try running again, without diploid setting and just increasing the max kmer size.


Kmer Estimation – Kmergenie on Geoduck Sequence Data (default settings)

After the last SparseAssembler assembly completed, I wanted to do another run with a different kmer size (last time was arbitrarily set at 101). However, I didn't really know how to decide, particularly since this assembly consisted of mixed read lengths (50bp and 100bp). So, I ran kmergenie on all of our geoduck (Panopea generosa) sequencing data in hopes of getting a kmer determination to apply to my next assembly.

The job was run on our Mox HPC node.

Slurm script: 20180419_kmergenie_geoduck_slurm.sh

Input files list (needed for kmergenie command – see Slurm script linked above): geoduck_fastq_list.txt
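The list file itself is just one FASTQ path per line, which is the multi-file input format kmergenie expects. Something like this would rebuild it, assuming the same trimmed NovaSeq, BGI, and HiSeq directories used for the SparseAssembler runs:

ls /gscratch/scrubbed/samwhite/20180129_trimgalore_geoduck_novaseq/*.fastq \
/gscratch/scrubbed/samwhite/bgi_geoduck/*.fastq \
/gscratch/scrubbed/samwhite/illumina_geoduck_hiseq/20180328_trim_galore_illumina_hiseq_geoduck/*.fastq \
> geoduck_fastq_list.txt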

Results:

Output folder: 20180419_kmergenie_geoduck/

Slurm output file: 20180419_kmergenie_geoduck/slurm-161551.out

Kmer histograms (HTML): 20180419_kmergenie_geoduck/histograms_report.html

Screen cap from Kmer report:

Kmergenie estimates the best kmer size for this data to be 121.

However, based on the kmergenie documentation, this is likely to be inaccurate: the kmer graph should be concave, but ours is only partial – we haven't reached a kmer size where the number of kmers starts decreasing.

As such, I'll try re-running with a different maximum kmer setting (default max is 121).


Genome Assembly – SparseAssembler Geoduck Genomic Data, kmer=101

UPDATE 20180413

Assembly complete. See end of post for data locations.


UPDATE 20180410

Received a status update email:

SLURM Job_id=156637 Name=20180405_sparse_assembler_kmer101_geo Ended, Run time 4-20:17:08, CANCELLED, ExitCode 0

After talking to Steven, it turns out Mox was taken offline for maintenance, which killed all jobs (and access). Ugh.

Will restart tonight once Mox is back online.


OK, here we go! Initiated an assembly run using SparseAssembler on our Mox HPC node on all of our geoduck genomic sequencing data:

Kmer size set to 101.

This is 118 files of sequencing data!! Fingers crossed…

Slurm script: 20180405_sparse_assembler_kmer101_geoduck_slurm.sh


#!/bin/bash
## Job Name
#SBATCH --job-name=20180405_sparse_assembler_kmer101_geo
## Allocation Definition
#SBATCH --account=srlab
#SBATCH --partition=srlab
## Resources
## Nodes (We only get 1, so this is fixed)
#SBATCH --nodes=1
## Walltime (days-hours:minutes:seconds format)
#SBATCH --time=30-00:00:00
## Memory per node
#SBATCH --mem=500G
##turn on e-mail notification
#SBATCH --mail-type=ALL
#SBATCH --mail-user=samwhite@uw.edu
## Specify the working directory for this job
#SBATCH --workdir=/gscratch/scrubbed/samwhite/20180405_sparseassembler_kmer101_geoduck

/gscratch/srlab/programs/SparseAssembler/SparseAssembler \
LD 0 \
NodeCovTh 1 \
EdgeCovTh 0 \
k 101 \
g 15 \
PathCovTh 100 \
GS 2200000000 \
i1 /gscratch/scrubbed/samwhite/20180129_trimgalore_geoduck_novaseq/AD002_S9_L001_R1_001_val_1_val_1.fastq \
i2 /gscratch/scrubbed/samwhite/20180129_trimgalore_geoduck_novaseq/AD002_S9_L001_R2_001_val_2_val_2.fastq \
i1 /gscratch/scrubbed/samwhite/20180129_trimgalore_geoduck_novaseq/AD002_S9_L002_R1_001_val_1_val_1.fastq \
i2 /gscratch/scrubbed/samwhite/20180129_trimgalore_geoduck_novaseq/AD002_S9_L002_R2_001_val_2_val_2.fastq \
i1 /gscratch/scrubbed/samwhite/20180129_trimgalore_geoduck_novaseq/NR005_S4_L001_R1_001_val_1_val_1.fastq \
i2 /gscratch/scrubbed/samwhite/20180129_trimgalore_geoduck_novaseq/NR005_S4_L001_R2_001_val_2_val_2.fastq \
i1 /gscratch/scrubbed/samwhite/20180129_trimgalore_geoduck_novaseq/NR005_S4_L002_R1_001_val_1_val_1.fastq \
i2 /gscratch/scrubbed/samwhite/20180129_trimgalore_geoduck_novaseq/NR005_S4_L002_R2_001_val_2_val_2.fastq \
i1 /gscratch/scrubbed/samwhite/20180129_trimgalore_geoduck_novaseq/NR006_S3_L001_R1_001_val_1_val_1.fastq \
i2 /gscratch/scrubbed/samwhite/20180129_trimgalore_geoduck_novaseq/NR006_S3_L001_R2_001_val_2_val_2.fastq \
i1 /gscratch/scrubbed/samwhite/20180129_trimgalore_geoduck_novaseq/NR006_S3_L002_R1_001_val_1_val_1.fastq \
i2 /gscratch/scrubbed/samwhite/20180129_trimgalore_geoduck_novaseq/NR006_S3_L002_R2_001_val_2_val_2.fastq \
i1 /gscratch/scrubbed/samwhite/20180129_trimgalore_geoduck_novaseq/NR012_S1_L001_R1_001_val_1_val_1.fastq \
i2 /gscratch/scrubbed/samwhite/20180129_trimgalore_geoduck_novaseq/NR012_S1_L001_R2_001_val_2_val_2.fastq \
i1 /gscratch/scrubbed/samwhite/20180129_trimgalore_geoduck_novaseq/NR012_S1_L002_R1_001_val_1_val_1.fastq \
i2 /gscratch/scrubbed/samwhite/20180129_trimgalore_geoduck_novaseq/NR012_S1_L002_R2_001_val_2_val_2.fastq \
i1 /gscratch/scrubbed/samwhite/20180129_trimgalore_geoduck_novaseq/NR013_AD013_S2_L001_R1_001_val_1_val_1.fastq \
i2 /gscratch/scrubbed/samwhite/20180129_trimgalore_geoduck_novaseq/NR013_AD013_S2_L001_R2_001_val_2_val_2.fastq \
i1 /gscratch/scrubbed/samwhite/20180129_trimgalore_geoduck_novaseq/NR013_AD013_S2_L002_R1_001_val_1_val_1.fastq \
i2 /gscratch/scrubbed/samwhite/20180129_trimgalore_geoduck_novaseq/NR013_AD013_S2_L002_R2_001_val_2_val_2.fastq \
i1 /gscratch/scrubbed/samwhite/20180129_trimgalore_geoduck_novaseq/NR014_AD014_S5_L001_R1_001_val_1_val_1.fastq \
i2 /gscratch/scrubbed/samwhite/20180129_trimgalore_geoduck_novaseq/NR014_AD014_S5_L001_R2_001_val_2_val_2.fastq \
i1 /gscratch/scrubbed/samwhite/20180129_trimgalore_geoduck_novaseq/NR014_AD014_S5_L002_R1_001_val_1_val_1.fastq \
i2 /gscratch/scrubbed/samwhite/20180129_trimgalore_geoduck_novaseq/NR014_AD014_S5_L002_R2_001_val_2_val_2.fastq \
i1 /gscratch/scrubbed/samwhite/20180129_trimgalore_geoduck_novaseq/NR015_AD015_S6_L001_R1_001_val_1_val_1.fastq \
i2 /gscratch/scrubbed/samwhite/20180129_trimgalore_geoduck_novaseq/NR015_AD015_S6_L001_R2_001_val_2_val_2.fastq \
i1 /gscratch/scrubbed/samwhite/20180129_trimgalore_geoduck_novaseq/NR015_AD015_S6_L002_R1_001_val_1_val_1.fastq \
i2 /gscratch/scrubbed/samwhite/20180129_trimgalore_geoduck_novaseq/NR015_AD015_S6_L002_R2_001_val_2_val_2.fastq \
i1 /gscratch/scrubbed/samwhite/20180129_trimgalore_geoduck_novaseq/NR019_S7_L001_R1_001_val_1_val_1.fastq \
i2 /gscratch/scrubbed/samwhite/20180129_trimgalore_geoduck_novaseq/NR019_S7_L001_R2_001_val_2_val_2.fastq \
i1 /gscratch/scrubbed/samwhite/20180129_trimgalore_geoduck_novaseq/NR019_S7_L002_R1_001_val_1_val_1.fastq \
i2 /gscratch/scrubbed/samwhite/20180129_trimgalore_geoduck_novaseq/NR019_S7_L002_R2_001_val_2_val_2.fastq \
i1 /gscratch/scrubbed/samwhite/20180129_trimgalore_geoduck_novaseq/NR021_S8_L001_R1_001_val_1_val_1.fastq \
i2 /gscratch/scrubbed/samwhite/20180129_trimgalore_geoduck_novaseq/NR021_S8_L001_R2_001_val_2_val_2.fastq \
i1 /gscratch/scrubbed/samwhite/20180129_trimgalore_geoduck_novaseq/NR021_S8_L002_R1_001_val_1_val_1.fastq \
i2 /gscratch/scrubbed/samwhite/20180129_trimgalore_geoduck_novaseq/NR021_S8_L002_R2_001_val_2_val_2.fastq \
i1 /gscratch/scrubbed/samwhite/bgi_geoduck/151114_I191_FCH3Y35BCXX_L1_wHAIPI023989-79_1.fastq \
i2 /gscratch/scrubbed/samwhite/bgi_geoduck/151114_I191_FCH3Y35BCXX_L1_wHAIPI023989-79_2.fastq \
i1 /gscratch/scrubbed/samwhite/bgi_geoduck/151114_I191_FCH3Y35BCXX_L2_wHAMPI023988-81_1.fastq \
i2 /gscratch/scrubbed/samwhite/bgi_geoduck/151114_I191_FCH3Y35BCXX_L2_wHAMPI023988-81_2.fastq \
i1 /gscratch/scrubbed/samwhite/bgi_geoduck/151122_I136_FCH3L2FBBXX_L7_wHAXPI023990-97_1.fastq \
i2 /gscratch/scrubbed/samwhite/bgi_geoduck/151122_I136_FCH3L2FBBXX_L7_wHAXPI023990-97_2.fastq \
i1 /gscratch/scrubbed/samwhite/bgi_geoduck/160103_I137_FCH3V5YBBXX_L3_WHPANwalDDAADWAAPEI-101_1.fastq \
i2 /gscratch/scrubbed/samwhite/bgi_geoduck/160103_I137_FCH3V5YBBXX_L3_WHPANwalDDAADWAAPEI-101_2.fastq \
i1 /gscratch/scrubbed/samwhite/bgi_geoduck/160103_I137_FCH3V5YBBXX_L4_WHPANwalDDAADWAAPEI-101_1.fastq \
i2 /gscratch/scrubbed/samwhite/bgi_geoduck/160103_I137_FCH3V5YBBXX_L4_WHPANwalDDAADWAAPEI-101_2.fastq \
i1 /gscratch/scrubbed/samwhite/bgi_geoduck/160103_I137_FCH3V5YBBXX_L5_WHPANwalDDABDLAAPEI-100_1.fastq \
i2 /gscratch/scrubbed/samwhite/bgi_geoduck/160103_I137_FCH3V5YBBXX_L5_WHPANwalDDABDLAAPEI-100_2.fastq \
i1 /gscratch/scrubbed/samwhite/bgi_geoduck/160103_I137_FCH3V5YBBXX_L5_WHPANwalDDACDTAAPEI-102_1.fastq \
i2 /gscratch/scrubbed/samwhite/bgi_geoduck/160103_I137_FCH3V5YBBXX_L5_WHPANwalDDACDTAAPEI-102_2.fastq \
i1 /gscratch/scrubbed/samwhite/bgi_geoduck/160103_I137_FCH3V5YBBXX_L6_WHPANwalDDABDLAAPEI-100_1.fastq \
i2 /gscratch/scrubbed/samwhite/bgi_geoduck/160103_I137_FCH3V5YBBXX_L6_WHPANwalDDABDLAAPEI-100_2.fastq \
i1 /gscratch/scrubbed/samwhite/bgi_geoduck/160103_I137_FCH3V5YBBXX_L6_WHPANwalDDACDTAAPEI-102_1.fastq \
i2 /gscratch/scrubbed/samwhite/bgi_geoduck/160103_I137_FCH3V5YBBXX_L6_WHPANwalDDACDTAAPEI-102_2.fastq \
i1 /gscratch/scrubbed/samwhite/illumina_geoduck_hiseq/20180328_trim_galore_illumina_hiseq_geoduck/Geoduck-NMP-gDNA-1_S1_L001_R1_001_val_1.fastq \
i2 /gscratch/scrubbed/samwhite/illumina_geoduck_hiseq/20180328_trim_galore_illumina_hiseq_geoduck/Geoduck-NMP-gDNA-1_S1_L001_R2_001_val_2.fastq \
i1 /gscratch/scrubbed/samwhite/illumina_geoduck_hiseq/20180328_trim_galore_illumina_hiseq_geoduck/Geoduck-NMP-gDNA-2_S5_L002_R1_001_val_1.fastq \
i2 /gscratch/scrubbed/samwhite/illumina_geoduck_hiseq/20180328_trim_galore_illumina_hiseq_geoduck/Geoduck-NMP-gDNA-2_S5_L002_R2_001_val_2.fastq \
i1 /gscratch/scrubbed/samwhite/illumina_geoduck_hiseq/20180328_trim_galore_illumina_hiseq_geoduck/Geoduck-NMP-gDNA-2to4kb-1_S2_L001_R1_001_val_1.fastq \
i2 /gscratch/scrubbed/samwhite/illumina_geoduck_hiseq/20180328_trim_galore_illumina_hiseq_geoduck/Geoduck-NMP-gDNA-2to4kb-1_S2_L001_R2_001_val_2.fastq \
i1 /gscratch/scrubbed/samwhite/illumina_geoduck_hiseq/20180328_trim_galore_illumina_hiseq_geoduck/Geoduck-NMP-gDNA-2to4kb-2_S6_L002_R1_001_val_1.fastq \
i2 /gscratch/scrubbed/samwhite/illumina_geoduck_hiseq/20180328_trim_galore_illumina_hiseq_geoduck/Geoduck-NMP-gDNA-2to4kb-2_S6_L002_R2_001_val_2.fastq \
i1 /gscratch/scrubbed/samwhite/illumina_geoduck_hiseq/20180328_trim_galore_illumina_hiseq_geoduck/Geoduck-NMP-gDNA-2to4kb-3_S10_L003_R1_001_val_1.fastq \
i2 /gscratch/scrubbed/samwhite/illumina_geoduck_hiseq/20180328_trim_galore_illumina_hiseq_geoduck/Geoduck-NMP-gDNA-2to4kb-3_S10_L003_R2_001_val_2.fastq \
i1 /gscratch/scrubbed/samwhite/illumina_geoduck_hiseq/20180328_trim_galore_illumina_hiseq_geoduck/Geoduck-NMP-gDNA-2to4kb-4_S14_L004_R1_001_val_1.fastq \
i2 /gscratch/scrubbed/samwhite/illumina_geoduck_hiseq/20180328_trim_galore_illumina_hiseq_geoduck/Geoduck-NMP-gDNA-2to4kb-4_S14_L004_R2_001_val_2.fastq \
i1 /gscratch/scrubbed/samwhite/illumina_geoduck_hiseq/20180328_trim_galore_illumina_hiseq_geoduck/Geoduck-NMP-gDNA-2to4kb-5_S18_L005_R1_001_val_1.fastq \
i2 /gscratch/scrubbed/samwhite/illumina_geoduck_hiseq/20180328_trim_galore_illumina_hiseq_geoduck/Geoduck-NMP-gDNA-2to4kb-5_S18_L005_R2_001_val_2.fastq \
i1 /gscratch/scrubbed/samwhite/illumina_geoduck_hiseq/20180328_trim_galore_illumina_hiseq_geoduck/Geoduck-NMP-gDNA-2to4kb-6_S22_L006_R1_001_val_1.fastq \
i2 /gscratch/scrubbed/samwhite/illumina_geoduck_hiseq/20180328_trim_galore_illumina_hiseq_geoduck/Geoduck-NMP-gDNA-2to4kb-6_S22_L006_R2_001_val_2.fastq \
i1 /gscratch/scrubbed/samwhite/illumina_geoduck_hiseq/20180328_trim_galore_illumina_hiseq_geoduck/Geoduck-NMP-gDNA-2to4kb-7_S26_L007_R1_001_val_1.fastq \
i2 /gscratch/scrubbed/samwhite/illumina_geoduck_hiseq/20180328_trim_galore_illumina_hiseq_geoduck/Geoduck-NMP-gDNA-2to4kb-7_S26_L007_R2_001_val_2.fastq \
i1 /gscratch/scrubbed/samwhite/illumina_geoduck_hiseq/20180328_trim_galore_illumina_hiseq_geoduck/Geoduck-NMP-gDNA-2to4kb-8_S30_L008_R1_001_val_1.fastq \
i2 /gscratch/scrubbed/samwhite/illumina_geoduck_hiseq/20180328_trim_galore_illumina_hiseq_geoduck/Geoduck-NMP-gDNA-2to4kb-8_S30_L008_R2_001_val_2.fastq \
i1 /gscratch/scrubbed/samwhite/illumina_geoduck_hiseq/20180328_trim_galore_illumina_hiseq_geoduck/Geoduck-NMP-gDNA-3_S9_L003_R1_001_val_1.fastq \
i2 /gscratch/scrubbed/samwhite/illumina_geoduck_hiseq/20180328_trim_galore_illumina_hiseq_geoduck/Geoduck-NMP-gDNA-3_S9_L003_R2_001_val_2.fastq \
i1 /gscratch/scrubbed/samwhite/illumina_geoduck_hiseq/20180328_trim_galore_illumina_hiseq_geoduck/Geoduck-NMP-gDNA-4_S13_L004_R1_001_val_1.fastq \
i2 /gscratch/scrubbed/samwhite/illumina_geoduck_hiseq/20180328_trim_galore_illumina_hiseq_geoduck/Geoduck-NMP-gDNA-4_S13_L004_R2_001_val_2.fastq \
i1 /gscratch/scrubbed/samwhite/illumina_geoduck_hiseq/20180328_trim_galore_illumina_hiseq_geoduck/Geoduck-NMP-gDNA-5_S17_L005_R1_001_val_1.fastq \
i2 /gscratch/scrubbed/samwhite/illumina_geoduck_hiseq/20180328_trim_galore_illumina_hiseq_geoduck/Geoduck-NMP-gDNA-5_S17_L005_R2_001_val_2.fastq \
i1 /gscratch/scrubbed/samwhite/illumina_geoduck_hiseq/20180328_trim_galore_illumina_hiseq_geoduck/Geoduck-NMP-gDNA-5to7kb-1_S3_L001_R1_001_val_1.fastq \
i2 /gscratch/scrubbed/samwhite/illumina_geoduck_hiseq/20180328_trim_galore_illumina_hiseq_geoduck/Geoduck-NMP-gDNA-5to7kb-1_S3_L001_R2_001_val_2.fastq \
i1 /gscratch/scrubbed/samwhite/illumina_geoduck_hiseq/20180328_trim_galore_illumina_hiseq_geoduck/Geoduck-NMP-gDNA-5to7kb-2_S7_L002_R1_001_val_1.fastq \
i2 /gscratch/scrubbed/samwhite/illumina_geoduck_hiseq/20180328_trim_galore_illumina_hiseq_geoduck/Geoduck-NMP-gDNA-5to7kb-2_S7_L002_R2_001_val_2.fastq \
i1 /gscratch/scrubbed/samwhite/illumina_geoduck_hiseq/20180328_trim_galore_illumina_hiseq_geoduck/Geoduck-NMP-gDNA-5to7kb-3_S11_L003_R1_001_val_1.fastq \
i2 /gscratch/scrubbed/samwhite/illumina_geoduck_hiseq/20180328_trim_galore_illumina_hiseq_geoduck/Geoduck-NMP-gDNA-5to7kb-3_S11_L003_R2_001_val_2.fastq \
i1 /gscratch/scrubbed/samwhite/illumina_geoduck_hiseq/20180328_trim_galore_illumina_hiseq_geoduck/Geoduck-NMP-gDNA-5to7kb-4_S15_L004_R1_001_val_1.fastq \
i2 /gscratch/scrubbed/samwhite/illumina_geoduck_hiseq/20180328_trim_galore_illumina_hiseq_geoduck/Geoduck-NMP-gDNA-5to7kb-4_S15_L004_R2_001_val_2.fastq \
i1 /gscratch/scrubbed/samwhite/illumina_geoduck_hiseq/20180328_trim_galore_illumina_hiseq_geoduck/Geoduck-NMP-gDNA-5to7kb-5_S19_L005_R1_001_val_1.fastq \
i2 /gscratch/scrubbed/samwhite/illumina_geoduck_hiseq/20180328_trim_galore_illumina_hiseq_geoduck/Geoduck-NMP-gDNA-5to7kb-5_S19_L005_R2_001_val_2.fastq \
i1 /gscratch/scrubbed/samwhite/illumina_geoduck_hiseq/20180328_trim_galore_illumina_hiseq_geoduck/Geoduck-NMP-gDNA-5to7kb-6_S23_L006_R1_001_val_1.fastq \
i2 /gscratch/scrubbed/samwhite/illumina_geoduck_hiseq/20180328_trim_galore_illumina_hiseq_geoduck/Geoduck-NMP-gDNA-5to7kb-6_S23_L006_R2_001_val_2.fastq \
i1 /gscratch/scrubbed/samwhite/illumina_geoduck_hiseq/20180328_trim_galore_illumina_hiseq_geoduck/Geoduck-NMP-gDNA-5to7kb-7_S27_L007_R1_001_val_1.fastq \
i2 /gscratch/scrubbed/samwhite/illumina_geoduck_hiseq/20180328_trim_galore_illumina_hiseq_geoduck/Geoduck-NMP-gDNA-5to7kb-7_S27_L007_R2_001_val_2.fastq \
i1 /gscratch/scrubbed/samwhite/illumina_geoduck_hiseq/20180328_trim_galore_illumina_hiseq_geoduck/Geoduck-NMP-gDNA-5to7kb-8_S31_L008_R1_001_val_1.fastq \
i2 /gscratch/scrubbed/samwhite/illumina_geoduck_hiseq/20180328_trim_galore_illumina_hiseq_geoduck/Geoduck-NMP-gDNA-5to7kb-8_S31_L008_R2_001_val_2.fastq \
i1 /gscratch/scrubbed/samwhite/illumina_geoduck_hiseq/20180328_trim_galore_illumina_hiseq_geoduck/Geoduck-NMP-gDNA-6_S21_L006_R1_001_val_1.fastq \
i2 /gscratch/scrubbed/samwhite/illumina_geoduck_hiseq/20180328_trim_galore_illumina_hiseq_geoduck/Geoduck-NMP-gDNA-6_S21_L006_R2_001_val_2.fastq \
i1 /gscratch/scrubbed/samwhite/illumina_geoduck_hiseq/20180328_trim_galore_illumina_hiseq_geoduck/Geoduck-NMP-gDNA-7_S25_L007_R1_001_val_1.fastq \
i2 /gscratch/scrubbed/samwhite/illumina_geoduck_hiseq/20180328_trim_galore_illumina_hiseq_geoduck/Geoduck-NMP-gDNA-7_S25_L007_R2_001_val_2.fastq \
i1 /gscratch/scrubbed/samwhite/illumina_geoduck_hiseq/20180328_trim_galore_illumina_hiseq_geoduck/Geoduck-NMP-gDNA-8_S29_L008_R1_001_val_1.fastq \
i2 /gscratch/scrubbed/samwhite/illumina_geoduck_hiseq/20180328_trim_galore_illumina_hiseq_geoduck/Geoduck-NMP-gDNA-8_S29_L008_R2_001_val_2.fastq \
i1 /gscratch/scrubbed/samwhite/illumina_geoduck_hiseq/20180328_trim_galore_illumina_hiseq_geoduck/Geoduck-NMP-gDNA-8to10kb-1_S4_L001_R1_001_val_1.fastq \
i2 /gscratch/scrubbed/samwhite/illumina_geoduck_hiseq/20180328_trim_galore_illumina_hiseq_geoduck/Geoduck-NMP-gDNA-8to10kb-1_S4_L001_R2_001_val_2.fastq \
i1 /gscratch/scrubbed/samwhite/illumina_geoduck_hiseq/20180328_trim_galore_illumina_hiseq_geoduck/Geoduck-NMP-gDNA-8to10kb-2_S8_L002_R1_001_val_1.fastq \
i2 /gscratch/scrubbed/samwhite/illumina_geoduck_hiseq/20180328_trim_galore_illumina_hiseq_geoduck/Geoduck-NMP-gDNA-8to10kb-2_S8_L002_R2_001_val_2.fastq \
i1 /gscratch/scrubbed/samwhite/illumina_geoduck_hiseq/20180328_trim_galore_illumina_hiseq_geoduck/Geoduck-NMP-gDNA-8to10kb-3_S12_L003_R1_001_val_1.fastq \
i2 /gscratch/scrubbed/samwhite/illumina_geoduck_hiseq/20180328_trim_galore_illumina_hiseq_geoduck/Geoduck-NMP-gDNA-8to10kb-3_S12_L003_R2_001_val_2.fastq \
i1 /gscratch/scrubbed/samwhite/illumina_geoduck_hiseq/20180328_trim_galore_illumina_hiseq_geoduck/Geoduck-NMP-gDNA-8to10kb-4_S16_L004_R1_001_val_1.fastq \
i2 /gscratch/scrubbed/samwhite/illumina_geoduck_hiseq/20180328_trim_galore_illumina_hiseq_geoduck/Geoduck-NMP-gDNA-8to10kb-4_S16_L004_R2_001_val_2.fastq \
i1 /gscratch/scrubbed/samwhite/illumina_geoduck_hiseq/20180328_trim_galore_illumina_hiseq_geoduck/Geoduck-NMP-gDNA-8to10kb-5_S20_L005_R1_001_val_1.fastq \
i2 /gscratch/scrubbed/samwhite/illumina_geoduck_hiseq/20180328_trim_galore_illumina_hiseq_geoduck/Geoduck-NMP-gDNA-8to10kb-5_S20_L005_R2_001_val_2.fastq \
i1 /gscratch/scrubbed/samwhite/illumina_geoduck_hiseq/20180328_trim_galore_illumina_hiseq_geoduck/Geoduck-NMP-gDNA-8to10kb-6_S24_L006_R1_001_val_1.fastq \
i2 /gscratch/scrubbed/samwhite/illumina_geoduck_hiseq/20180328_trim_galore_illumina_hiseq_geoduck/Geoduck-NMP-gDNA-8to10kb-6_S24_L006_R2_001_val_2.fastq \
i1 /gscratch/scrubbed/samwhite/illumina_geoduck_hiseq/20180328_trim_galore_illumina_hiseq_geoduck/Geoduck-NMP-gDNA-8to10kb-7_S28_L007_R1_001_val_1.fastq \
i2 /gscratch/scrubbed/samwhite/illumina_geoduck_hiseq/20180328_trim_galore_illumina_hiseq_geoduck/Geoduck-NMP-gDNA-8to10kb-7_S28_L007_R2_001_val_2.fastq \
i1 /gscratch/scrubbed/samwhite/illumina_geoduck_hiseq/20180328_trim_galore_illumina_hiseq_geoduck/Geoduck-NMP-gDNA-8to10kb-8_S32_L008_R1_001_val_1.fastq \
i2 /gscratch/scrubbed/samwhite/illumina_geoduck_hiseq/20180328_trim_galore_illumina_hiseq_geoduck/Geoduck-NMP-gDNA-8to10kb-8_S32_L008_R2_001_val_2.fastq
Results:

Output folder: 20180405_sparseassembler_kmer101_geoduck/

Slurm output files:

SparseAssembler Assembly (FASTA): Contigs.txt

Added this to the GitHub wiki for our geoduck genome assemblies.


Gunzip – BGI HiSeq Geoduck Genome Sequencing Data

In preparation to run SparseAssembler, I needed to gunzip the BGI gzipped FASTQ files from 20180327.

Ran the following slurm script on our Mox node:


#!/bin/bash
## Job Name
#SBATCH --job-name=20180405_geoduck_bgi_gunzip
## Allocation Definition
#SBATCH --account=srlab
#SBATCH --partition=srlab
## Resources
## Nodes (We only get 1, so this is fixed)
#SBATCH --nodes=1
## Walltime (days-hours:minutes:seconds format)
#SBATCH --time=30-00:00:00
## Memory per node
#SBATCH --mem=500G
##turn on e-mail notification
#SBATCH --mail-type=ALL
#SBATCH --mail-user=samwhite@uw.edu
## Specify the working directory for this job
#SBATCH --workdir=/gscratch/scrubbed/samwhite/bgi_geoduck

## Decompress each gzipped FASTQ into the working directory, leaving the original .gz files in place
for i in /gscratch/scrubbed/samwhite/bgi_geoduck/*.gz; do
    filename="${i##*/}"         # strip the directory path
    no_ext="${filename%%.*}"    # strip everything from the first "." onward
    gunzip < "$i" > "$no_ext".fastq
done
Results:

Completed in ~45mins. Will proceed with massive geoduck genome assembly!


Gunzip – Trimmed Illumina Geoduck HiSeq Genome Sequencing Data

In preparation to run SparseAssembler, I needed to gunzip the trimmed gzipped FASTQ files from 20180401.

Ran the following slurm script on our Mox node:


#!/bin/bash
## Job Name
#SBATCH --job-name=20180404_geoduck_gunzip
## Allocation Definition
#SBATCH --account=srlab
#SBATCH --partition=srlab
## Resources
## Nodes (We only get 1, so this is fixed)
#SBATCH --nodes=1
## Walltime (days-hours:minutes:seconds format)
#SBATCH --time=30-00:00:00
## Memory per node
#SBATCH --mem=500G
##turn on e-mail notification
#SBATCH --mail-type=ALL
#SBATCH --mail-user=samwhite@uw.edu
## Specify the working directory for this job
#SBATCH --workdir=/gscratch/scrubbed/samwhite/illumina_geoduck_hiseq/20180328_trim_galore_illumina_hiseq_geoduck

## Decompress each gzipped FASTQ into the working directory, leaving the original .gz files in place
for i in /gscratch/scrubbed/samwhite/illumina_geoduck_hiseq/20180328_trim_galore_illumina_hiseq_geoduck/*.gz; do
    filename="${i##*/}"         # strip the directory path
    no_ext="${filename%%.*}"    # strip everything from the first "." onward
    gunzip < "$i" > "$no_ext".fastq
done
Results:

This crashed shortly after initiating the run (~30mins later). Received the following email notification:

SLURM Job_id=155940 Name=20180404_geoduck_gunzip Failed, Run time 00:30:40, NODE_FAIL

It did not generate a slurm output file, nor any gunzipped files. Will contact UW IT…

UPDATE 20180404

Weird, about an hour after this crashed, I received the following email, indicating the job had been resubmitted (I did not resubmit, btw):

SLURM Job_id=155940 Name=20180404_geoduck_gunzip Began, Queued time 00:02:29

Completed about 3hrs later.
