Tag Archives: FASTQC

TrimGalore/FastQC/MultiQC – Trim 10bp 5’/3′ ends C.virginica MBD BS-seq FASTQ data

Steven found out that the Bismarck documentation (Bismarck is the bisulfite aligner we use in our BS-seq pipeline) suggests trimming 10bp from both the 5′ and 3′ ends. Since this is the next step in our pipeline, we figured we should probably just follow their recommendations!

TrimGalore job script:

Standard error was redirected on the command line to this file:

MD5 checksums were generated on the resulting trimmed FASTQ files:

All data was copied to my folder on Owl.

Checksums for FASTQ files were verified post-data transfer (data not shown).

Results:

Output folder:

FastQC output folder:

MultiQC output folder:

MultiQC HTML report:

Hey! Look at that! Everything is much better! Thanks for the excellent documentation and suggestions, Bismarck!

Share

TrimGalore/FastQC/MultiQC – 2bp 3′ end Read 1s Trim C.virginica MBD BS-seq FASTQ data

Earlier today, I ran TrimGalore/FastQC/MultiQC on the Crassostrea virginica MBD BS-seq data from ZymoResearch and hard trimmed the first 14bp from each read. Things looked better at the 5′ end, but the 3′ end of each of the READ1 seqs showed a wonky 2bp blip, so decided to trim that off.

I ran TrimGalore (using the built-in FastQC option), with a hard trim of the last 2bp of each first read set that had previously had the 14bp hard trim and followed up with MultiQC for a summary of the FastQC reports.

TrimGalore job script:

Standard error was redirected on the command line to this file:

MD5 checksums were generated on the resulting trimmed FASTQ files:

All data was copied to my folder on Owl.

Checksums for FASTQ files were verified post-data transfer (data not shown).

Results:

Output folder:

FastQC output folder:

MultiQC output folder:

MultiQC HTML report:

Well, this is a bit strange, but the 2bp trimming on the read 1s looks fine, but now the read 2s are weird in the same region!

Regardless, while this was running, Steven found out that the Bismarck documentation (Bismarck is the bisulfite aligner we use in our BS-seq pipeline) suggests trimming 10bp from both the 5′ and 3′ ends. So, maybe this was all moot. I’ll go ahead and re-run this following the Bismark recommendations.

Share

TrimGalore/FastQC/MultiQC – 14bp Trim C.virginica MBD BS-seq FASTQ data

Yesterday, I ran TrimGalore/FastQC/MultiQC on the Crassostrea virginica MBD BS-seq data from ZymoResearch with the default settings (i.e. “auto-trim”). There was still some variability in the first ~15bp of the reads and Steven wanted to see how a hard trim would change things.

I ran TrimGalore (using the built-in FastQC option), with a hard trim of the first 14bp of each read and followed up with MultiQC for a summary of the FastQC reports.

TrimGalore job script:

Standard error was redirected on the command line to this file:

MD5 checksums were generated on the resulting trimmed FASTQ files:

All data was copied to my folder on Owl.

Checksums for FASTQ files were verified post-data transfer (data not shown).

Results:

Output folder:

FastQC output folder:

MultiQC output folder:

MultiQC HTML report:

OK, this trimming definitely took care of the variability seen in the first ~15bp of all the reads.

However, I noticed that the last 2bp of each of the Read 1 seqs all have some wonky stuff going on. I’m guessing I should probably trim that stuff off, too…

Share

TrimGalore/FastQC/MultiQC – Auto-trim C.virginica MBD BS-seq FASTQ data

Yesterday, I ran FastQC/MultiQC on the Crassostrea virginica MBD BS-seq data from ZymoResearch. Steven wanted to trim it and see how things turned out.

I ran TrimGalore (using the built-in FastQC option) and followed up with MultiQC for a summary of the FastQC reports.

TrimGalore job script:

Standard error was redirected on the command line to this file:

MD5 checksums were generated on the resulting trimmed FASTQ files:

All data was copied to my folder on Owl.

Checksums for FASTQ files were verified post-data transfer.

Results:

Output folder:

FastQC output folder:

MultiQC output folder:

MultiQC HTML report:

Overall, the auto-trim didn’t alter things too much. Specifically, Steven is concerned about the variability in the first 15bp (seen in the Per Base Sequence Content section of the MultiQC output). It was reduced, but not greatly. Will perform an independent run of TrimGalore and employ a hard trim of the first 14bp of each read and see how that looks.

Share

FastQC/MultiQC – C. virginica MBD BS-seq Data

Per Steven’s GitHub Issues request, I ran FastQC on the Eastern oyster MBD bisulfite sequencing data we recently got back from ZymoResearch.

Ran FastQC locally with the following script: 20180409_fastqc_Cvirginica_MBD.sh


#!/bin/bash
/home/sam/software/FastQC/fastqc 
--threads 18 
--outdir /home/sam/20180409_fastqc_Cvirginica_MBD 
/mnt/owl/nightingales/C_virginica/zr2096_10_s1_R1.fastq.gz 
/mnt/owl/nightingales/C_virginica/zr2096_10_s1_R2.fastq.gz 
/mnt/owl/nightingales/C_virginica/zr2096_1_s1_R1.fastq.gz 
/mnt/owl/nightingales/C_virginica/zr2096_1_s1_R2.fastq.gz 
/mnt/owl/nightingales/C_virginica/zr2096_2_s1_R1.fastq.gz 
/mnt/owl/nightingales/C_virginica/zr2096_2_s1_R2.fastq.gz 
/mnt/owl/nightingales/C_virginica/zr2096_3_s1_R1.fastq.gz 
/mnt/owl/nightingales/C_virginica/zr2096_3_s1_R2.fastq.gz 
/mnt/owl/nightingales/C_virginica/zr2096_4_s1_R1.fastq.gz 
/mnt/owl/nightingales/C_virginica/zr2096_4_s1_R2.fastq.gz 
/mnt/owl/nightingales/C_virginica/zr2096_5_s1_R1.fastq.gz 
/mnt/owl/nightingales/C_virginica/zr2096_5_s1_R2.fastq.gz 
/mnt/owl/nightingales/C_virginica/zr2096_6_s1_R1.fastq.gz 
/mnt/owl/nightingales/C_virginica/zr2096_6_s1_R2.fastq.gz 
/mnt/owl/nightingales/C_virginica/zr2096_7_s1_R1.fastq.gz 
/mnt/owl/nightingales/C_virginica/zr2096_7_s1_R2.fastq.gz 
/mnt/owl/nightingales/C_virginica/zr2096_8_s1_R1.fastq.gz 
/mnt/owl/nightingales/C_virginica/zr2096_8_s1_R2.fastq.gz 
/mnt/owl/nightingales/C_virginica/zr2096_9_s1_R1.fastq.gz 
/mnt/owl/nightingales/C_virginica/zr2096_9_s1_R2.fastq.gz

MultiQC was then run on the FastQC output files.

All files were moved to Owl after the jobs completed.

Results:

FastQC Output folder: 20180409_fastqc_Cvirginica_MBD/

MultiQC Output folder: 20180409_fastqc_Cvirginica_MBD/multiqc_data/

MultiQC report (HTML): 20180409_fastqc_Cvirginica_MBD/multiqc_data/multiqc_report.html

Everything looks good to me.

Steven’s interested in seeing what the trimmed output would look like (and, how it would impact mapping efficiencies). Will initiate trimming.

See the GitHub issue linked above for the full discussion.

Share

TrimGalore!/FastQC/MultiQC – Illumina HiSeq Genome Sequencing Data Continued

The previous attempt at this was interrupted by a random glitch with our Mox HPC node.

I removed the last files processed by TrimGalore!, just in case they were incomplete. I updated the slurm script to process only the remaining files that had not been processed when the Mox glitch happened (including the files I deemed “incomplete”).

As in the initial run, I kept the option in TrimGalore! to automatically run FastQC on the trimmed output files.

TrimGalore! slurm script: 20180401_trim_galore_illumina_geoduck_hiseq_slurm.sh

MultiQC was run locally once the files were copied to Owl.

Results:

Job completed on 20180404.

Trimmed FASTQs: 20180328_trim_galore_illumina_hiseq_geoduck/

MD5 checksums: 20180328_trim_galore_illumina_hiseq_geoduck/checksums.md5

  • MD5 checksums were generated on Mox node and verified after copying to Owl.

Slurm output file: 20180401_trim_galore_illumina_geoduck_hiseq_slurm.sh

TrimGalore! output: 20180328_trim_galore_illumina_hiseq_geoduck/20180404_trimgalore_reports/

FastQC output: 20180328_trim_galore_illumina_hiseq_geoduck/20180328_fastqc_trimmed_hiseq_geoduck/

MultiQC output: 20180328_trim_galore_illumina_hiseq_geoduck/20180328_fastqc_trimmed_hiseq_geoduck/multiqc_data/

MultiQC HTML report: 20180328_trim_galore_illumina_hiseq_geoduck/20180328_fastqc_trimmed_hiseq_geoduck/multiqc_data/multiqc_report.html

Trimming completed and the FastQC results look much better than before.

Will proceed with full-blown assembly!

Share

FastQC/MultiQC – Illumina HiSeq Genome Sequencing Data

Since running SparseAssembler seems to be working and actually able to produce assemblies, I’ve decided I’ll try to beef up the geoduck genome assembly with the rest of our existing genomic sequencing data.

Yesterday, I transferred our BGI geoduck data to our Mox node and ran it through FASTQC

I transferred our Illumina HiSeq data sets (*NMP*.fastq.gz) to our Mox node (/gscratch/scrubbed/samwhite/illumina_geoduck_hiseq). These were part of the Illumina-sponsored sequencing project.

I verified the MD5 checksums (not documented) and then ran FASTQC, followed by MultiQC.

FastQC slurm script: 20180328_fastqc_illumina_geoduck_hiseq_slurm.sh

This was followed with MultiQC (locally, after copying the the FastQC output to Owl).

Results:

FASTQC output: 20180328_illumina_hiseq_geoduck_fastqc

MultiQC output: 20180328_illumina_hiseq_geoduck_fastqc/multiqc_data

MultiQC HTML report: 20180328_illumina_hiseq_geoduck_fastqc/multiqc_data/multiqc_report.html

Well, lots of fails. I high level of “Per Base N Content” (these are only warnings, but we also haven’t received data with these warnings before). Also, they all fail in the “Overrepresented sequences” analysis.

I’ll run these through TrimGalore! (probably twice), and see how things change.

Share

TrimGalore!/FastQC/MultiQC – Illumina HiSeq Genome Sequencing Data

Previous FastQC/MultiQC analysis of the geoduck Illumina HiSeq data (NMP.fastq.gz files) revealed a high level of overrepresented sequences, high levels of Per Base N Content, failure of Per Sequence GC Content, and a few other bad things.

Ran these through TrimGalore! on our Mox HPC node.

Added an option in TrimGalore! to automatically run FastQC on the trimmed output files.

TrimGalore! slurm script: 20180328_trim_galore_illumina_geoduck_hiseq_slurm.sh

Results:

Slurm output file: slurm-153098.out

I received a job status email on 20180330:

SLURM Job_id=153098 Name=20180328_trim_galore_geoduck_hiseq Failed, Run time 1-17:22:47, FAILED, ExitCode 141

The slurm output file didn’t indicate any errors, so I restarted the job and contacted UW IT to see if I could get more info.

UPDATE

Here’s their response:

04/02/2018 9:13 AM PDT – Matt

Hi Sam,

Your job died because of a networking hiccup that caused GPFS (/gscratch filesystem and such) to expel the node from the GPFS cluster. It’s a symptom of a known ongoing network issue that we’re actively working with Lenovo/Intel/IBM. Things like this aren’t happening super frequently, but enough that we recognized something was wrong and started investigating with vendors. Unfortunately, your job was unlucky and got bitten by it.

So, in short, you or your job didn’t do anything wrong. If you haven’t already (and if it is possible for your use case), we would highly recommend building in some sort of periodic state-preserving behavior (and a method to “resume”) into your longer-running jobs. Jobs can unexpectedly die for any number of reasons, and it is nice not to lose days of compute progress when that happens.

-Matt

Well, okay then.

Share

FastQC/MultiQC – BGI Geoduck Genome Sequencing Data

Since running SparseAssembler seems to be working and actually able to produce assemblies, I’ve decided I’ll try to beef up the geoduck genome assembly with the rest of our existing genomic sequencing data.

I transferred our BGI geoduck FASTQ files to our Mox node (/gscratch/scrubbed/samwhite/bgi_geoduck/).

I ran FASTQC on them to actually check them out and see if they needed any trimming, as I don’t believe this has been done!

FASTQC slurm script: 20180327_fastqc_bgi_geoduck_slurm.sh

Side note: Initial FASTQC failed on one file. Turns out, it got corrupted during transfer! Serves as good reminder about the importance of verifying MD5 checksums after file transfer, prior to attempting to work with files!

This was followed up with MultiQC (run locally from my computer on the files hosted on Owl). This was performed the following day (20180328).

Results:

FASTQC output: 20180327_bgi_fastqc

MultiQC output: 20180328_bgi_multiqc

MultiQC HTML report: 20180328_bgi_multiqc/multiqc_report.html

Everything looks nice and clean! Waiting on transfer and FASTQC of Illumina NMP data before proceeding to next assembly attempt.

Share

Adapter Trimming and FASTQC – Illumina Geoduck Novaseq Data

We would like to get an assembly of the geoduck NovaSeq data that Illumina provided us with.

Steven previously ran the raw data through FASTQC and there was a significant amount of adapter contamination (up to 44% in some libraries) present (see his FASTQC report here).

So, I trimmed them using TrimGalore and re-ran FASTQC on them.

This required two rounds of trimming using the “auto-detect” feature of Trim Galore.

  • Round 1: remove NovaSeq adapters
  • Round 2: remove standard Illumina adapters

See Jupyter notebook below for the gritty details.

Results:

All data for this NovaSeq assembly project can be found here: http://owl.fish.washington.edu/Athaliana/20180125_geoduck_novaseq/.

Round 1 Trim Galore reports: [20180125_trim_galore_reports/](http://owl.fish.washington.edu/Athaliana/20180125_geoduck_novaseq/20180125_trim_galore_reports/]
Round 1 FASTQC: 20180129_trimmed_multiqc_fastqc_01
Round 1 FASTQC MultiQC overview: 20180129_trimmed_multiqc_fastqc_01/multiqc_report.html


 
 
 
 
 
 

Round 2 Trim Galore reports: 20180125_geoduck_novaseq/20180205_trim_galore_reports/
Round 2 FASTQC: 20180205_trimmed_fastqc_02/
Round 2 FASTQC MultiQC overview: 20180205_trimmed_multiqc_fastqc_02/multiqc_report.html

 
 
 
 
 
 

For the astute observer, you might notice the “Per Base Sequence Content” generates a “Fail” warning for all samples. Per the FASTQC help, this is likely expected (due to the fact that NovaSeq libraries are prepared using transposases) and doesn’t have any downstream impacts on analyses.

 
 
 
 
 
 

Jupyter Notebook (GitHub): 20180125_roadrunner_trimming_geoduck_novaseq.ipynb

Share