# FastQC/MultiQC – Illumina HiSeq Genome Sequencing Data

Since running SparseAssembler seems to be working and actually able to produce assemblies, I’ve decided I’ll try to beef up the geoduck genome assembly with the rest of our existing genomic sequencing data.

Yesterday, I transferred our BGI geoduck data to our Mox node and ran it through FASTQC

I transferred our Illumina HiSeq data sets (*NMP*.fastq.gz) to our Mox node (/gscratch/scrubbed/samwhite/illumina_geoduck_hiseq). These were part of the Illumina-sponsored sequencing project.

I verified the MD5 checksums (not documented) and then ran FASTQC, followed by MultiQC.

FastQC slurm script: 20180328_fastqc_illumina_geoduck_hiseq_slurm.sh

This was followed with MultiQC (locally, after copying the the FastQC output to Owl).

##### Results:

FASTQC output: 20180328_illumina_hiseq_geoduck_fastqc

MultiQC output: 20180328_illumina_hiseq_geoduck_fastqc/multiqc_data

MultiQC HTML report: 20180328_illumina_hiseq_geoduck_fastqc/multiqc_data/multiqc_report.html

Well, lots of fails. I high level of “Per Base N Content” (these are only warnings, but we also haven’t received data with these warnings before). Also, they all fail in the “Overrepresented sequences” analysis.

I’ll run these through TrimGalore! (probably twice), and see how things change.

# TrimGalore!/FastQC/MultiQC – Illumina HiSeq Genome Sequencing Data

Previous FastQC/MultiQC analysis of the geoduck Illumina HiSeq data (NMP.fastq.gz files) revealed a high level of overrepresented sequences, high levels of Per Base N Content, failure of Per Sequence GC Content, and a few other bad things.

Ran these through TrimGalore! on our Mox HPC node.

Added an option in TrimGalore! to automatically run FastQC on the trimmed output files.

TrimGalore! slurm script: 20180328_trim_galore_illumina_geoduck_hiseq_slurm.sh

#### Results:

Slurm output file: slurm-153098.out

I received a job status email on 20180330:

SLURM Job_id=153098 Name=20180328_trim_galore_geoduck_hiseq Failed, Run time 1-17:22:47, FAILED, ExitCode 141

The slurm output file didn’t indicate any errors, so I restarted the job and contacted UW IT to see if I could get more info.

##### UPDATE

Here’s their response:

04/02/2018 9:13 AM PDT – Matt

Hi Sam,

Your job died because of a networking hiccup that caused GPFS (/gscratch filesystem and such) to expel the node from the GPFS cluster. It’s a symptom of a known ongoing network issue that we’re actively working with Lenovo/Intel/IBM. Things like this aren’t happening super frequently, but enough that we recognized something was wrong and started investigating with vendors. Unfortunately, your job was unlucky and got bitten by it.

So, in short, you or your job didn’t do anything wrong. If you haven’t already (and if it is possible for your use case), we would highly recommend building in some sort of periodic state-preserving behavior (and a method to “resume”) into your longer-running jobs. Jobs can unexpectedly die for any number of reasons, and it is nice not to lose days of compute progress when that happens.

-Matt

Well, okay then.

# FastQC/MultiQC – BGI Geoduck Genome Sequencing Data

Since running SparseAssembler seems to be working and actually able to produce assemblies, I’ve decided I’ll try to beef up the geoduck genome assembly with the rest of our existing genomic sequencing data.

I transferred our BGI geoduck FASTQ files to our Mox node (/gscratch/scrubbed/samwhite/bgi_geoduck/).

I ran FASTQC on them to actually check them out and see if they needed any trimming, as I don’t believe this has been done!

FASTQC slurm script: 20180327_fastqc_bgi_geoduck_slurm.sh

Side note: Initial FASTQC failed on one file. Turns out, it got corrupted during transfer! Serves as good reminder about the importance of verifying MD5 checksums after file transfer, prior to attempting to work with files!

This was followed up with MultiQC (run locally from my computer on the files hosted on Owl). This was performed the following day (20180328).

##### Results:

FASTQC output: 20180327_bgi_fastqc

MultiQC output: 20180328_bgi_multiqc

MultiQC HTML report: 20180328_bgi_multiqc/multiqc_report.html

Everything looks nice and clean! Waiting on transfer and FASTQC of Illumina NMP data before proceeding to next assembly attempt.

# Adapter Trimming and FASTQC – Illumina Geoduck Novaseq Data

We would like to get an assembly of the geoduck NovaSeq data that Illumina provided us with.

Steven previously ran the raw data through FASTQC and there was a significant amount of adapter contamination (up to 44% in some libraries) present (see his FASTQC report here).

So, I trimmed them using TrimGalore and re-ran FASTQC on them.

This required two rounds of trimming using the “auto-detect” feature of Trim Galore.

• Round 1: remove NovaSeq adapters
• Round 2: remove standard Illumina adapters

See Jupyter notebook below for the gritty details.

##### Results:

All data for this NovaSeq assembly project can be found here: http://owl.fish.washington.edu/Athaliana/20180125_geoduck_novaseq/.

Round 1 Trim Galore reports: [20180125_trim_galore_reports/](http://owl.fish.washington.edu/Athaliana/20180125_geoduck_novaseq/20180125_trim_galore_reports/]
Round 1 FASTQC: 20180129_trimmed_multiqc_fastqc_01
Round 1 FASTQC MultiQC overview: 20180129_trimmed_multiqc_fastqc_01/multiqc_report.html

Round 2 Trim Galore reports: 20180125_geoduck_novaseq/20180205_trim_galore_reports/
Round 2 FASTQC: 20180205_trimmed_fastqc_02/
Round 2 FASTQC MultiQC overview: 20180205_trimmed_multiqc_fastqc_02/multiqc_report.html

For the astute observer, you might notice the “Per Base Sequence Content” generates a “Fail” warning for all samples. Per the FASTQC help, this is likely expected (due to the fact that NovaSeq libraries are prepared using transposases) and doesn’t have any downstream impacts on analyses.

# FASTQC – Oly BGI GBS Raw Illumina Data Demultiplexed

Last week, I ran the two raw FASTQ files through FastQC. As expected, FastQC detected “errors”. These errors are due to the presence of adapter sequences, barcodes, and the use of a restriction enzyme (ApeKI) in library preparation. In summary, it’s not surprising that FastQC was not please with the data because it’s expecting a “standard” library prep that’s already been trimmed and demultiplexed.

However, just for comparison, I ran the demultiplexed files through FastQC. The Jupyter notebook is linked (GitHub) and embedded below. I recommend viewing the Jupyter notebook on GitHub for easier viewing.

Results:

Pretty much the same, but with slight improvements due to removal of adapter and barcode sequences. The restriction site still leads to FastQC to report errors, which is expected.

Links to all of the FastQC output files are linked at the bottom of the notebook.

Jupyter notebook (GitHub): 20170306_docker_fastqc_demultiplexed_bgi_oly_gbs.ipynb

# FASTQC – Oly BGI GBS Raw Illumina Data

In getting things prepared for the manuscript we’re writing about the Olympia oyster genotype-by-sequencing data from BGI, I felt we needed to provide a FastQC analysis of the raw data (since these two files are what we submitted to the NCBI short read archive) to provide support for the Technical Validation section of the manuscript.

Below, is the Jupyter notebook I used to run the FastQC analysis on the two files. I’ve embedded for quick viewing, but it might be easier to view the notebook via the GitHub link.

Results:

Well, I realized that running FastQC on the raw data might not reveal anything all too helpful. The reason for this is that the adaptor and barcode sequences are still present on all the reads. This will lead to over-representation of these sequences in all of the samples, which, in turn, will skew FastQC’s intepretation of the read qualities. For comparison, I’ll run FastQC on the demultiplexed data provided by BGI and see what the FastQC report looks like on trimmed data.

However, I’ll need to discuss with Steven about whether or not providing the FastQC analysis is worthwhile as part of the “technical validation” aspect of the manuscript. I guess it can’t hurt to provide it, but I’m not entirely sure that the FastQC report provides any real information regarding the quality of the sequencing reads that we received…

Jupyter notebook (GitHub): 20170301_docker_fastqc_nondemultiplexed_bgi_oly_gbs.ipynb

# Data check on Oly BS-Seq samples

For the 12 samples

Select 4 samples from 1NF gel take 2
Select 4 samples from 2NF gel take 2

Select 2 from gel take 2 Lotterhos
M1
M2
M3

Select 2 from the following sent to Katie (do not have to run on gel)
NF2 14
NF2 6
NF2 18
NF2 15
NF2 17

## Short term will just check out the first 8.¶

These are samples outplanted at Oyster Bay and Fidalgo, and in both cases parents from Fidalgo.

The hypothesis is that Epigenetic pattern will differ (and we can attribute to Environment)

## Quick look at raw data¶

Sequencing Platform: Illumina HiSeq2500

Read Type/Length: 50bp single-end, single index

11_GGCTAC_L001_R1_001.fastq.gz    10933121

12_CTTGTA_L001_R1_001.fastq.gz    10816647

1_ATCACG_L001_R1_001.fastq.gz    9402890

2_CGATGT_L001_R1_001.fastq.gz    11954873

3_TTAGGC_L001_R1_001.fastq.gz    11817358

4_TGACCA_L001_R1_001.fastq.gz    11606618

5_ACAGTG_L001_R1_001.fastq.gz    12589609

6_GCCAAT_L001_R1_001.fastq.gz    12489766

7_CAGATC_L001_R1_001.fastq.gz    10295293

8_ACTTGA_L001_R1_001.fastq.gz    14374642

## Unzip¶

In [1]:
cd /Volumes/Histidine/hectocotylus/whole-BS-01

/Volumes/Histidine/hectocotylus/whole-BS-01

In [2]:
%%bash
for f in *.gz
do
STEM=$(basename "${f}" .gz)
gunzip -c "${f}" > /Volumes/Histidine/hectocotylus/whole-BS-01/fq/"${STEM}"
done


## FastQC¶

In [3]:
!/Applications/bioinfo/FastQC/fastqc
-o /Volumes/Histidine/hectocotylus/whole-BS-01/fq/
-t 4
/Volumes/Histidine/hectocotylus/whole-BS-01/fq/*

Started analysis of 1_ATCACG_L001_R1_001.fastq
Started analysis of 2_CGATGT_L001_R1_001.fastq
Started analysis of 3_TTAGGC_L001_R1_001.fastq
...


this unusual pattern seem to hold true..

In [ ]:

# Bioinformatics – Trimmomatic/FASTQC on C.gigas Larvae OA NGS Data

Previously trimmed the first 39 bases of sequence from reads from the BS-Seq data in an attempt to improve our ability to map the reads back to the C.gigas genome. However, Mac (and Steven) noticed that the last ~10 bases of all the reads showed a steady increase in the %G, suggesting some sort of bias (maybe adaptor??):

Although I didn’t mention this previously, the figure above also shows an odd “waves” pattern that repeats in all bases except for G. Not sure what to think of that…

Quick summary of actions taken (specifics are available in Jupyter notebook below):

• Trim first 39 bases from all reads in all raw sequencing files.
• Trim last 10 bases from all reads in raw sequencing files
• Concatenate the two sets of reads (400ppm and 1000ppm treatments) into single FASTQ files for Steven to work with.

Raw sequencing files:

Notebook Viewer: 20150521_Cgigas_larvae_OA_Trimmomatic_FASTQC

Jupyter (IPython) notebook: 20150521_Cgigas_larvae_OA_Trimmomatic_FASTQC.ipynb

### Output files

Trimmed, concatenated FASTQ files
20150521_trimmed_2212_lane2_400ppm_GCCAAT.fastq.gz
20150521_trimmed_2212_lane2_1000ppm_CTTGTA.fastq.gz

Example of FASTQC analysis pre-trim:

Example FASTQC post-trim (from 400ppm data):

Trimming has removed the intended bad stuff (inconsistent sequence in the first 39 bases and rise in %G in the last 10 bases). Sequences are ready for further analysis for Steven.

However, we still see the “waves” pattern with the T, A and C. Additionally, we still don’t know what caused the weird inconsistencies, nor what sequence is contained therein that might be leading to that. Will contact the sequencing facility to see if they have any insight.

# Bioinformatics – Trimmomatic/FASTQC on C.gigas Larvae OA NGS Data

In another troubleshooting attempt for this problematic BS-seq Illumina data, I’m going to use Trimmomatic to remove the first 39 bases of each read. This is due to the fact that even after the previous quality trimming with Trimmomatic, the first 39 bases still showed inconsistent quality:

Ran Trimmomatic on just a single data set to try things out: 2212_lane2_CTTGTA_L002_R1_001.fastq.gz

Notebook Viewer: 20150506_Cgigas_larvae_OA_trimmomatic_FASTQC

Jupyter (IPython) notebook: 20150506_Cgigas_larvae_OA_trimmomatic_FASTQC.ipynb

Results:

Trimmed FASTQ: 20150506_trimmed_2212_lane2_CTTGTA_L002_R1_001.fastq.gz

You can see how flat the newly trimmed data is (which is what one would expect).

Steven will take this trimmed dataset and try additional mapping with it to see if removal of the first 39 bases will improve the mapping.

# Quality Trimming – C.gigas Larvae OA BS-Seq Data

Jupyter (IPython) Notebook: 20150414_C_gigas_Larvae_OA_Trimmomatic_FASTQC.ipynb