# Bioinformatics – Trimmomatic/FASTQC on C.gigas Larvae OA NGS Data

Previously trimmed the first 39 bases of sequence from reads from the BS-Seq data in an attempt to improve our ability to map the reads back to the C.gigas genome. However, Mac (and Steven) noticed that the last ~10 bases of all the reads showed a steady increase in the %G, suggesting some sort of bias (maybe adaptor??):

Although I didn’t mention this previously, the figure above also shows an odd “waves” pattern that repeats in all bases except for G. Not sure what to think of that…

Quick summary of actions taken (specifics are available in Jupyter notebook below):

• Trim first 39 bases from all reads in all raw sequencing files.
• Trim last 10 bases from all reads in raw sequencing files
• Concatenate the two sets of reads (400ppm and 1000ppm treatments) into single FASTQ files for Steven to work with.

Raw sequencing files:

Notebook Viewer: 20150521_Cgigas_larvae_OA_Trimmomatic_FASTQC

Jupyter (IPython) notebook: 20150521_Cgigas_larvae_OA_Trimmomatic_FASTQC.ipynb

### Output files

Trimmed, concatenated FASTQ files
20150521_trimmed_2212_lane2_400ppm_GCCAAT.fastq.gz
20150521_trimmed_2212_lane2_1000ppm_CTTGTA.fastq.gz

Example of FASTQC analysis pre-trim:

Example FASTQC post-trim (from 400ppm data):

Trimming has removed the intended bad stuff (inconsistent sequence in the first 39 bases and rise in %G in the last 10 bases). Sequences are ready for further analysis for Steven.

However, we still see the “waves” pattern with the T, A and C. Additionally, we still don’t know what caused the weird inconsistencies, nor what sequence is contained therein that might be leading to that. Will contact the sequencing facility to see if they have any insight.

# Bioinformatics – Trimmomatic/FASTQC on C.gigas Larvae OA NGS Data

In another troubleshooting attempt for this problematic BS-seq Illumina data, I’m going to use Trimmomatic to remove the first 39 bases of each read. This is due to the fact that even after the previous quality trimming with Trimmomatic, the first 39 bases still showed inconsistent quality:

Ran Trimmomatic on just a single data set to try things out: 2212_lane2_CTTGTA_L002_R1_001.fastq.gz

Notebook Viewer: 20150506_Cgigas_larvae_OA_trimmomatic_FASTQC

Jupyter (IPython) notebook: 20150506_Cgigas_larvae_OA_trimmomatic_FASTQC.ipynb

Results:

Trimmed FASTQ: 20150506_trimmed_2212_lane2_CTTGTA_L002_R1_001.fastq.gz

You can see how flat the newly trimmed data is (which is what one would expect).

Steven will take this trimmed dataset and try additional mapping with it to see if removal of the first 39 bases will improve the mapping.

# BLAST – C.gigas Larvae OA Illumina Data Against GenBank nt DB

In an attempt to figure out what’s going on with the Illumina data we recently received for these samples, I BLASTed the 400ppm data set that had previously been de-novo assembled by Steven: EmmaBS400.fa.

Jupyter (IPython) Notebook : 20150501_Cgigas_larvae_OA_BLASTn_nt.ipynb

Notebook Viewer : 20150501_Cgigas_larvae_OA_BLASTn_nt

Results:

BLASTn Output File: 20150501_nt_blastn.tab

BLAST e-vals <= 0.001: 20150501_Cgigas_larvae_OA_blastn_evals_0.001.txt

Unique BLAST Species: 20150501_Cgigas_larvae_OA_unique_blastn_evals.txt

Firstly, since this library was bisulfite converted, we know that matching won’t be as robust as we’d normally see.

However, the BLAST matches for this are terrible.

Only 0.65% of the BLAST matches (e-value <0.001) are to Crassostrea gigas. Yep, you read that correctly: 0.65%.

It’s nearly 40-fold less than the top species: Dictyostelium discoideum (a slime mold)

It’s 30-fold less than the next species: Danio rerio (zebra fish)

Then it’s followed up by human and mouse.

I think I will need to contact the Univ. of Oregon sequencing facility to see what their thoughts on this data is, because it’s not even remotely close to what we should be seeing, even with the bisulfite conversion…

# Goals – May 2015

Here are the things I plan to tackle throughout the month of May:

### Geoduck Reproductive Development Transcriptomics

My primary goal for this project is to successfully isolate RNA from the remaining, troublesome paraffin blocks that have yet to yield any usable RNA. The next approach to obtain usable quantities of RNA is to directly gouge tissue from the blocks instead of sectioning the blocks (as recommended in the PAXgene Tissue RNA Kit protocol). Hopefully this approach will eliminate excess paraffin, while increasing the amount of input tissue. Once I have RNA from the entire suite of samples, I’ll check the RNA integrity via Bioanalyzer and then we’ll decide on a facility to use for high-throughput sequencing.

### BS-Seq Illumina Data Assembly/Mapping

Currently, there are two projects that we have performed BS-Seq with (Crassostrea gigas larvae OA (2011) bisulfite sequencing and LSU C.virginica Oil Spill MBD BS Sequencing) and we’re struggling to align sequences to the C.gigas genome. Granted, the LSU samples are C.virginica, but the C.gigas larvae libraries are not aligning to the C.gigas genome via standard BLASTn or using a dedicated bisulfite mapper (e.g. BS-Map). I’m currently BLASTing a de-novo assembly of the C.gigas larvae OA 400ppm sequencing that Steven made against the NCBI nt DB in an attempt to assess the taxonomic distribution of the sequences we received back. I’ll also try using a different bisulfite mapper, bismark, that Mackenzie Gavery has previously used and has had better results with than BS-Map.

### C.gigas Heat Stress MeDIP/BS-Seq

As part of Claire’s project, there’s still some BS-Seq data that would be nice to have to complement the data she generated via microarray. It would be nice to make a decision about how to proceed with the samples. However, part of our decision on how to proceed is governed by the results we get from the two projects above. Why do those two projects impact the decision(s) regarding this project? They impact this project because in the two projects above, we produced our own BS-Seq libraries. This is extremely cost effective. However, if we can’t obtain usable data from doing the library preps in-house, then that means we have to use an external service provider. Using an external company to do this is significantly more expensive. Additionally, not all companies can perform bisulfite treatment, which limits our choices (and, in turn, pricing options) on where to go for sequencing.

### Miscellany

When I have some down time, I’ll continue working on migrating my Wikispaces notebook to this notebook. I only have one year left to go and it’d be great is all my notebook entries were here so they’d all be tagged/categorized and, thus, be more searchable. I’d also like to work on adding README files to our plethora of electronic data folders. Having these in place will greatly facilitate the ability of people to quickly and more easily figure out what these folders contain, file formats within those folders, etc. I also have a few computing tips/tricks that I’d like to add to our Github “Code” page. Oh, although this isn’t really lab related, I was asked to teach the Unix shell lesson (or, at least, part of it) at the next Software Carpentry Workshop that Ben Marwick is setting up at UW in early June. So, I’m thinking that I’ll try to incorporate some of the data handling stuff I’ve been tackling in lab in to the lesson I end up teaching. Additionally, going through the Software Carpentry materials will help reinforce some of the “fundamental” tasks that I can do with the shell (like find, cut and grep).

In the lab, I plan on sealing up our nearly overflowing “Broken Glass” box and establishing a new one. I need to autoclave, and dispose of, a couple of very full biohazard bags. I’m also going to vow that I will get Jonathan to finally obtain a successful PCR from his sea pen RNA.

# BLASTN – C.gigas OA Larvae to C.gigas Ensembl 1.24 BLAST DB

In an attempt to figure out what’s going on with the Illumina data we recently received for these samples, I BLASTed the 400ppm data set that had previously been de-novo assembled by Steven: EmmaBS400.fa.

I also created a nucleotide BLAST database (DB) from the Crassostrea_gigas.GCA_000297895.1.24.fa

Jupyter (IPython) Notebook: 20150429_Gigas_larvae_OA_BLASTn.ipynb

Notebook Viewer: 20150429_Gigas_larvae_OA_BLASTn

Results:

The results are not great.

All query contigs successfully BLAST to sequences in the C.gigas Ensembl BLAST DB. However, only 33 of the sequences (out of ~37,000) have an e-value of 0.0. The next best e-value for any matches is 0.001. For the uninitiated, that value is not very good, especially when you’re BLASTing against the same exact species DB.

Will BLASTn the C.gigas contigs against the entire GenBank nt (all nucleotides) to see what the taxonomy breakdown looks like of these sequences.

# Quality Trimming – C.gigas Larvae OA BS-Seq Data

Jupyter (IPython) Notebook: 20150414_C_gigas_Larvae_OA_Trimmomatic_FASTQC.ipynb

# Sequence Data Analysis – C.gigas Larvae OA BS-Seq Data

Compared total amount of data generated from each index. The commands below send the output of the ‘ls -l’ command to awk. Awk sums the file sizes, found in the 5th field ($5) of the ‘ls -l’ command, then prints the sum, divided by 1024^3 to convert from bytes to gigabytes. Index: CTTGTA $ ls -l 2212_lane2_[C]* | awk '{sum += $5} END {print sum/1024/1024/1024}' 5.33341 Index: GCCAAT $ ls -l 2212_lane2_[G]* | awk '{sum += $5} END {print sum/1024/1024/1024}' 7.00596 There’s ~1.4x data in the GCCAAT files. Ran FASTQC on the following files downloaded earlier today: 2212_lane2_CTTGTA_L002_R1_001.fastq.gz 2212_lane2_CTTGTA_L002_R1_002.fastq.gz 2212_lane2_CTTGTA_L002_R1_003.fastq.gz 2212_lane2_CTTGTA_L002_R1_004.fastq.gz 2212_lane2_GCCAAT_L002_R1_001.fastq.gz 2212_lane2_GCCAAT_L002_R1_002.fastq.gz 2212_lane2_GCCAAT_L002_R1_003.fastq.gz 2212_lane2_GCCAAT_L002_R1_004.fastq.gz 2212_lane2_GCCAAT_L002_R1_005.fastq.gz 2212_lane2_GCCAAT_L002_R1_006.fastq.gz The FASTQC command is below. This command runs FASTQC in a for loop over any files that begin with “2212_lane2_C” or “2212_lane2_G” and outputs the analyses to the Arabidopsis folder on Eagle: $for file in /Volumes/nightingales/C_gigas/2212_lane2_[CG]*; do fastqc "$file" --outdir=/Volumes/Eagle/Arabidopsis/; done From within the Eagle/Arabidopsis folder, I renamed the FASTQC output files to prepend today’s date: $for file in 2212_lane2_[GC]*; do mv "$file" "20150413_$file"; done

Then, I unzipped the .zip files generated by FASTQC in order to have access to the images, to eliminate the need for screen shots for display in this notebook entry:

$for file in 20150413_2212_lane2_[CG]*.zip; do unzip "$file"; done

The unzip output retained the old naming scheme, so I renamed the unzipped folders:

$for file in 2212_lane2_[GC]*; do mv “$file” “20150413_$file”; done # Sequence Data – C.gigas OA Larvae BS-Seq Demultiplexed I had previously contacted Doug Turnbull at the Univ. of Oregon Genomics Core Facility for help demultiplexing this data, as it was initially returned to us as a single data set with “no index” (i.e. barcode) set for any of the libraries that were sequenced. As it turns out, when multiplexed libraries are sequenced using the Illumina platform, an index read step needs to be “enabled” on the machine for sequencing. Otherwise, the machine does not perform the index read step (since it wouldn’t be necessary for a single library). Surprisingly, the sample submission form for the Univ. of Oregon Genomics Core Facility doesn’t request any information regarding whether or not a submitted sample has been multiplexed. However, by default, they enable the index read step on all sequencing runs. I provided them with the barcodes and they demultiplexed them after the fact. I downloaded the new, demultiplexed files to Owl/nightingales/C_gigas: lane2_CTTGTA_L002_R1_001.fastq.gz lane2_CTTGTA_L002_R1_002.fastq.gz lane2_CTTGTA_L002_R1_003.fastq.gz lane2_CTTGTA_L002_R1_004.fastq.gz lane2_GCCAAT_L002_R1_001.fastq.gz lane2_GCCAAT_L002_R1_002.fastq.gz lane2_GCCAAT_L002_R1_003.fastq.gz lane2_GCCAAT_L002_R1_004.fastq.gz lane2_GCCAAT_L002_R1_005.fastq.gz lane2_GCCAAT_L002_R1_006.fastq.gz Notice that the file names now contain the corresponding index! Renamed the files, to append the order number to the beginning of the file names: $for file in lane2*; do mv "$file" "2212_$file"; done

New file names:

2212_lane2_CTTGTA_L002_R1_001.fastq.gz
2212_lane2_CTTGTA_L002_R1_002.fastq.gz
2212_lane2_CTTGTA_L002_R1_003.fastq.gz
2212_lane2_CTTGTA_L002_R1_004.fastq.gz
2212_lane2_GCCAAT_L002_R1_001.fastq.gz
2212_lane2_GCCAAT_L002_R1_002.fastq.gz
2212_lane2_GCCAAT_L002_R1_003.fastq.gz
2212_lane2_GCCAAT_L002_R1_004.fastq.gz
2212_lane2_GCCAAT_L002_R1_005.fastq.gz
2212_lane2_GCCAAT_L002_R1_006.fastq.gz

Updated the checksums.md5 file to include the new files (the command is written to exclude the previously downloaded files that are named “2212_lane2_NoIndex_”; the [^N] regex excludes any files that have a capital ‘N’ at that position in the file name):

$for file in 2212_lane2_[^N]*; do md5 "$file" >> checksums.md5; done

Updated the readme.md file to reflect the addition of these new files.

# Sequencing Data – C.gigas Larvae OA

Our sequencing data (Illumina HiSeq2500, 100SE) for this project has completed by Univ. of Oregon Genomics Core Facility (order number 2212).

Samples sequenced/pooled for this run:

 Sample Treatment Barcode 400ppm 400ppm GCCAAT 1000ppm 1000ppm CTTGTA

All code listed below was run on OS X 10.9.5

#!/bin/bash curl -O http://gcf.uoregon.edu:8080/job/download/2212?fileName=lane2_NoIndex_L002_R1_001.fastq.gz curl -O http://gcf.uoregon.edu:8080/job/download/2212?fileName=lane2_NoIndex_L002_R1_002.fastq.gz curl -O http://gcf.uoregon.edu:8080/job/download/2212?fileName=lane2_NoIndex_L002_R1_003.fastq.gz curl -O http://gcf.uoregon.edu:8080/job/download/2212?fileName=lane2_NoIndex_L002_R1_004.fastq.gz curl -O http://gcf.uoregon.edu:8080/job/download/2212?fileName=lane2_NoIndex_L002_R1_005.fastq.gz curl -O http://gcf.uoregon.edu:8080/job/download/2212?fileName=lane2_NoIndex_L002_R1_006.fastq.gz curl -O http://gcf.uoregon.edu:8080/job/download/2212?fileName=lane2_NoIndex_L002_R1_007.fastq.gz curl -O http://gcf.uoregon.edu:8080/job/download/2212?fileName=lane2_NoIndex_L002_R1_008.fastq.gz curl -O http://gcf.uoregon.edu:8080/job/download/2212?fileName=lane2_NoIndex_L002_R1_009.fastq.gz curl -O http://gcf.uoregon.edu:8080/job/download/2212?fileName=lane2_NoIndex_L002_R1_010.fastq.gz curl -O http://gcf.uoregon.edu:8080/job/download/2212?fileName=lane2_NoIndex_L002_R1_011.fastq.gz curl -O http://gcf.uoregon.edu:8080/job/download/2212?fileName=lane2_NoIndex_L002_R1_012.fastq.gz

Renamed all files by removing the beginning of each file name (2112?fileName=) and replacing that with 2212_:

$for file in 2212*lane2_NoIndex_L002_R1_0*; do mv "$file" "${file/#2212?fileName=/2212_}"; done Created a directory readme.md (markdown) file to list & describe directory contents: readme.md $ls *.gz >> readme.md

Note: In order for the readme file to appear in the web directory listing, the file cannot be all upper-case.

Create MD5 checksums for each the files: checkums.md5

\$md5 2212* >> checksums.md5

# Library Quality Assessment – C.gigas OA larvae Illumina libraries

Ran the 400ppm library and the 1000ppm library preps on a DNA1000 Assay Chip (Agilent) on the Agilent 2100 Bioanalyzer.

Results:

Electropherogram overlay of both samples:

Red = 400ppm

Blue = 1000ppm

Measurement data and parameters are here: 20150302_Bioanalyzer_Cgigas_400_1000ppm_BS-Seq

Both libraries look good; no adaptor contamination (peak would be present at ~125bp), good library sizes.

Pooled equal quantities of each library, based off the concentration values above, to prepare the sample for sequencing.

 Component Volume (μL) Quantity (ng) 400ppm library 10 14.7 1000ppm library 1.09 14.7 Buffer EB 7.81 N/A 1% Tween20 2.1 N/A Total 21 N/A

The pooled libraries will be submitted tomorrow to the Genomics Core Facility at the Univ. of Oregon for high-throughput sequencing (100bp, SE) on the HiSeq2500 (Illumina). Sample order #2212.