# Goals – May 2015

Here are the things I plan to tackle throughout the month of May:

### Geoduck Reproductive Development Transcriptomics

My primary goal for this project is to successfully isolate RNA from the remaining, troublesome paraffin blocks that have yet to yield any usable RNA. The next approach to obtain usable quantities of RNA is to directly gouge tissue from the blocks instead of sectioning the blocks (as recommended in the PAXgene Tissue RNA Kit protocol). Hopefully this approach will eliminate excess paraffin, while increasing the amount of input tissue. Once I have RNA from the entire suite of samples, I’ll check the RNA integrity via Bioanalyzer and then we’ll decide on a facility to use for high-throughput sequencing.

### BS-Seq Illumina Data Assembly/Mapping

Currently, there are two projects that we have performed BS-Seq with (Crassostrea gigas larvae OA (2011) bisulfite sequencing and LSU C.virginica Oil Spill MBD BS Sequencing) and we’re struggling to align sequences to the C.gigas genome. Granted, the LSU samples are C.virginica, but the C.gigas larvae libraries are not aligning to the C.gigas genome via standard BLASTn or using a dedicated bisulfite mapper (e.g. BS-Map). I’m currently BLASTing a de-novo assembly of the C.gigas larvae OA 400ppm sequencing that Steven made against the NCBI nt DB in an attempt to assess the taxonomic distribution of the sequences we received back. I’ll also try using a different bisulfite mapper, bismark, that Mackenzie Gavery has previously used and has had better results with than BS-Map.

### C.gigas Heat Stress MeDIP/BS-Seq

As part of Claire’s project, there’s still some BS-Seq data that would be nice to have to complement the data she generated via microarray. It would be nice to make a decision about how to proceed with the samples. However, part of our decision on how to proceed is governed by the results we get from the two projects above. Why do those two projects impact the decision(s) regarding this project? They impact this project because in the two projects above, we produced our own BS-Seq libraries. This is extremely cost effective. However, if we can’t obtain usable data from doing the library preps in-house, then that means we have to use an external service provider. Using an external company to do this is significantly more expensive. Additionally, not all companies can perform bisulfite treatment, which limits our choices (and, in turn, pricing options) on where to go for sequencing.

### Miscellany

When I have some down time, I’ll continue working on migrating my Wikispaces notebook to this notebook. I only have one year left to go and it’d be great if all my notebook entries were here so they’d all be tagged/categorized and, thus, be more searchable. I’d also like to work on adding README files to our plethora of electronic data folders. Having these in place will make it much easier for people to quickly figure out what these folders contain, file formats within those folders, etc. I also have a few computing tips/tricks that I’d like to add to our GitHub “Code” page. Oh, although this isn’t really lab related, I was asked to teach the Unix shell lesson (or, at least, part of it) at the next Software Carpentry Workshop that Ben Marwick is setting up at UW in early June. So, I’m thinking that I’ll try to incorporate some of the data handling stuff I’ve been tackling in lab into the lesson I end up teaching. Additionally, going through the Software Carpentry materials will help reinforce some of the “fundamental” tasks that I can do with the shell (like find, cut and grep).

In the lab, I plan on sealing up our nearly overflowing “Broken Glass” box and establishing a new one. I need to autoclave, and dispose of, a couple of very full biohazard bags. I’m also going to vow that I will get Jonathan to finally obtain a successful PCR from his sea pen RNA.

# Quality Trimming – C.gigas Larvae OA BS-Seq Data

Jupyter (IPython) Notebook: 20150414_C_gigas_Larvae_OA_Trimmomatic_FASTQC.ipynb

# Quality Trimming – LSU C.virginica Oil Spill MBD BS-Seq Data

Jupyter (IPython) Notebook: 20150414_C_virginica_LSU_Oil_Spill_Trimmomatic_FASTQC.ipynb

### Trimmed FASTQC

#### HB30 25,000ppm oil Index – TGACCA

20150414_trimmed_2112_lane1_TGACCA_L001_R1_001_fastqc.html

# Sequence Data Analysis – LSU C.virginica Oil Spill MBD BS-Seq Data

Performed some rudimentary data analysis on the new, demultiplexed data downloaded earlier today:

2112_lane1_ACAGTG_L001_R1_001.fastq.gz
2112_lane1_ACAGTG_L001_R1_002.fastq.gz
2112_lane1_ATCACG_L001_R1_001.fastq.gz
2112_lane1_ATCACG_L001_R1_002.fastq.gz
2112_lane1_ATCACG_L001_R1_003.fastq.gz
2112_lane1_CAGATC_L001_R1_001.fastq.gz
2112_lane1_CAGATC_L001_R1_002.fastq.gz
2112_lane1_CAGATC_L001_R1_003.fastq.gz
2112_lane1_GCCAAT_L001_R1_001.fastq.gz
2112_lane1_GCCAAT_L001_R1_002.fastq.gz
2112_lane1_TGACCA_L001_R1_001.fastq.gz
2112_lane1_TTAGGC_L001_R1_001.fastq.gz
2112_lane1_TTAGGC_L001_R1_002.fastq.gz

Compared total amount of data (in gigabytes) generated from each index. The commands below send the output of the ‘ls -l’ command to awk. Awk sums the file sizes, found in the 5th field ($5) of the ‘ls -l’ command, then prints the sum, divided by 1024^3 to convert from bytes to gigabytes.

Index: ACAGTG

$ls -l 2112_lane1_AC* | awk '{sum += $5} END {print sum/1024/1024/1024}'

1.49652

Index: ATCACG

$ls -l 2112_lane1_AT* | awk '{sum += $5} END {print sum/1024/1024/1024}'

3.02269

Index: CAGATC

$ls -l 2112_lane1_CA* | awk '{sum += $5} END {print sum/1024/1024/1024}'

3.49797

Index: GCCAAT

$ls -l 2112_lane1_GC* | awk '{sum += $5} END {print sum/1024/1024/1024}'

2.21379

Index: TGACCA

$ls -l 2112_lane1_TG* | awk '{sum += $5} END {print sum/1024/1024/1024}'

0.687374

Index: TTAGGC

$ls -l 2112_lane1_TT* | awk '{sum += $5} END {print sum/1024/1024/1024}'

2.28902

Ran FASTQC on the files downloaded earlier today. The FASTQC command is below. This command runs FASTQC in a for loop over any files that begin with “2112_lane1_” followed by A, T, C or G and outputs the analyses to the Arabidopsis folder on Eagle:

$for file in /Volumes/nightingales/C_virginica/2112_lane1_[ATCG]*; do fastqc "$file" --outdir=/Volumes/Eagle/Arabidopsis/; done

From within the Eagle/Arabidopsis folder, I renamed the FASTQC output files to prepend today’s date:

$for file in 2112_lane1_[ATCG]*; do mv "$file" "20150413_$file"; done

Then, I unzipped the .zip files generated by FASTQC in order to have access to the images, to eliminate the need for screen shots for display in this notebook entry:

$for file in 20150413_2112_lane1_[ATCG]*.zip; do unzip "$file"; done

The unzip output retained the old naming scheme, so I renamed the unzipped folders:

$for file in 2112_lane1_[ATCG]*; do mv "$file" "20150413_$file"; done

The FASTQC results are linked below:

# Sequence Data Analysis – C.gigas Larvae OA BS-Seq Data

Compared total amount of data generated from each index. The commands below send the output of the ‘ls -l’ command to awk. Awk sums the file sizes, found in the 5th field ($5) of the ‘ls -l’ command, then prints the sum, divided by 1024^3 to convert from bytes to gigabytes.

Index: CTTGTA

$ls -l 2212_lane2_[C]* | awk '{sum +=$5} END {print sum/1024/1024/1024}'

5.33341

Index: GCCAAT

$ls -l 2212_lane2_[G]* | awk '{sum +=$5} END {print sum/1024/1024/1024}'

7.00596

There’s ~1.3x as much data in the GCCAAT files as in the CTTGTA files (7.00596 / 5.33341 ≈ 1.31).
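As a rough sketch, the per-index totals could also be produced in a single pass instead of running one ‘ls -l’ command per index. The function name `size_by_index` is hypothetical, and it assumes the file naming scheme used in this entry, with the index as the third underscore-separated field of the file name. It uses `wc -c` for the byte counts rather than parsing ‘ls -l’ output.

```shell
#!/bin/bash
# Hypothetical helper: total the sizes of all *.fastq.gz files in a directory,
# grouped by barcode index. Assumes names like
# 2212_lane2_GCCAAT_L002_R1_001.fastq.gz (index = 3rd "_"-separated field).
size_by_index() {
  local dir="$1" file
  for file in "$dir"/*.fastq.gz; do
    [ -e "$file" ] || continue   # skip if the glob matched nothing
    # Emit "<file name><TAB><size in bytes>" for awk to aggregate
    printf '%s\t%s\n' "$(basename "$file")" "$(wc -c < "$file" | tr -d ' ')"
  done | awk -F'\t' '
    { split($1, part, "_"); bytes[part[3]] += $2 }
    END {
      # Columns: index, total bytes, total divided by 1024^3
      for (idx in bytes)
        printf "%s\t%.0f\t%.5f\n", idx, bytes[idx], bytes[idx] / 1024 / 1024 / 1024
    }' | sort
}
```

Calling `size_by_index /Volumes/nightingales/C_gigas` would print one tab-separated line per index, which makes the ratio between indexes easy to check by eye.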

2212_lane2_CTTGTA_L002_R1_001.fastq.gz
2212_lane2_CTTGTA_L002_R1_002.fastq.gz
2212_lane2_CTTGTA_L002_R1_003.fastq.gz
2212_lane2_CTTGTA_L002_R1_004.fastq.gz
2212_lane2_GCCAAT_L002_R1_001.fastq.gz
2212_lane2_GCCAAT_L002_R1_002.fastq.gz
2212_lane2_GCCAAT_L002_R1_003.fastq.gz
2212_lane2_GCCAAT_L002_R1_004.fastq.gz
2212_lane2_GCCAAT_L002_R1_005.fastq.gz
2212_lane2_GCCAAT_L002_R1_006.fastq.gz

The FASTQC command is below. This command runs FASTQC in a for loop over any files that begin with “2212_lane2_C” or “2212_lane2_G” and outputs the analyses to the Arabidopsis folder on Eagle:

$for file in /Volumes/nightingales/C_gigas/2212_lane2_[CG]*; do fastqc "$file" --outdir=/Volumes/Eagle/Arabidopsis/; done

From within the Eagle/Arabidopsis folder, I renamed the FASTQC output files to prepend today’s date:

$for file in 2212_lane2_[GC]*; do mv "$file" "20150413_$file"; done

Then, I unzipped the .zip files generated by FASTQC in order to have access to the images, to eliminate the need for screen shots for display in this notebook entry:

$for file in 20150413_2212_lane2_[CG]*.zip; do unzip "$file"; done

The unzip output retained the old naming scheme, so I renamed the unzipped folders:

$for file in 2212_lane2_[GC]*; do mv "$file" "20150413_$file"; done
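The rename, unzip, rename sequence above depends on order: the second rename only works because the glob no longer matches the files that already carry the date. Here’s a minimal sketch of a more defensive version. `prepend_prefix` is a hypothetical helper (not part of FASTQC or any other tool used here) that skips anything already starting with the prefix, so it can be re-run without double-prefixing.

```shell
#!/bin/bash
# Hypothetical helper: rename every entry in DIR that matches GLOB to
# PREFIX<name>, skipping entries that already start with PREFIX.
prepend_prefix() {
  local dir="$1" prefix="$2" glob="$3" path name
  # $glob is intentionally unquoted so the shell expands the pattern
  for path in "$dir"/$glob; do
    [ -e "$path" ] || continue          # glob matched nothing
    name=$(basename "$path")
    case "$name" in
      "$prefix"*) continue ;;           # already renamed, leave it alone
    esac
    mv -- "$path" "$dir/${prefix}${name}"
  done
}
```

For example, `prepend_prefix /Volumes/Eagle/Arabidopsis 20150413_ '2212_lane2_[CG]*'` could be run once after FASTQC and again after unzipping.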

# Sequence Data – C.gigas OA Larvae BS-Seq Demultiplexed

I had previously contacted Doug Turnbull at the Univ. of Oregon Genomics Core Facility for help demultiplexing this data, as it was initially returned to us as a single data set with “no index” (i.e. barcode) set for any of the libraries that were sequenced. As it turns out, when multiplexed libraries are sequenced using the Illumina platform, an index read step needs to be “enabled” on the machine for sequencing. Otherwise, the machine does not perform the index read step (since it wouldn’t be necessary for a single library). Surprisingly, the sample submission form for the Univ. of Oregon Genomics Core Facility doesn’t request any information regarding whether or not a submitted sample has been multiplexed. However, by default, they enable the index read step on all sequencing runs. I provided them with the barcodes and they demultiplexed them after the fact.

lane2_CTTGTA_L002_R1_001.fastq.gz
lane2_CTTGTA_L002_R1_002.fastq.gz
lane2_CTTGTA_L002_R1_003.fastq.gz
lane2_CTTGTA_L002_R1_004.fastq.gz
lane2_GCCAAT_L002_R1_001.fastq.gz
lane2_GCCAAT_L002_R1_002.fastq.gz
lane2_GCCAAT_L002_R1_003.fastq.gz
lane2_GCCAAT_L002_R1_004.fastq.gz
lane2_GCCAAT_L002_R1_005.fastq.gz
lane2_GCCAAT_L002_R1_006.fastq.gz

Notice that the file names now contain the corresponding index!

Renamed the files to prepend the order number to the file names:

$for file in lane2*; do mv "$file" "2212_$file"; done

New file names:

2212_lane2_CTTGTA_L002_R1_001.fastq.gz
2212_lane2_CTTGTA_L002_R1_002.fastq.gz
2212_lane2_CTTGTA_L002_R1_003.fastq.gz
2212_lane2_CTTGTA_L002_R1_004.fastq.gz
2212_lane2_GCCAAT_L002_R1_001.fastq.gz
2212_lane2_GCCAAT_L002_R1_002.fastq.gz
2212_lane2_GCCAAT_L002_R1_003.fastq.gz
2212_lane2_GCCAAT_L002_R1_004.fastq.gz
2212_lane2_GCCAAT_L002_R1_005.fastq.gz
2212_lane2_GCCAAT_L002_R1_006.fastq.gz

Updated the checksums.md5 file to include the new files (the command is written to exclude the previously downloaded files that are named “2212_lane2_NoIndex_”; the [^N] glob pattern excludes any files that have a capital ‘N’ at that position in the file name):

$for file in 2212_lane2_[^N]*; do md5 "$file" >> checksums.md5; done

Updated the readme.md file to reflect the addition of these new files.

# Sequence Data – LSU C.virginica Oil Spill MBD BS-Seq Demultiplexed

I had previously contacted Doug Turnbull at the Univ. of Oregon Genomics Core Facility for help demultiplexing this data, as it was initially returned to us as a single data set with “no index” (i.e. barcode) set for any of the libraries that were sequenced. As it turns out, when multiplexed libraries are sequenced using the Illumina platform, an index read step needs to be “enabled” on the machine for sequencing. Otherwise, the machine does not perform the index read step (since it wouldn’t be necessary for a single library). Surprisingly, the sample submission form for the Univ. of Oregon Genomics Core Facility doesn’t request any information regarding whether or not a submitted sample has been multiplexed. However, by default, they enable the index read step on all sequencing runs. I provided them with the barcodes and they demultiplexed them after the fact.

I downloaded the new, demultiplexed files to Owl/nightingales/C_virginica:

lane1_ACAGTG_L001_R1_001.fastq.gz
lane1_ACAGTG_L001_R1_002.fastq.gz
lane1_ATCACG_L001_R1_001.fastq.gz
lane1_ATCACG_L001_R1_002.fastq.gz
lane1_ATCACG_L001_R1_003.fastq.gz
lane1_CAGATC_L001_R1_001.fastq.gz
lane1_CAGATC_L001_R1_002.fastq.gz
lane1_CAGATC_L001_R1_003.fastq.gz
lane1_GCCAAT_L001_R1_001.fastq.gz
lane1_GCCAAT_L001_R1_002.fastq.gz
lane1_TGACCA_L001_R1_001.fastq.gz
lane1_TTAGGC_L001_R1_001.fastq.gz
lane1_TTAGGC_L001_R1_002.fastq.gz

Notice that the file names now contain the corresponding index!

Renamed the files to prepend the order number to the file names:

$for file in lane1*; do mv "$file" "2112_$file"; done

New file names:

2112_lane1_ACAGTG_L001_R1_001.fastq.gz
2112_lane1_ACAGTG_L001_R1_002.fastq.gz
2112_lane1_ATCACG_L001_R1_001.fastq.gz
2112_lane1_ATCACG_L001_R1_002.fastq.gz
2112_lane1_ATCACG_L001_R1_003.fastq.gz
2112_lane1_CAGATC_L001_R1_001.fastq.gz
2112_lane1_CAGATC_L001_R1_002.fastq.gz
2112_lane1_CAGATC_L001_R1_003.fastq.gz
2112_lane1_GCCAAT_L001_R1_001.fastq.gz
2112_lane1_GCCAAT_L001_R1_002.fastq.gz
2112_lane1_TGACCA_L001_R1_001.fastq.gz
2112_lane1_TTAGGC_L001_R1_001.fastq.gz
2112_lane1_TTAGGC_L001_R1_002.fastq.gz

Updated the checksums.md5 file to include the new files (the command is written to exclude the previously downloaded files that are named “2112_lane1_NoIndex_”; the [^N] glob pattern excludes any files that have a capital ‘N’ at that position in the file name):

$for file in 2112_lane1_[^N]*; do md5 "$file" >> checksums.md5; done

Updated the readme.md file to reflect the addition of these new files.
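Since checksums.md5 is only useful if it gets checked later, here’s a minimal sketch of how the file could be verified. It assumes the lines are in the format the OS X `md5` command writes (“MD5 (file name) = digest”) and that it is run from the directory containing the files. `md5_of` and `verify_md5_list` are hypothetical helper names; the fallback to `md5sum` is only there so the same check would also work on a Linux machine.

```shell
#!/bin/bash
# Print only the MD5 digest of a file, using md5sum (Linux) when present and
# falling back to md5 -q (OS X / BSD).
md5_of() {
  if command -v md5sum >/dev/null 2>&1; then
    md5sum < "$1" | awk '{print $1}'
  else
    md5 -q "$1"
  fi
}

# Hypothetical helper: re-check every entry in a checksum file made up of
# lines of the form "MD5 (file name) = digest".
verify_md5_list() {
  local list="$1" line name expected status=0
  while IFS= read -r line; do
    [ -n "$line" ] || continue
    name=${line#"MD5 ("}          # drop the leading "MD5 ("
    name=${name%") = "*}          # drop the trailing ") = digest"
    expected=${line##*" = "}      # keep only the digest
    if [ "$(md5_of "$name")" != "$expected" ]; then
      echo "MISMATCH: $name" >&2
      status=1
    fi
  done < "$list"
  return "$status"
}
```

Usage would be something like `verify_md5_list checksums.md5 && echo "all files OK"`, with any mismatched file names printed to stderr.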

# Sequencing Data – C.gigas Larvae OA

Our sequencing data (Illumina HiSeq2500, 100SE) for this project has been completed by the Univ. of Oregon Genomics Core Facility (order number 2212).

Samples sequenced/pooled for this run:

| Sample | Treatment | Barcode |
|---|---|---|
| 400ppm | 400ppm | GCCAAT |
| 1000ppm | 1000ppm | CTTGTA |

All code listed below was run on OS X 10.9.5

#!/bin/bash
curl -O "http://gcf.uoregon.edu:8080/job/download/2212?fileName=lane2_NoIndex_L002_R1_001.fastq.gz"
curl -O "http://gcf.uoregon.edu:8080/job/download/2212?fileName=lane2_NoIndex_L002_R1_002.fastq.gz"
curl -O "http://gcf.uoregon.edu:8080/job/download/2212?fileName=lane2_NoIndex_L002_R1_003.fastq.gz"
curl -O "http://gcf.uoregon.edu:8080/job/download/2212?fileName=lane2_NoIndex_L002_R1_004.fastq.gz"
curl -O "http://gcf.uoregon.edu:8080/job/download/2212?fileName=lane2_NoIndex_L002_R1_005.fastq.gz"
curl -O "http://gcf.uoregon.edu:8080/job/download/2212?fileName=lane2_NoIndex_L002_R1_006.fastq.gz"
curl -O "http://gcf.uoregon.edu:8080/job/download/2212?fileName=lane2_NoIndex_L002_R1_007.fastq.gz"
curl -O "http://gcf.uoregon.edu:8080/job/download/2212?fileName=lane2_NoIndex_L002_R1_008.fastq.gz"
curl -O "http://gcf.uoregon.edu:8080/job/download/2212?fileName=lane2_NoIndex_L002_R1_009.fastq.gz"
curl -O "http://gcf.uoregon.edu:8080/job/download/2212?fileName=lane2_NoIndex_L002_R1_010.fastq.gz"
curl -O "http://gcf.uoregon.edu:8080/job/download/2212?fileName=lane2_NoIndex_L002_R1_011.fastq.gz"
curl -O "http://gcf.uoregon.edu:8080/job/download/2212?fileName=lane2_NoIndex_L002_R1_012.fastq.gz"

Renamed all files by removing the beginning of each file name (2212?fileName=) and replacing that with 2212_:

$for file in 2212*lane2_NoIndex_L002_R1_0*; do mv "$file" "${file/#2212?fileName=/2212_}"; done

Created a directory readme.md (markdown) file to list & describe directory contents: readme.md

$ls *.gz >> readme.md

Note: In order for the readme file to appear in the web directory listing, the file cannot be all upper-case.

Created MD5 checksums for each of the files: checksums.md5

$md5 2212* >> checksums.md5
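As an alternative to downloading with `curl -O` and renaming afterwards, the two steps could be combined by giving curl the final file name with `-o`. This is only a sketch: the URL pattern and the 12 chunk numbers are taken from the commands in this entry, while `chunk_name` and `download_all` are hypothetical helper names, and I haven’t run this against the facility’s server.

```shell
#!/bin/bash
# Values taken from the curl commands in this entry.
order=2212
base_url="http://gcf.uoregon.edu:8080/job/download"

# Server-side name of chunk N, e.g. 1 -> lane2_NoIndex_L002_R1_001.fastq.gz
chunk_name() {
  printf 'lane2_NoIndex_L002_R1_%03d.fastq.gz\n' "$1"
}

# Hypothetical wrapper: fetch chunks 1..12 and save each one directly under
# its final "2212_" name, so no rename step is needed afterwards.
download_all() {
  local i remote
  for ((i = 1; i <= 12; i++)); do
    remote=$(chunk_name "$i")
    # -o sets the local file name; quoting keeps the shell from treating
    # the "?" in the URL as a glob character.
    curl -o "${order}_${remote}" "${base_url}/${order}?fileName=${remote}" || return 1
  done
}
```

Running `download_all` from the target directory would then leave files already named 2212_lane2_NoIndex_L002_R1_001.fastq.gz through 2212_lane2_NoIndex_L002_R1_012.fastq.gz.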

# Library Quality Assessment – C.gigas OA larvae Illumina libraries

Ran the 400ppm library and the 1000ppm library preps on a DNA1000 Assay Chip (Agilent) on the Agilent 2100 Bioanalyzer.

Results:

Electropherogram overlay of both samples:

Red = 400ppm

Blue = 1000ppm

Measurement data and parameters are here: 20150302_Bioanalyzer_Cgigas_400_1000ppm_BS-Seq

Both libraries look good; no adaptor contamination (peak would be present at ~125bp), good library sizes.

Pooled equal quantities of each library, based off the concentration values above, to prepare the sample for sequencing.

| Component | Volume (μL) | Quantity (ng) |
|---|---|---|
| 400ppm library | 10 | 14.7 |
| 1000ppm library | 1.09 | 14.7 |
| Buffer EB | 7.81 | N/A |
| 1% Tween20 | 2.1 | N/A |
| Total | 21 | N/A |
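The arithmetic behind the pooling table can be sketched as below. Each library volume is the target quantity divided by the library concentration, and Buffer EB makes up the rest of the 21μL total. The function names are hypothetical, and the 1.47ng/μL concentration used in the example is back-calculated from the table (14.7ng in 10μL), not a measured value copied from the Bioanalyzer report.

```shell
#!/bin/bash
# Volume (uL) needed to deliver a given quantity (ng) at a given
# concentration (ng/uL): volume = quantity / concentration.
pool_volume() {
  awk -v qty="$1" -v conc="$2" 'BEGIN { printf "%.2f\n", qty / conc }'
}

# Buffer EB volume (uL) needed to reach the total, given the two library
# volumes and the Tween20 volume.
buffer_volume() {
  awk -v total="$1" -v a="$2" -v b="$3" -v c="$4" \
    'BEGIN { printf "%.2f\n", total - a - b - c }'
}
```

For the values in the table, `pool_volume 14.7 1.47` gives 10.00 and `buffer_volume 21 10 1.09 2.1` gives 7.81, matching the Buffer EB row.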

The pooled libraries will be submitted tomorrow to the Genomics Core Facility at the Univ. of Oregon for high-throughput sequencing (100bp, SE) on the HiSeq2500 (Illumina). Sample order #2212.

# BS-seq Library Prep – C.gigas Larvae OA 1000ppm

### Bisulfite Conversion

Pooled 200ng each of the sheared 1B1 (4μL) & 1B2 (used the entire sample, 20μL) 5.13.11 1000ppm C.gigas larvae DNA samples for a total of 400ng. Total volume = 24μL.

Quantified the pooled DNA using the NanoDrop1000 (ThermoFisher) prior to initiating bisulfite conversion.

Clearly, the NanoDrop measurements differ from the expected concentration. NanoDrop suggests the total amount of input DNA is ~1400ng (58ng/μL x 24μL = 1392ng). This is most likely due to RNA carryover, as DNA quantification using a fluorescence-based, double-stranded DNA assay performed previously shows a drastically lower concentration.

Proceeded with bisulfite conversion using the Methylamp DNA Modification Kit (Epigentek) in 1.5mL tube, according to the manufacturer’s protocol:

• Added 1μL to sample, incubated 10mins @ 37C in water bath
• Made fresh R1/R2/R3 solution (1.1mL R3 buffer added to vial of R2, vortexed 2mins, 40μL R1 added to mixture – Remainder stored @ -20C in “-20C Kit Components Box”)
• Added 125μL of R1/R2/R3 solution to sample, incubated 90mins @ 65C in heating block with water
• Added 300μL R4 to sample, mixed, transferred to column, spun 12,000RPM 30s
• Added 200μL R5 to column, spun 12,000RPM 30s
• Added 50μL R1/ethanol solution to column, incubated 8mins @ RT, spun 12,000RPM 30s
• Washed column with 200μL of 90% EtOH, spun 12,000RPM 30s; repeated one time.
• Eluted DNA with 12μL R6, spun 12,000RPM 30s

Quantified post-bisulfite-treated sample on NanoDrop1000:

Definitely a low yield (~108ng) relative to the input (~400ng). Will proceed with Illumina library prep.

### Library Prep

Illumina library prep was performed with EpiNext Post-Bisulfite DNA Library Preparation Kit (Illumina) (Epigentek). Changes to the manufacturer’s protocol:

• Samples were transferred to 1.5mL snap cap tubes for all magnetic bead steps in order to fit in our tube magnets.
• PCR cycles: 15

No other changes were made to the manufacturer’s protocol.

Epigentek Barcode Indices assigned, per their recommendations for using two libraries for multiplexing (this will be combined with the 400ppm library):

Barcode #12 – CTTGTA

The library was stored @ -20C and will be checked via Bioanalyzer on Monday.