DNA Methylation Analysis – Bismark Pipeline on All Olympia oyster BSseq Datasets

Bismark analysis of all of our current Olympia oyster (Ostrea lurida) DNA methylation high-throughput sequencing data.

Analysis was run on Emu (Ubuntu 16.04LTS, Apple Xserve). The primary analysis took ~14 days to complete.

All operations are documented in a Jupyter notebook (GitHub):

Genome used:


Input files (see the Olympia oyster Genomic GitHub wiki for more info):

WG BSseq of Fidalgo Bay offspring grown in Fidalgo Bay & Oyster Bay
  • 1_ATCACG_L001_R1_001.fastq.gz

  • 2_CGATGT_L001_R1_001.fastq.gz

  • 3_TTAGGC_L001_R1_001.fastq.gz

  • 4_TGACCA_L001_R1_001.fastq.gz

  • 5_ACAGTG_L001_R1_001.fastq.gz

  • 6_GCCAAT_L001_R1_001.fastq.gz

  • 7_CAGATC_L001_R1_001.fastq.gz

  • 8_ACTTGA_L001_R1_001.fastq.gz

MBDseq of two populations (Hood Canal & Oyster Bay) grown in Clam Bay
  • zr1394_10_s456.fastq.gz

  • zr1394_11_s456.fastq.gz

  • zr1394_12_s456.fastq.gz

  • zr1394_13_s456.fastq.gz

  • zr1394_14_s456.fastq.gz

  • zr1394_15_s456.fastq.gz

  • zr1394_16_s456.fastq.gz

  • zr1394_17_s456.fastq.gz

  • zr1394_18_s456.fastq.gz

  • zr1394_1_s456.fastq.gz

  • zr1394_2_s456.fastq.gz

  • zr1394_3_s456.fastq.gz

  • zr1394_4_s456.fastq.gz

  • zr1394_5_s456.fastq.gz

  • zr1394_6_s456.fastq.gz

  • zr1394_7_s456.fastq.gz

  • zr1394_8_s456.fastq.gz

  • zr1394_9_s456.fastq.gz


RESULTS:

With Bismark complete, these two sets of analyses can now be examined further (separately, since they are separate experiments) using tools such as MethylKit (R package) and the Integrative Genomics Viewer (IGV).

Output folder:

Bismark Summary Report:

Individual Sample Reports:

Data Received – Geoduck Metagenome HiSeqX Data

Received the data from the geoduck metagenome libraries that I prepared. They were sequenced at the Northwest Genomics Center at UW on the HiSeqX (Illumina), PE 151bp.

FastQ files are being transferred to owl/nightingales/P_generosa.

These aren’t geoduck sequences, but they are part of a geoduck project. Maybe I should establish a metagenomics directory under nightingales?

Will verify md5 checksums and update the readme file once the transfer is complete.
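The checksum verification could look something like the sketch below. It assumes the sequencing facility supplied a standard md5sum-style checksum list alongside the FastQ files; the file name `checksums.md5`, the function name `verify_md5`, and the example path are hypothetical placeholders, not names from this transfer.

```shell
#!/usr/bin/env bash
# Sketch: verify transferred FastQ files against a checksum list.
# Assumptions (hypothetical, not from the notebook): the checksum file is
# called checksums.md5 and sits in the same directory as the FastQ files.

verify_md5() {
  local dir="$1"
  local sums="${2:-checksums.md5}"
  # Run in a subshell so the caller's working directory is unchanged.
  # --quiet prints only the files that FAIL; the exit status is non-zero
  # if any file does not match its recorded checksum.
  ( cd "$dir" && md5sum --check --quiet "$sums" )
}

# Example (hypothetical path):
# verify_md5 /path/to/nightingales/P_generosa && echo "All checksums OK"
```

Because `md5sum --check` returns a non-zero exit status on any mismatch, the function can be chained with `&&` so that the readme only gets updated after a clean verification.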

RNA Isolation & Quantification – Tanner Crab Hemolymph

Isolated RNA from 40 Tanner crab hemolymph samples selected by Grace with the RNeasy Plus Micro Kit (Qiagen) according to the manufacturer’s protocol, with the following modifications:

  • Added mercaptoethanol (2-ME) to Buffer RLT Plus.

  • All spins were at 21,130g

  • Did not add RNA carrier

  • Used QIAshredder columns to aid in homogenization and removal of insoluble material

  • Eluted with 14uL

RNA was quantified using the Qubit RNA HS (high sensitivity) Assay and run on the Roberts Lab Qubit 3.0.

Used 1uL of sample for quantification.

RNA was returned to the -80C box where the original samples had been stored (Rack 2, Row 3, Column 4).


RESULTS

Qubit quantification (Google Sheet):

Overall, the results aren’t great. Only 15 samples (out of 40) had detectable amounts of RNA. Yields from those 15 samples ranged from 40ng – 300ng, with most landing between 50 – 100ng.

Will pass info along to Grace. Will likely meet with her and Steven to discuss a plan for moving forward.

Bioanalyzer – Tanner Crab RNA Isolated with RNeasy Plus Mini Kit

Ran the four Tanner crab RNA samples that I isolated yesterday on the Seeb Lab Bioanalyzer 2100 (Agilent) using the RNA 6000 Pico Kit.

Samples were run following kit protocol:

  • Chip priming station in Position C with syringe clip at top position

  • RNA denatured at 70C for 2mins and stored on ice.

  • RNA ladder aliquot was from 20160826 by Hollie Putnam.


RESULTS

Bioanalyzer data file (XAD):

ELECTROPHEROGRAMS:


GEL REPRESENTATIONS


These results look great to me. Clear, defined peaks/bands, representing ribosomal RNA.

Oddly, one sample (crab_506) appears to be shifted, relative to the other three, despite exhibiting the same peak/banding pattern. Not sure what would cause something like this; contaminants?

Regardless, we finally have clean RNA and have a usable Bioanalyzer profile to use for reference for crab RNA.

NOTE: The lanes marked with red on the gel representation image indicate that an RNA integrity number (RIN) could not be calculated. This is to be expected! The RIN is based on the expectation of two rRNA bands. The anomaly is sample crab_451 – a RIN was actually determined for that sample!

Will likely move forward with additional RNA isolations using the RNeasy Plus Kit (Qiagen).

RNA Cleanup – Tanner Crab RNA

In a continued attempt to figure out what we can do about the Tanner crab RNA, Steven tasked me with using an RNeasy Kit to clean up some existing RNA.

Here are the samples Grace provided:


All of the RNA had some sort of undissolved/insoluble material present. Here’s an example (this is the worst of the bunch – others did not have such large/dense pellets):


Samples were cleaned up using the RNeasy Plus Mini Kit (Qiagen). Added 350uL of Buffer RLT Plus (no beta-mercaptoethanol added) to each sample, vortexed, and then processed according to the manufacturer’s protocol (skipped gDNA Eliminator spin column step).

Samples were eluted with 30uL of nuclease-free water.

Samples were quantified using the Roberts Lab Qubit 3.0 with the RNA High Sensitivity assay (Invitrogen). Used 5uL of sample for measurements.

Samples were also assessed with the Roberts Lab NanoDrop1000.

Samples were recovered from the pedestal after measurement.

RNA was given to Grace for storage at -80C.


RESULTS

Qubit measurements (Google Sheet):
20180731_qubit_RNA_crab_cleanup


NanoDrop Table:


All concentrations were too low for detection via NanoDrop.

Qubit quantification indicates yields ranging from ~25ng to ~192.5ng.

Will share info with Grace and let her compare these numbers to her original concentrations to see if there are any differences.

Regardless, based on my earlier RNA isolation today, these samples should now be much cleaner and we should be able to trust the Qubit quantifications.

RNA Isolation – Tanner Crab Hemolymph Using RNeasy Plus Mini Kit

Tanner crab RNA has proved a bit troublesome. As such, Steven asked me to try isolating some RNA using the RNeasy Plus Mini Kit (Qiagen) to see how things would turn out.

Grace provided me with the following samples:


Crab hemolymph had been collected (100uL?) and preserved with 1mL (?) of RNAlater. Grace pelleted the samples, removed the supernatant, and stored the pelleted material at -80C. Here’s what that looked like:


RNA was isolated according to the manufacturer’s protocol – following guideline for samples with < 1 x 10^6 cells.

One interesting thing that happened is a precipitate formed after adding the initial buffer to the sample:

A solid precipitate formed in each of the tubes that could not be dispersed – it actually looked like a small piece of paper was now present in each tube.

Samples were spun and the supernatant was utilized (this was the normal progression of the protocol, regardless of this precipitate forming).

Samples were eluted with 30uL of nuclease-free water.

Samples were quantified using the Roberts Lab Qubit 3.0 with the RNA High Sensitivity assay (Invitrogen). Used 5uL of sample for measurements.

Samples were also assessed with the Roberts Lab NanoDrop1000. Samples were recovered from the pedestal after measurement.

RNA was given to Grace for storage at -80C.


RESULTS

Qubit measurements (Google Sheet):
20180731_qubit_RNA_crab_isos


NanoDrop Spec Curves:


NanoDrop Table:


Overall, the isolation looks pretty good. The purity looks good (NanoDrop 260/280 ratios) and the absorbance peak at 260nm is exactly where we would want/expect it to be.

The yields (according to the Qubit) are OK. They range from ~37ng – 350ng.

The important part is that this method produced clean RNA, which means the quantification is believable. I think Grace’s earlier RNA isolations using RNAzol RT had too much contamination carried over, leading to incorrect quantification measurements.

Going forward, I think we need to use some sort of isolation kit; however, we will be testing out good old TriReagent as well.

Mox – Password-less SSH!

The high performance computing (HPC) cluster (called Mox) at Univ. of Washington (UW) frustratingly requires a password when SSH-ing, even when SSH keys are in use. I have a lengthy, unintelligible password that I use for my UW account, so having to type this in any time I want to initiate a new SSH session on Mox is a painful process.

Today, I finally got fed up with how much time I was wasting (granted, it’s minor in the grand scheme of my day) just logging in to Mox, so I spent some time figuring out how to automate password entry for a new SSH session with Mox.

I tried to handle this using the program sshpass, but I couldn’t get it to read my password from a file – it would just hang in limbo after executing the command.

In the end, I came across a script (written for expect, not bash) that does this perfectly. Steps to implement this on Ubuntu 16.04 LTS:

  1. Install expect:
    sudo apt install expect
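  2. Create the following script, saved as mox.sh (taken from this StackExchange solution: https://unix.stackexchange.com/questions/31071/shell-script-for-logging-into-a-ssh-server):
    
    #!/usr/bin/expect
    
    # Start the SSH session ("mox" is an alias defined in ~/.ssh/config)
    spawn ssh mox
    # Wait for the password prompt, then send the password followed by Enter
    expect "Password:"
    send "YOUR_PASSWORD\r"
    # Hand control of the session back to the user
    interact
    

    NOTES:

    • I have an ~/.ssh/config file that allows me to use “mox” as an alias for my full SSH command

    • Replace YOUR_PASSWORD with your own UW password.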

  3. Change access to script (set read, write, execute for user only):

    chmod u=rwx,go-rwx mox.sh
  4. Run script from home directory (saved in home directory):
    ./mox.sh

Boom! No more having to track down the password, copy, and paste!

Transposable Element Mapping – Olympia Oyster Genome Assembly, Olurida_v081, using RepeatMasker 4.07

I previously performed this analysis using a different version of our Ostrea lurida genome assembly. Steven asked that I repeat the analysis with a modified version of the genome assembly (Olurida_v081) – only has contigs >1000bp in length.

Genome used: Olurida_v081

I ran RepeatMasker (v4.07) with RepBase-20170127 and RMBlast 2.6.0 four times:

  1. Default settings (i.e. no species selected; RepeatMasker defaults to human).

  2. Species = Crassostrea gigas (Pacific oyster)

  3. Species = Crassostrea virginica (Eastern oyster)

  4. Species = Ostrea lurida (Olympia oyster)

The idea was to get a sense of how the analyses would differ with species specifications. However, it’s likely that the only species setting that will make any difference will be Run #2 (Crassostrea gigas).

The reason I say this is that RepeatMasker has a built in tool to query which species are available in the RepBase database (e.g.):

RepeatMasker-4.0.7/util/queryRepeatDatabase.pl -species "crassostrea virginica" -stat

Here’s a very brief overview of what that yields:

  • Crassostrea gigas: 792 Crassostrea gigas specific repeats

  • Crassostrea virginica: 4 Crassostrea virginica specific repeats

  • Ostrea lurida: 0 Ostrea lurida specific repeats

All runs were performed on roadrunner.

All commands were documented in a Jupyter Notebook (GitHub):

NOTE: RepeatMasker writes the desired output files (*.out, *.cat.gz, and *.gff) to the same directory that the genome is located in! If you conduct multiple runs with the same genome in the same directory, it will overwrite those files, as they are named using the genome assembly filename.
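One way to avoid the overwriting problem is RepeatMasker's `-dir` option, which sends output to a directory of your choosing instead of the genome's directory. The sketch below is not the set of commands from the notebook; the output directory names, the `run_repeatmasker` helper, and the `DRY_RUN` switch are hypothetical, and by default the script only prints the commands it would run.

```shell
#!/usr/bin/env bash
# Sketch: give each RepeatMasker run its own output directory via -dir.
# Assumptions (hypothetical): output directory names, genome location in
# the current directory, and the DRY_RUN switch.

genome="Olurida_v081.fa"
DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the commands; 0 = actually run them

run_repeatmasker() {
  label="$1"; shift
  outdir="repeatmasker_${label}"
  if [ "$DRY_RUN" = "1" ]; then
    # Printed for inspection only (quoting of multi-word species is not shown)
    echo "RepeatMasker $* -gff -dir $outdir $genome"
  else
    mkdir -p "$outdir"
    # Extra arguments (e.g. -species "crassostrea gigas") are passed through
    RepeatMasker "$@" -gff -dir "$outdir" "$genome"
  fi
}

run_repeatmasker defaults
run_repeatmasker cgigas     -species "crassostrea gigas"
run_repeatmasker cvirginica -species "crassostrea virginica"
run_repeatmasker olurida    -species "ostrea lurida"
```

Setting `DRY_RUN=0` executes the runs for real; each run's `*.out`, `*.cat.gz`, and `*.gff` files then land in their own directory rather than replacing the previous run's files.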


RESULTS:
RUN 1 (default settings – human genome)

Output folder:

Summary table (text):

Output table (GFF):

SUMMARY TABLE

==================================================
file name: Olurida_v081.fa          
sequences:        159429
total length: 1140787867 bp  (1077373535 bp excl N/X-runs)
GC level:         36.58 %
bases masked:   17954347 bp ( 1.67 %)
==================================================
               number of      length   percentage
               elements*    occupied  of sequence
--------------------------------------------------
SINEs:            16599       978030 bp    0.09 %
      ALUs            1          292 bp    0.00 %
      MIRs          937        72873 bp    0.01 %

LINEs:             3279       752631 bp    0.07 %
      LINE1         172        10882 bp    0.00 %
      LINE2         646        67827 bp    0.01 %
      L3/CR1        659        60327 bp    0.01 %

LTR elements:       569       127808 bp    0.01 %
      ERVL           32         1949 bp    0.00 %
      ERVL-MaLRs     10          490 bp    0.00 %
      ERV_classI    165        17699 bp    0.00 %
      ERV_classII    26         1590 bp    0.00 %

DNA elements:      1911       161957 bp    0.02 %
     hAT-Charlie     74         4216 bp    0.00 %
     TcMar-Tigger   584        24985 bp    0.00 %

Unclassified:        78         9834 bp    0.00 %

Total interspersed repeats:  2030260 bp    0.19 %


Small RNA:         5592       409456 bp    0.04 %

Satellites:         117        21278 bp    0.00 %
Simple repeats:  270784     12935570 bp    1.20 %
Low complexity:   42130      2568284 bp    0.24 %
==================================================

* most repeats fragmented by insertions or deletions
  have been counted as one element
  Runs of >=20 X/Ns in query were excluded in % calcs


The query species was assumed to be homo sapiens  
RepeatMasker Combined Database: Dfam_Consensus-20170127, RepBase-20170127
        
run with rmblastn version 2.6.0+

RUN 2 (species – Crassostrea gigas)

Output folder:

Summary table (text):

Output table (GFF):

SUMMARY TABLE

file name: Olurida_v081.fa          
sequences:        159429
total length: 1140787867 bp  (1077373535 bp excl N/X-runs)
GC level:         36.58 %
bases masked:  152816516 bp ( 14.18 %)
==================================================
               number of      length   percentage
               elements*    occupied  of sequence
--------------------------------------------------
Retroelements       193250     67253771 bp    6.24 %
   SINEs:             2087       284274 bp    0.03 %
   Penelope         158576     56080082 bp    5.21 %
   LINEs:           179430     61300904 bp    5.69 %
    CRE/SLACS            0            0 bp    0.00 %
     L2/CR1/Rex        675       348273 bp    0.03 %
     R1/LOA/Jockey       0            0 bp    0.00 %
     R2/R4/NeSL          7        10781 bp    0.00 %
     RTE/Bov-B        7051      1827344 bp    0.17 %
     L1/CIN4             0            0 bp    0.00 %
   LTR elements:     11733      5668593 bp    0.53 %
     BEL/Pao          1517       871288 bp    0.08 %
     Ty1/Copia          78        72481 bp    0.01 %
     Gypsy/DIRS1      9151      4445789 bp    0.41 %
       Retroviral        0            0 bp    0.00 %

DNA transposons     233691     33727339 bp    3.13 %
   hobo-Activator    17578      1886743 bp    0.18 %
   Tc1-IS630-Pogo    39184      6403235 bp    0.59 %
   En-Spm                0            0 bp    0.00 %
   MuDR-IS905            0            0 bp    0.00 %
   PiggyBac           7261      1003937 bp    0.09 %
   Tourist/Harbinger  8635       823434 bp    0.08 %
   Other (Mirage,        0            0 bp    0.00 %
    P-element, Transib)

Rolling-circles          0            0 bp    0.00 %

Unclassified:       157855     36675484 bp    3.40 %

Total interspersed repeats:   137656594 bp   12.78 %


Small RNA:             222        72690 bp    0.01 %

Satellites:           6260      1238331 bp    0.11 %
Simple repeats:     241081     11662466 bp    1.08 %
Low complexity:      38915      2347827 bp    0.22 %
==================================================

* most repeats fragmented by insertions or deletions
  have been counted as one element
  Runs of >=20 X/Ns in query were excluded in % calcs


The query species was assumed to be crassostrea gigas
RepeatMasker Combined Database: Dfam_Consensus-20170127, RepBase-20170127
        
run with rmblastn version 2.6.0+

RUN 3 (species – Crassostrea virginica)

Output folder:

Summary table (text):

Output table (GFF):

SUMMARY TABLE

file name: Olurida_v081.fa          
sequences:        159429
total length: 1140787867 bp  (1077373535 bp excl N/X-runs)
GC level:         36.58 %
bases masked:   36996910 bp ( 3.43 %)
==================================================
               number of      length   percentage
               elements*    occupied  of sequence
--------------------------------------------------
Retroelements        59806      9886111 bp    0.92 %
   SINEs:            59806      9886111 bp    0.92 %
   Penelope              0            0 bp    0.00 %
   LINEs:                0            0 bp    0.00 %
    CRE/SLACS            0            0 bp    0.00 %
     L2/CR1/Rex          0            0 bp    0.00 %
     R1/LOA/Jockey       0            0 bp    0.00 %
     R2/R4/NeSL          0            0 bp    0.00 %
     RTE/Bov-B           0            0 bp    0.00 %
     L1/CIN4             0            0 bp    0.00 %
   LTR elements:         0            0 bp    0.00 %
     BEL/Pao             0            0 bp    0.00 %
     Ty1/Copia           0            0 bp    0.00 %
     Gypsy/DIRS1         0            0 bp    0.00 %
       Retroviral        0            0 bp    0.00 %

DNA transposons       8720      2230426 bp    0.21 %
   hobo-Activator        0            0 bp    0.00 %
   Tc1-IS630-Pogo        0            0 bp    0.00 %
   En-Spm                0            0 bp    0.00 %
   MuDR-IS905            0            0 bp    0.00 %
   PiggyBac              0            0 bp    0.00 %
   Tourist/Harbinger     0            0 bp    0.00 %
   Other (Mirage,        0            0 bp    0.00 %
    P-element, Transib)

Rolling-circles          0            0 bp    0.00 %

Unclassified:        47005      9434652 bp    0.88 %

Total interspersed repeats:    21551189 bp    2.00 %


Small RNA:           60030      9959172 bp    0.92 %

Satellites:              8         5100 bp    0.00 %
Simple repeats:     259134     12795379 bp    1.19 %
Low complexity:      42184      2581162 bp    0.24 %
==================================================

* most repeats fragmented by insertions or deletions
  have been counted as one element
  Runs of >=20 X/Ns in query were excluded in % calcs


The query species was assumed to be crassostrea virginica
RepeatMasker Combined Database: Dfam_Consensus-20170127, RepBase-20170127
        
run with rmblastn version 2.6.0+

RUN 4 (species – Ostrea lurida)

Output folder:

Summary table (text):

Output table (GFF):

SUMMARY TABLE

==================================================
file name: Olurida_v081.fa          
sequences:        159429
total length: 1140787867 bp  (1077373535 bp excl N/X-runs)
GC level:         36.58 %
bases masked:   15918797 bp ( 1.48 %)
==================================================
               number of      length   percentage
               elements*    occupied  of sequence
--------------------------------------------------
Retroelements            0            0 bp    0.00 %
   SINEs:                0            0 bp    0.00 %
   Penelope              0            0 bp    0.00 %
   LINEs:                0            0 bp    0.00 %
    CRE/SLACS            0            0 bp    0.00 %
     L2/CR1/Rex          0            0 bp    0.00 %
     R1/LOA/Jockey       0            0 bp    0.00 %
     R2/R4/NeSL          0            0 bp    0.00 %
     RTE/Bov-B           0            0 bp    0.00 %
     L1/CIN4             0            0 bp    0.00 %
   LTR elements:         0            0 bp    0.00 %
     BEL/Pao             0            0 bp    0.00 %
     Ty1/Copia           0            0 bp    0.00 %
     Gypsy/DIRS1         0            0 bp    0.00 %
       Retroviral        0            0 bp    0.00 %

DNA transposons          0            0 bp    0.00 %
   hobo-Activator        0            0 bp    0.00 %
   Tc1-IS630-Pogo        0            0 bp    0.00 %
   En-Spm                0            0 bp    0.00 %
   MuDR-IS905            0            0 bp    0.00 %
   PiggyBac              0            0 bp    0.00 %
   Tourist/Harbinger     0            0 bp    0.00 %
   Other (Mirage,        0            0 bp    0.00 %
    P-element, Transib)

Rolling-circles          0            0 bp    0.00 %

Unclassified:            3          189 bp    0.00 %

Total interspersed repeats:         189 bp    0.00 %


Small RNA:             224        73061 bp    0.01 %

Satellites:              8         5100 bp    0.00 %
Simple repeats:     273098     13256460 bp    1.23 %
Low complexity:      42443      2592212 bp    0.24 %
==================================================

* most repeats fragmented by insertions or deletions
  have been counted as one element
  Runs of >=20 X/Ns in query were excluded in % calcs


The query species was assumed to be ostrea lurida 
RepeatMasker Combined Database: Dfam_Consensus-20170127, RepBase-20170127
        
run with rmblastn version 2.6.0+
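As the table footnotes say, runs of N/X are excluded from the percentage calculations, so the "bases masked" percentage is the masked base count divided by the length excluding N/X runs (1077373535 bp), not by the total length. A small sketch to reproduce the four reported values from the numbers in the summary tables above; the function name `pct_masked` is just an illustrative choice.

```shell
#!/usr/bin/env bash
# Sketch: recompute the "bases masked" percentages from the summary tables.
# Denominator is the genome length excluding N/X runs, per the table footnote.

pct_masked() {
  # $1 = bases masked, $2 = total length excluding N/X runs
  awk -v masked="$1" -v len="$2" 'BEGIN { printf "%.2f\n", 100 * masked / len }'
}

len_no_n=1077373535

pct_masked 17954347  "$len_no_n"   # Run 1 (default/human)     -> 1.67
pct_masked 152816516 "$len_no_n"   # Run 2 (C. gigas)          -> 14.18
pct_masked 36996910  "$len_no_n"   # Run 3 (C. virginica)      -> 3.43
pct_masked 15918797  "$len_no_n"   # Run 4 (O. lurida)         -> 1.48
```

Seen side by side, the Crassostrea gigas run masks roughly four to ten times as much sequence as any of the other three runs.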



Library Construction – Geoduck Water Filter Metagenome with Nextera DNA Flex Kit (Illumina)

Made Illumina libraries with geoduck metagenome water filter DNA I previously isolated on:

We used a free Nextera DNA Flex Kit (Illumina) that we won in a contest held by Illumina!

Followed the manufacturer’s protocol for input DNA quantities <10ng with the following changes/notes:

  • PCR steps performed in 200uL thin-walled PCR tubes.

  • Magnetic separations were performed in 1.7mL snap cap tubes.

  • Thermal cycler: PTC-200 (MJ Research)

  • Magnet: DynaMag 2 (Invitrogen)

See the Library Calcs sheet (link below) for original sample names and subsequent library sample names.

IMPORTANT!

The sheet also contains the indexes used for each library. This info will be necessary for the sequencing facility.

Library Calcs (Google Sheet):

Links to the Illumina manuals are below:

After library construction was completed, individual libraries were quantified on the Roberts Lab Qubit 3.0 (Invitrogen) with the Qubit 1x dsDNA HS Assay Kit.

2uL of each sample was used for each assay.

Library quality was assessed using the Seeb Lab 2100 Bioanalyzer (Agilent) with a High Sensitivity DNA Kit, using 1uL of each sample.

Libraries were stored in the small -20C in FTR213:


Results:

Qubit Raw Data (Google Sheet):

Bioanalyzer File (XAD):

All libraries have DNA in them, so that’s good!

With the exception of Library Geoduck MG #04, which is bad, the libraries look OK (i.e. not great). Compared to the example on Pg. 12 in the manual, these libraries all have some extra high molecular weight stuff.

When selecting the range listed in the Nextera Kit manual, the average fragment size is ~530bp – the expected size should be ~600bp.

Spoke with Steven about Library Geoduck MG #04 and we’ve opted to just leave it out.

All other samples were pooled into a single sample according to the manufacturer’s protocol.

This pooled sample was stored in the same -20C box as above, in position I4.


UPDATE 20180808

After some confusion with the sequencing facility, I contacted Illumina regarding adapter sequences. I used the sequences provided for the Nextera DNA 24 CD Indexes (which was the index kit we used) on p.18 of the Illumina Index Adapter Pooling Guide.

As it turns out, these sequences are incorrect. The correct sequences are on p.12 of that document (the Nextera DNA 96 CD Indexes).

I’ve updated the Google Sheet (linked above) to reflect the correct index sequences.

Email from Illumina is below. Even though he specifically references the H705 adapter, the correct sequence information for all i7 index adapters is found on p.12.

Hi Sam,

Thanks for the clarification! For the index sequence H705, this sequence is incorrect in the Index Adapters Pooling Guide. The correct information is found on page 12 of the same document and should be:

H705 “AGGAGTCC” (Bases in Adapter) and “GGACTCCT” (bases for sample sheet.

This is also consistent with the Illumina Adapters letter.

We have provided this feed back to our colleagues to update the document so that all the information is consistent.

Thanks for your patience and understanding while we evaluated this issue. If we do have any other questions or concerns, please let us know and we would be happy to discuss this further.

Best,

Russell

Russell Chan, Ph.D.

Technical Applications Scientist

Illumina Technical Support

Telephone available 24 hours

Monday through Friday

Technical Bulletins: https://support.illumina.com/bulletins.html

Trainings: http://support.illumina.com/traidexes
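The two H705 sequences quoted in the email are reverse complements of each other ("bases in adapter" vs. "bases for sample sheet"), which makes for a quick sanity check when copying index sequences into a sheet. A minimal shell sketch; the function name `revcomp` is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: reverse-complement an index sequence.
# rev reverses the string; tr swaps each base for its complement.

revcomp() {
  printf '%s\n' "$1" | rev | tr 'ACGTacgt' 'TGCAtgca'
}

revcomp AGGAGTCC   # prints GGACTCCT, matching the sample sheet sequence for H705
```

The same check could be applied to the other i7 index sequences in the Library Calcs sheet, though which orientation a given sequencer or facility expects is something to confirm with them.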

BS-seq Mapping – Olympia oyster bisulfite sequencing: Bismark Continued

Previously took the analysis just through the mapping, but didn’t realize Steven wanted me to fully process the data.

So, as an exercise, I followed through with deduplication and sorting of the BAM files.

Then, ran a quick analysis using MethylKit in R. The analysis simply copied what Steven had done with another data set and I haven’t examined it very thoroughly, so I am not well-versed in what it’s doing and/or why.
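For reference, the deduplication and sorting steps generally look something like the sketch below. This is not copied from the Jupyter notebook: it assumes single-end reads, Bismark's default `*_bismark_bt2.bam` output naming, and that `deduplicate_bismark` and `samtools` are on the PATH. The helper names and the `DRY_RUN` switch are hypothetical, and by default the script only prints the commands.

```shell
#!/usr/bin/env bash
# Sketch: deduplicate and coordinate-sort Bismark BAM files.
# Assumptions (hypothetical): single-end data, default Bismark file naming,
# BAM files in the current directory, and the DRY_RUN switch.

DRY_RUN="${DRY_RUN:-1}"   # 1 = print commands only; 0 = execute them

run() {
  if [ "$DRY_RUN" = "1" ]; then echo "$*"; else "$@"; fi
}

dedup_and_sort() {
  local bam="$1"
  local base="${bam%.bam}"
  # deduplicate_bismark writes <base>.deduplicated.bam (-s = single-end)
  run deduplicate_bismark -s --bam "$bam"
  # Coordinate-sort the deduplicated alignments, then index them
  run samtools sort -o "${base}.deduplicated.sorted.bam" "${base}.deduplicated.bam"
  run samtools index "${base}.deduplicated.sorted.bam"
}

for bam in *_bismark_bt2.bam; do
  [ -e "$bam" ] || continue   # skip if the glob matched nothing
  dedup_and_sort "$bam"
done
```

Coordinate-sorted, indexed BAM files are what viewers such as IGV expect, which is the main reason for the sort step here.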

Jupyter Notebook (GitHub):

R Studio Project (download the folder, load project in R Studio, and then run the script in the scripts subdirectory to run the analysis):

Will take the full data sets through this whole pipeline.