Tag Archives: Ostrea lurida

DNA Methylation Analysis – Bismark Pipeline on All Olympia oyster BSseq Datasets

Bismark analysis of all of our current Olympia oyster (Ostrea lurida) DNA methylation high-throughput sequencing data.

Analysis was run on Emu (Ubuntu 16.04LTS, Apple Xserve). The primary analysis took ~14 days to complete.

All operations are documented in a Jupyter notebook (GitHub):

Genome used:


Input files ( see Olympia oyster Genomic GitHub wiki for more info ):

WG BSseq of Fidalgo Bay offspring grown in Fidalgo Bay & Oyster Bay
  • 1_ATCACG_L001_R1_001.fastq.gz

  • 2_CGATGT_L001_R1_001.fastq.gz

  • 3_TTAGGC_L001_R1_001.fastq.gz

  • 4_TGACCA_L001_R1_001.fastq.gz

  • 5_ACAGTG_L001_R1_001.fastq.gz

  • 6_GCCAAT_L001_R1_001.fastq.gz

  • 7_CAGATC_L001_R1_001.fastq.gz

  • 8_ACTTGA_L001_R1_001.fastq.gz

MBDseq of two populations (Hood Canal & Oyster Bay) grown in Clam Bay
  • zr1394_10_s456.fastq.gz

  • zr1394_11_s456.fastq.gz

  • zr1394_12_s456.fastq.gz

  • zr1394_13_s456.fastq.gz

  • zr1394_14_s456.fastq.gz

  • zr1394_15_s456.fastq.gz

  • zr1394_16_s456.fastq.gz

  • zr1394_17_s456.fastq.gz

  • zr1394_18_s456.fastq.gz

  • zr1394_1_s456.fastq.gz

  • zr1394_2_s456.fastq.gz

  • zr1394_3_s456.fastq.gz

  • zr1394_4_s456.fastq.gz

  • zr1394_5_s456.fastq.gz

  • zr1394_6_s456.fastq.gz

  • zr1394_7_s456.fastq.gz

  • zr1394_8_s456.fastq.gz

  • zr1394_9_s456.fastq.gz


RESULTS:

With Bismark complete, these two sets of analyses can now be looked into further (and separately, as they are separate experiments) using things like MethylKit (R package) and
the Integrative Genomics Viewer (IGV).

Output folder:

Bismark Summary Report:

Individual Sample Reports:

Genome Annoation – Olympia oyster genome annotation results #02

Yesterday, I annotated our Olympia oyster genome using WQ-MAKER in just 7hrs!.

See that link for run setup and configuration. They are essentially the same, except for the change I’ll discuss below.

The results from that run can be seen here:

In that previous run, I neglected to provide a transposable elements FastA file for use with RepeatMasker.

I remedied that and re-ran it. I modified maker_opts.ctl to include the following:

repeat_protein=../../opt/maker/data/te_proteins.fasta #provide a fasta file of transposable element proteins for RepeatRunner

This TEs file is part of RepeatMasker.


RESULTS

Output folder:

Annotated genome file (GFF):



This run took about an hour longer than the previous run, but for some reason it ran with only 21 workers, instead of 22. This is probably the reason for the increased run time.

I’d like to post a snippet of the GFF file here, but the line lengths are WAY too long and will be virtually impossible to read in this notebook. The GFF consists of listing a “parent” contig and its corresponding info (start/stop/length). Then, there are “children” of this contig that show various regions that are matched within the various databases that were queried, i.e. repeatmasker annotations for identifying repeat regions, protein2genome for full/partial protein matches, etc. Thus, a single scaffold (contig) can have dozens or hundreds of corresponding annotations!

Probably the easiest and most logical to start working with will be those scaffolds that are annotated with a “protein_match”, as these have a corresponding GenBank ID. Parsing these out and then doing a join with NCBI protein IDs will give us a basic annotaiton of “functional” portions of the genome.

Additionally, we should probably do some sort of comparison of this run with the previous run where I did not provide the transposable elements FastA file.

Genome Annoation – Olympia oyster genome annotation results #01

Yesterday, I annotated our Olympia oyster genome using WQ-MAKER in just 7hrs!.

See that link for run setup and configuration.


RESULTS

Before proceeding further, it should be noted that I neglected to provide Maker with a transposable elements FastA file for RepeatMasker to use.

The following line in the maker_opts.ctl file was originally populated with an absolute path to data I didn’t recognize, so I removed it:

repeat_protein= #provide a fasta file of transposable element proteins for RepeatRunner

I’m not entirely sure what the impacts will be on annotation, so I’ve re-run Maker with that line restored (using a relative path). You can find the results of that run here:

  • [Genome Annoation – Olympia oyster genome annotation results #02](http://onsnetwork.org/kubu4/2018/08/08/genome-annoation-olympia-oyster-genome-annotation-results-02/

Output folder:

Annotated genome file (GFF):

I’d like to post a snippet of the GFF file here, but the line lengths are WAY too long and will be virtually impossible to read in this notebook. The GFF consists of listing a “parent” contig and its corresponding info (start/stop/length). Then, there are “children” of this contig that show various regions that are matched within the various databases that were queried, i.e. repeatmasker annotations for identifying repeat regions, protein2genome for full/partial protein matches, etc. Thus, a single scaffold (contig) can have dozens or hundreds of corresponding annotations!

Probably the easiest and most logical approach from here is to start working with scaffolds that are annotated with a “protein_match”, as these have a corresponding GenBank ID. Parsing these out and then doing a join with a database of NCBI protein IDs will give us a basic annotation of “functional” portions of the genome.

Additionally, we should probably do some sort of comparison of this run with the follow up run where I provided the transposable elements FastA file to see what impacts the exclusion/inclusion of that info had on annotation.

Genome Annotation – Olympia oyster genome using WQ-MAKER Instance on Jetstream

Yesterday, our Xsede Startup Application (Google Doc) got approval for 100,000 Service Units (SUs) and 1TB of disk space on Xsede/Atmosphere/Jetstream (or, whatever it’s actually called!). The approval happened within an hour of submitting the application!

Here’s a copy of the approval notice:

Dear Dr. Roberts:

Your recently submitted an XSEDE Startup request has been reviewed and approved.

PI: Steven Roberts, University of Washington

Request: Annotation of Olympia oyster (Ostrea lurida) and Pacific geoduck (Panopea generosa) genomes using WQ_MAKER on Jetstream cloud.

Request Number: MCB180124 (New)

Start Date: N/A

End Date: 2019-08-05

Awarded Resources: IU/TACC (Jetstream): 100,000.0 SUs

IU/TACC Storage (Jetstream Storage): 1,000.0 GB

Allocations Admin Comments:

The estimated value of these awarded resources is $14,890.00. The allocation of these resources represents a considerable investment by the NSF in advanced computing infrastructure for U.S. The dollar value of your allocation is estimated from the NSF awards supporting the allocated resources.

If XSEDE Extended Collaborative Support (ECSS) assistance was recommended by the review panel, you will be contacted by the ECSS team within the next two weeks to begin discussing this collaboration.

For details about the decision and reviewer comments, please see below or go to the XSEDE User Portal (https://portal.xsede.org), login, click on the ALLOCATIONS tab, then click on Submit/Review Request. Once there you will see your recently awarded research request listed on the right under the section ‘Approved’. Please select the view action to see reviewer comments along with the notes from the review meeting and any additional comments from the Allocations administrator.

By default the PI and all co-PIs will be added to the resources awarded. If this is an award on a renewal request, current users will have their account end dates modified to reflect the new end date of this award. PIs, co-PIs, or Allocation Managers can add users to or remove users from resources on this project by logging into the portal (https://portal.xsede.org) and using the ‘Add/Remove User’ form.

Share the impact of XSEDE! In exchange for access to the XSEDE ecosystem, we ask that all users let us know what XSEDE has helped you achieve:

  • For all publications, please acknowledge use of XSEDE and allocated resources by citing the XSEDE paper (https://www.xsede.org/how-to-acknowledge-xsede) and also add your publications to your user profile.
  • Tell us about your achievements (http://www.xsede.org/group/xup/science-achievements).
  • Help us improve our reporting by keeping your XSEDE user profile up to date and completing the demographic fields (https://portal.xsede.org/group/xup/profile).

For question regarding this decision, please contact help@xsede.org.

Best regards,
XSEDE Resource Allocations Service

===========================

REVIEWER COMMENTS

===========================

Review #0 – Excellent

Assessment and Summary:
Maker is a well-known and fitting workflow for Jetstream. This should be a good use of resources.

Appropriateness of Methodology:

After each user on the allocation has logged in, the PI will need to open a ticket via help@xsede.org and request the quota be set per the allocation to 1TB. Please provide the XSEDE portal ID of each user.

Appropriateness of Computational Research Plan:

We request that any publications stemming from work done using Jetstream cite us – https://jetstream-cloud.org/research/citing-jetstream.php

Efficient Use of Resources:


We had a tremendous amount of help from Upendra Devisetty at CyVerse in getting the Xsede Startup Application written, as well as running WQ-MAKER on Xsede/Atmosphere/Jetstream (or, whatever it’s called!).


Now, on to how I got the run going…

I initiated the Olympia oyster genome annotation using a WQ-MAKER instance (MAKER 2.31.9 with CCTools v3.1) on Jetstream:


I followed the excellent step-by-step directions here:

The “MASTER” machine was a “m1.xlarge” machine (i.e. CPU: 24, Mem: 60 GB, Disk: 60 GB).

I attached a 1TB volume to the MASTER machine.

I set up the run using 21 “WORKER” machines. Twenty WORKERS were “m1.large” machines (i.e. CPU: 10, Mem: 30 GB, Disk: 60 GB). The remaining WORKER was set at “m1.xlarge” (i.e. CPU: 24, Mem: 60 GB, Disk: 60 GB) to use up the rest of our allocated memory.


MASTER machine was initialized with the following command:

nohup wq_maker -contigs-per-split 1 -cores 1 -memory 2048 -disk 4096 -N wq_maker_oly${USER} -d all -o master.dbg -debug_size_limit=0 -stats test_out_stats.txt > log_file.txt 2>&1 &

Each WORKER was started with the following command:

nohup work_queue_worker -N wq_maker_oly${USER} --cores all --debug-rotate-max=0 -d all -o worker.dbg > log_file_2.txt 2>&1 &

When starting each WORKER, an error message was generated, but this doesn’t seem to have any impact on the ability of the program to run:


I checked on the status of the run and you can see that there are 22 WORKERS and 15,568 files “WAITING”. BUT, there are 159,429 contigs in our genome FastA!

Why don’t these match??!!

This is because WQ-MAKER splits the genome FastA into smaller FastA files containg only 10 sequences each. This is why we see 10-fold fewer files being processed than sequences in our genome file.


Here is the rest of the nitty gritty details:

Genome file:
Transcriptome file:
Proteome files (NCBI):

The two FastA files below were concatenated into a single FastA file (gigas_virginica_ncbi_proteomes.fasta) for use in WQ-MAKER.

Maker options control file (maker_opts.ctl):

This is a bit difficult to read in this notebook; copy and paste in text editor for easier viewing.

NOTE: Paths to files (e.g. genome FastA) have to be relative paths; cannot be absolute paths!


#-----Genome (these are always required)
genome=../data/oly_genome/Olurida_v081.fa #genome sequence (fasta file or fasta embeded in GFF3 file)
organism_type=eukaryotic #eukaryotic or prokaryotic. Default is eukaryotic

#-----Re-annotation Using MAKER Derived GFF3
maker_gff= #MAKER derived GFF3 file
est_pass=0 #use ESTs in maker_gff: 1 = yes, 0 = no
altest_pass=0 #use alternate organism ESTs in maker_gff: 1 = yes, 0 = no
protein_pass=0 #use protein alignments in maker_gff: 1 = yes, 0 = no
rm_pass=0 #use repeats in maker_gff: 1 = yes, 0 = no
model_pass=0 #use gene models in maker_gff: 1 = yes, 0 = no
pred_pass=0 #use ab-initio predictions in maker_gff: 1 = yes, 0 = no
other_pass=0 #passthrough anyything else in maker_gff: 1 = yes, 0 = no

#-----EST Evidence (for best results provide a file for at least one)
est=../data/oly_transcriptome/Olurida_transcriptome_v3.fasta #set of ESTs or assembled mRNA-seq in fasta format
altest= #EST/cDNA sequence file in fasta format from an alternate organism
est_gff= #aligned ESTs or mRNA-seq from an external GFF3 file
altest_gff= #aligned ESTs from a closly relate species in GFF3 format

#-----Protein Homology Evidence (for best results provide a file for at least one)
protein=../data/gigas_virginica_ncbi_proteomes.fasta  #protein sequence file in fasta format (i.e. from mutiple oransisms)
protein_gff=  #aligned protein homology evidence from an external GFF3 file

#-----Repeat Masking (leave values blank to skip repeat masking)
model_org=all #select a model organism for RepBase masking in RepeatMasker
rmlib= #provide an organism specific repeat library in fasta format for RepeatMasker
repeat_protein= #provide a fasta file of transposable element proteins for RepeatRunner
rm_gff= #pre-identified repeat elements from an external GFF3 file
prok_rm=0 #forces MAKER to repeatmask prokaryotes (no reason to change this), 1 = yes, 0 = no
softmask=1 #use soft-masking rather than hard-masking in BLAST (i.e. seg and dust filtering)

#-----Gene Prediction
snaphmm= #SNAP HMM file
gmhmm= #GeneMark HMM file
augustus_species= #Augustus gene prediction species model
fgenesh_par_file= #FGENESH parameter file
pred_gff= #ab-initio predictions from an external GFF3 file
model_gff= #annotated gene models from an external GFF3 file (annotation pass-through)
est2genome=0 #infer gene predictions directly from ESTs, 1 = yes, 0 = no
protein2genome=0 #infer predictions from protein homology, 1 = yes, 0 = no
trna=0 #find tRNAs with tRNAscan, 1 = yes, 0 = no
snoscan_rrna= #rRNA file to have Snoscan find snoRNAs
unmask=0 #also run ab-initio prediction programs on unmasked sequence, 1 = yes, 0 = no

#-----Other Annotation Feature Types (features MAKER doesn't recognize)
other_gff= #extra features to pass-through to final MAKER generated GFF3 file

#-----External Application Behavior Options
alt_peptide=C #amino acid used to replace non-standard amino acids in BLAST databases
cpus=1 #max number of cpus to use in BLAST and RepeatMasker (not for MPI, leave 1 when using MPI)

#-----MAKER Behavior Options
max_dna_len=100000 #length for dividing up contigs into chunks (increases/decreases memory usage)
min_contig=1 #skip genome contigs below this length (under 10kb are often useless)

pred_flank=200 #flank for extending evidence clusters sent to gene predictors
pred_stats=0 #report AED and QI statistics for all predictions as well as models
AED_threshold=1 #Maximum Annotation Edit Distance allowed (bound by 0 and 1)
min_protein=0 #require at least this many amino acids in predicted proteins
alt_splice=0 #Take extra steps to try and find alternative splicing, 1 = yes, 0 = no
always_complete=0 #extra steps to force start and stop codons, 1 = yes, 0 = no
map_forward=0 #map names and attributes forward from old GFF3 genes, 1 = yes, 0 = no
keep_preds=0 #Concordance threshold to add unsupported gene prediction (bound by 0 and 1)

split_hit=10000 #length for the splitting of hits (expected max intron size for evidence alignments)
single_exon=0 #consider single exon EST evidence when generating annotations, 1 = yes, 0 = no
single_length=250 #min length required for single exon ESTs if 'single_exon is enabled'
correct_est_fusion=0 #limits use of ESTs in annotation to avoid fusion genes

tries=2 #number of times to try a contig if there is a failure for some reason
clean_try=0 #remove all data from previous run before retrying, 1 = yes, 0 = no
clean_up=0 #removes theVoid directory with individual analysis files, 1 = yes, 0 = no
TMP= #specify a directory other than the system default temporary directory for temporary files

After the run finishes, it will have produced a corresponding GFF file for each scaffold. This is unwieldly, so the GFFs will be merged using the following code:

$ gff3_merge -n -d Olurida_v081.maker.output/Olurida_v081_master_datastore_index.log

All WORKERS were running as of 07:45 today. As of this posting (~3hrs later), WQ-MAKER had already processed ~45% of the files! Annotation will be finished by the end of today!!! Crazy!

Mox – Over quota: Olympia oyster genome annotation progress (using Maker 2.31.10)

UPDATE 20180730

Per this GitHub Issue, Steven has cleared out files.

Additionally, it appears that my job has just continued from where it was stuck. Very nice!

ORIGINAL POST IS BELOW


Well, this is an issue. Checked in on job progress and noticed that we’ve exceeded our storage quota on Mox:



Will have post an issue on GitHub to help get unnecessary files cleaned out. Kind of shocking that we’re using >6TB of space already…

Mox – Olympia oyster genome annotation progress (using Maker 2.31.10)

TL;DR – It appears to be continuing where it left off!

I decided to spend some time to figure out what was actually happening, as it’s clear that the annotation process is going to need some additional time to run and may span an additional monthly maintenance shutdown.

This is great, because, otherwise, this will take an eternity to actually complete (particularly because we’d have to move the job to run on one of our lab’s computers – which pale in comparison to the specs of our Mox nodes).

However, it’s a bit shocking that this is taking this long, even on a Mox node!

I started annotating the Olympia oyster genome on 20180529. Since then, the job has been interrupted twice by monthly Mox maintenance (which happens on the 2nd Tuesday of each month). Additionally, when this happens, the SLURM output file is overwritten, making it difficult to assess whether or not Maker continues where it left off or if it’s starting over from scratch.

Anyway, here’s how I deduced that the program is continuing where it left off.

  1. I figured out that it produces a generic feature format (GFF) file for each contig.

  2. Decided to search for the first contig GFF and look at it’s last modified date. This would tell me if it was newly generated (i.e. on the date that the job was restarted after the maintenance shutdown) or if it was old. Additionally, if there were more than one of these files, then I’d also know that Maker was just starting at the beginning and writing data to a different location.

    This shows:

    1. Only one copy of Contig0.gff exists.

    2. Last modified date is 20180530.

  3. Check the slurm output file for info.

    This reveals this important piece of info:

    MAKER WARNING: The file 20180529_oly_annotation_01.maker.output/20180529_oly_annotation_01_datastore/AC/68/Contig215522//theVoid.Contig215522/0/Contig215522.0.all.rb.out
    did not finish on the last run

All of these taken together lead me to confidently conclude that Maker is not restarting from the beginning and is, indeed, continuing where it left off. WHEW!

Just for kicks, I also ran a count of GFF files to see where this stands so far:

Wow! 622,010 GFFs!!!

Finally, for posterity, here’s the SLURM script I used to submit this job, back in May! I’ll have all of the corresponding genome files, proteome files, transcriptome files, etc. on one of our servers once the job completes.


#!/bin/bash
## Job Name
#SBATCH --job-name=20180529_oly_maker_genome_annotation
## Allocation Definition
#SBATCH --account=srlab
#SBATCH --partition=srlab
## Resources
## Nodes
#SBATCH --nodes=1
## Walltime (days-hours:minutes:seconds format)
#SBATCH --time=30-00:00:00
## Memory per node
#SBATCH --mem=500G
##turn on e-mail notification
#SBATCH --mail-type=ALL
#SBATCH --mail-user=samwhite@uw.edu
## Specify the working directory for this job
#SBATCH --workdir=/gscratch/srlab/sam/outputs/20180529_oly_maker_genome_annotation

## Establish variables for more readable code

### Path to Maker executable
maker=/gscratch/srlab/programs/maker-2.31.10/bin/maker

### Path to Olympia oyster genome FastA file
oly_genome=/gscratch/srlab/sam/data/O_lurida/oly_genome_assemblies/jelly.out.fasta

### Path to Olympia oyster transcriptome FastA file
oly_transcriptome=/gscratch/srlab/sam/data/O_lurida/oly_transcriptome_assemblies/Olurida_transcriptome_v3.fasta

### Path to Crassotrea gigas NCBI protein FastA
gigas_proteome=/gscratch/srlab/sam/data/C_gigas/gigas_ncbi_protein/GCA_000297895.1_oyster_v9_protein.faa

### Path to Crassostrea virginica NCBI protein FastA
virginica_proteome=/gscratch/srlab/sam/data/C_virginica/virginica_ncbi_protein/GCF_002022765.2_C_virginica-3.0_protein.faa

## Create Maker control files needed for running Maker
$maker -CTL

## Store path to options control file
maker_opts_file=./maker_opts.ctl

## Create combined proteome FastA file
touch gigas_virginica_ncbi_proteomes.fasta
cat "$gigas_proteome" >> gigas_virginica_ncbi_proteomes.fasta
cat "$virginica_proteome" >> gigas_virginica_ncbi_proteomes.fasta

## Edit options file

### Set paths to O.lurida genome and transcriptome.
### Set path to combined C. gigas and C.virginica proteomes.
## The use of the % symbol sets the delimiter sed uses for arguments.
## Normally, the delimiter that most examples use is a slash "/".
## But, we need to expand the variables into a full path with slashes, which screws up sed.
## Thus, the use of % symbol instead (it could be any character that is NOT present in the expanded variable; doesn't have to be "%").
sed -i "/^genome=/ s% %$oly_genome %" "$maker_opts_file"
sed -i "/^est=/ s% %$oly_transcriptome %" "$maker_opts_file"
sed -i "/^protein=/ s% %$gigas_virginica_ncbi_proteomes %" "$maker_opts_file"

## Run Maker
### Set basename of files and specify number of CPUs to use
$maker \
-base 20180529_oly_annotation_01 \
-cpus 24

Transposable Element Mapping – Olympia Oyster Genome Assembly, Olurida_v081, using RepeatMasker 4.07

I previously performed this analysis using a different version of our Ostrea lurida genome assembly. Steven asked that I repeat the analysis with a modified version of the genome assembly (Olurida_v081) – only has contigs >1000bp in length.

Genome used: Olurida_v081

I ran RepeatMasker (v4.07) with RepBase-20170127 and RMBlast 2.6.0 four times:

  1. Default settings (i.e. no species select – will use human genome).

  2. Species = Crassostrea gigas (Pacific oyster)

  3. Species = Crassostrea virginica (Eastern oyster)

  4. Species = Ostrea lurida (Olympia oyster)

The idea was to get a sense of how the analyses would differ with species specifications. However, it’s likely that the only species setting that will make any difference will be Run #2 (Crassostrea gigas).

The reason I say this is that RepeatMasker has a built in tool to query which species are available in the RepBase database (e.g.):

RepeatMasker-4.0.7/util/queryRepeatDatabase.pl -species "crassostrea virginica" -stat

Here’s a very brief overview of what that yields:

  • Crassotrea gigas: 792 specific repeats

  • Crassostrea virginica: 4 Crassostrea virginica specific repeats

  • Ostrea lurida: 0 Ostrea lurida specific repeats

All runs were performed on roadrunner.

All commands were documented in a Jupyter Notebook (GitHub):

NOTE: RepeatMasker writes the desired output files (*.out, *.cat.gz, and *.gff) to the same directory that the genome is located in! If you conduct multiple runs with the same genome in the same directory, it will overwrite those files, as they are named using the genome assembly filename.


RESULTS:
RUN 1 (default settings – human genome)

Output folder:

Summary table (text):

Output table (GFF):

SUMMARY TABLE

==================================================
file name: Olurida_v081.fa          
sequences:        159429
total length: 1140787867 bp  (1077373535 bp excl N/X-runs)
GC level:         36.58 %
bases masked:   17954347 bp ( 1.67 %)
==================================================
               number of      length   percentage
               elements*    occupied  of sequence
--------------------------------------------------
SINEs:            16599       978030 bp    0.09 %
      ALUs            1          292 bp    0.00 %
      MIRs          937        72873 bp    0.01 %

LINEs:             3279       752631 bp    0.07 %
      LINE1         172        10882 bp    0.00 %
      LINE2         646        67827 bp    0.01 %
      L3/CR1        659        60327 bp    0.01 %

LTR elements:       569       127808 bp    0.01 %
      ERVL           32         1949 bp    0.00 %
      ERVL-MaLRs     10          490 bp    0.00 %
      ERV_classI    165        17699 bp    0.00 %
      ERV_classII    26         1590 bp    0.00 %

DNA elements:      1911       161957 bp    0.02 %
     hAT-Charlie     74         4216 bp    0.00 %
     TcMar-Tigger   584        24985 bp    0.00 %

Unclassified:        78         9834 bp    0.00 %

Total interspersed repeats:  2030260 bp    0.19 %


Small RNA:         5592       409456 bp    0.04 %

Satellites:         117        21278 bp    0.00 %
Simple repeats:  270784     12935570 bp    1.20 %
Low complexity:   42130      2568284 bp    0.24 %
==================================================

* most repeats fragmented by insertions or deletions
  have been counted as one element
  Runs of >=20 X/Ns in query were excluded in % calcs


The query species was assumed to be homo sapiens  
RepeatMasker Combined Database: Dfam_Consensus-20170127, RepBase-20170127
        
run with rmblastn version 2.6.0+

RUN 2 (species – Crassostrea gigas)

Output folder:

Summary table (text):

Output table (GFF):

SUMMARY TABLE

file name: Olurida_v081.fa          
sequences:        159429
total length: 1140787867 bp  (1077373535 bp excl N/X-runs)
GC level:         36.58 %
bases masked:  152816516 bp ( 14.18 %)
==================================================
               number of      length   percentage
               elements*    occupied  of sequence
--------------------------------------------------
Retroelements       193250     67253771 bp    6.24 %
   SINEs:             2087       284274 bp    0.03 %
   Penelope         158576     56080082 bp    5.21 %
   LINEs:           179430     61300904 bp    5.69 %
    CRE/SLACS            0            0 bp    0.00 %
     L2/CR1/Rex        675       348273 bp    0.03 %
     R1/LOA/Jockey       0            0 bp    0.00 %
     R2/R4/NeSL          7        10781 bp    0.00 %
     RTE/Bov-B        7051      1827344 bp    0.17 %
     L1/CIN4             0            0 bp    0.00 %
   LTR elements:     11733      5668593 bp    0.53 %
     BEL/Pao          1517       871288 bp    0.08 %
     Ty1/Copia          78        72481 bp    0.01 %
     Gypsy/DIRS1      9151      4445789 bp    0.41 %
       Retroviral        0            0 bp    0.00 %

DNA transposons     233691     33727339 bp    3.13 %
   hobo-Activator    17578      1886743 bp    0.18 %
   Tc1-IS630-Pogo    39184      6403235 bp    0.59 %
   En-Spm                0            0 bp    0.00 %
   MuDR-IS905            0            0 bp    0.00 %
   PiggyBac           7261      1003937 bp    0.09 %
   Tourist/Harbinger  8635       823434 bp    0.08 %
   Other (Mirage,        0            0 bp    0.00 %
    P-element, Transib)

Rolling-circles          0            0 bp    0.00 %

Unclassified:       157855     36675484 bp    3.40 %

Total interspersed repeats:   137656594 bp   12.78 %


Small RNA:             222        72690 bp    0.01 %

Satellites:           6260      1238331 bp    0.11 %
Simple repeats:     241081     11662466 bp    1.08 %
Low complexity:      38915      2347827 bp    0.22 %
==================================================

* most repeats fragmented by insertions or deletions
  have been counted as one element
  Runs of >=20 X/Ns in query were excluded in % calcs


The query species was assumed to be crassostrea gigas
RepeatMasker Combined Database: Dfam_Consensus-20170127, RepBase-20170127
        
run with rmblastn version 2.6.0+

RUN 3 (species – Crassostrea virginica)

Output folder:

Summary table (text):

Output table (GFF):

SUMMARY TABLE

file name: Olurida_v081.fa          
sequences:        159429
total length: 1140787867 bp  (1077373535 bp excl N/X-runs)
GC level:         36.58 %
bases masked:   36996910 bp ( 3.43 %)
==================================================
               number of      length   percentage
               elements*    occupied  of sequence
--------------------------------------------------
Retroelements        59806      9886111 bp    0.92 %
   SINEs:            59806      9886111 bp    0.92 %
   Penelope              0            0 bp    0.00 %
   LINEs:                0            0 bp    0.00 %
    CRE/SLACS            0            0 bp    0.00 %
     L2/CR1/Rex          0            0 bp    0.00 %
     R1/LOA/Jockey       0            0 bp    0.00 %
     R2/R4/NeSL          0            0 bp    0.00 %
     RTE/Bov-B           0            0 bp    0.00 %
     L1/CIN4             0            0 bp    0.00 %
   LTR elements:         0            0 bp    0.00 %
     BEL/Pao             0            0 bp    0.00 %
     Ty1/Copia           0            0 bp    0.00 %
     Gypsy/DIRS1         0            0 bp    0.00 %
       Retroviral        0            0 bp    0.00 %

DNA transposons       8720      2230426 bp    0.21 %
   hobo-Activator        0            0 bp    0.00 %
   Tc1-IS630-Pogo        0            0 bp    0.00 %
   En-Spm                0            0 bp    0.00 %
   MuDR-IS905            0            0 bp    0.00 %
   PiggyBac              0            0 bp    0.00 %
   Tourist/Harbinger     0            0 bp    0.00 %
   Other (Mirage,        0            0 bp    0.00 %
    P-element, Transib)

Rolling-circles          0            0 bp    0.00 %

Unclassified:        47005      9434652 bp    0.88 %

Total interspersed repeats:    21551189 bp    2.00 %


Small RNA:           60030      9959172 bp    0.92 %

Satellites:              8         5100 bp    0.00 %
Simple repeats:     259134     12795379 bp    1.19 %
Low complexity:      42184      2581162 bp    0.24 %
==================================================

* most repeats fragmented by insertions or deletions
  have been counted as one element
  Runs of >=20 X/Ns in query were excluded in % calcs


The query species was assumed to be crassostrea virginica
RepeatMasker Combined Database: Dfam_Consensus-20170127, RepBase-20170127
        
run with rmblastn version 2.6.0+

RUN 4 (species – Ostrea lurida)

Output folder:

Summary table (text):

Output table (GFF):

SUMMARY TABLE

==================================================
file name: Olurida_v081.fa          
sequences:        159429
total length: 1140787867 bp  (1077373535 bp excl N/X-runs)
GC level:         36.58 %
bases masked:   15918797 bp ( 1.48 %)
==================================================
               number of      length   percentage
               elements*    occupied  of sequence
--------------------------------------------------
Retroelements            0            0 bp    0.00 %
   SINEs:                0            0 bp    0.00 %
   Penelope              0            0 bp    0.00 %
   LINEs:                0            0 bp    0.00 %
    CRE/SLACS            0            0 bp    0.00 %
     L2/CR1/Rex          0            0 bp    0.00 %
     R1/LOA/Jockey       0            0 bp    0.00 %
     R2/R4/NeSL          0            0 bp    0.00 %
     RTE/Bov-B           0            0 bp    0.00 %
     L1/CIN4             0            0 bp    0.00 %
   LTR elements:         0            0 bp    0.00 %
     BEL/Pao             0            0 bp    0.00 %
     Ty1/Copia           0            0 bp    0.00 %
     Gypsy/DIRS1         0            0 bp    0.00 %
       Retroviral        0            0 bp    0.00 %

DNA transposons          0            0 bp    0.00 %
   hobo-Activator        0            0 bp    0.00 %
   Tc1-IS630-Pogo        0            0 bp    0.00 %
   En-Spm                0            0 bp    0.00 %
   MuDR-IS905            0            0 bp    0.00 %
   PiggyBac              0            0 bp    0.00 %
   Tourist/Harbinger     0            0 bp    0.00 %
   Other (Mirage,        0            0 bp    0.00 %
    P-element, Transib)

Rolling-circles          0            0 bp    0.00 %

Unclassified:            3          189 bp    0.00 %

Total interspersed repeats:         189 bp    0.00 %


Small RNA:             224        73061 bp    0.01 %

Satellites:              8         5100 bp    0.00 %
Simple repeats:     273098     13256460 bp    1.23 %
Low complexity:      42443      2592212 bp    0.24 %
==================================================

* most repeats fragmented by insertions or deletions
  have been counted as one element
  Runs of >=20 X/Ns in query were excluded in % calcs


The query species was assumed to be ostrea lurida 
RepeatMasker Combined Database: Dfam_Consensus-20170127, RepBase-20170127
        
run with rmblastn version 2.6.0+



BS-seq Mapping – Olympia oyster bisulfite sequencing: Bismark Continued

Previously took the analysis just through the mapping, but didn’t realize Steven wanted me to fully process the data.

So, as en exercise, I followed through with deduplication and sorting of the BAM files.

Then, ran a quick analysis using MethylKit in R. The analysis simply copied what Steven had done with another data set and I haven’t examined it very thoroughly, so am not well-versed on what it’s doing and/or why.

Jupyter Notebook (GitHub):

R Studio Project (download the folder, load project in R Studio, and then run the script in the scripts subdirectory to run the analysis):

Will take the full data sets through this whole pipeline.

Transposable Element Mapping – Olympia Oyster Genome Assembly using RepeatMasker 4.07

Steven wanted transposable elements (TEs) in the Olympia oyster genome identified.

After some minor struggles, I was able to get RepeatMasker installed on on both of our Apple Xserves (emu & roadrunner; running Ubuntu 16.04LTS).

Genome used: pbjelly_sjw_01

I ran RepeatMasker (v4.07) with RepBase-20170127 and RMBlast 2.6.0 four times:

  1. Default settings (i.e. no species select – will use human genome).

  2. Species = Crassostrea gigas (Pacific oyster)

  3. Species = Crassostrea virginica (Eastern oyster)

  4. Species = Ostrea lurida (Olympia oyster)

The idea was to get a sense of how the analyses would differ with species specifications. However, it’s likely that the only species setting that will make any difference will be Run #2 (Crassostrea gigas).

The reason I say this is that RepeatMasker has a built in tool to query which species are available in the RepBase database (e.g.):

RepeatMasker-4.0.7/util/queryRepeatDatabase.pl -species "crassostrea virginica" -stat

Here’s a very brief overview of what that yields:

  • Crassotrea gigas: 792 specific repeats

  • Crassostrea virginica: 4 Crassostrea virginica specific repeats

  • Ostrea lurida: 0 Ostrea lurida specific repeats

All runs were performed on roadrunner.

All commands were documented in a Jupyter Notebook (GitHub):

NOTE: RepeatMasker writes the desired output files (*.out, *.cat.gz, and *.gff) to the same directory that the genome is located in! If you conduct multiple runs with the same genome in the same directory, it will overwrite those files, as they are named using the genome assembly filename.


RESULTS:
RUN 1 (default settings – human genome)

Output folder:

Summary table (text):

Output table (GFF):

SUMMARY TABLE

==================================================
file name: jelly.out.fasta          
sequences:        696946
total length: 1253001795 bp  (1172226648 bp excl N/X-runs)
GC level:         36.51 %
bases masked:   20002806 bp ( 1.71 %)
==================================================
               number of      length   percentage
               elements*    occupied  of sequence
--------------------------------------------------
SINEs:            17794      1061170 bp    0.09 %
      ALUs          363        31340 bp    0.00 %
      MIRs         1166        92129 bp    0.01 %

LINEs:             4456       888114 bp    0.08 %
      LINE1         976       103929 bp    0.01 %
      LINE2         813        82891 bp    0.01 %
      L3/CR1        699        63627 bp    0.01 %

LTR elements:      1187       199118 bp    0.02 %
      ERVL          155        15828 bp    0.00 %
      ERVL-MaLRs    200        20737 bp    0.00 %
      ERV_classI    379        42833 bp    0.00 %
      ERV_classII    66         6896 bp    0.00 %

DNA elements:      2290       196866 bp    0.02 %
     hAT-Charlie    190        15468 bp    0.00 %
     TcMar-Tigger   732        37473 bp    0.00 %

Unclassified:       101        12946 bp    0.00 %

Total interspersed repeats:  2358214 bp    0.20 %


Small RNA:         5954       433422 bp    0.04 %

Satellites:         366        55705 bp    0.00 %
Simple repeats:  310641     14322152 bp    1.22 %
Low complexity:   47381      2844279 bp    0.24 %
==================================================

* most repeats fragmented by insertions or deletions
  have been counted as one element
  Runs of >=20 X/Ns in query were excluded in % calcs


The query species was assumed to be homo sapiens  
RepeatMasker Combined Database: Dfam_Consensus-20170127, RepBase-20170127
        
run with rmblastn version 2.6.0+

RUN 2 (species – Crassostrea gigas)

Output folder:

Summary table (text):

Output table (GFF):

SUMMARY TABLE

==================================================
file name: jelly.out.fasta          
sequences:        696946
total length: 1253001795 bp  (1172226648 bp excl N/X-runs)
GC level:         36.51 %
bases masked:  160759267 bp ( 13.71 %)
==================================================
               number of      length   percentage
               elements*    occupied  of sequence
--------------------------------------------------
Retroelements       213132     69887654 bp    5.96 %
   SINEs:             2374       311974 bp    0.03 %
   Penelope         171792     57862186 bp    4.94 %
   LINEs:           195605     63430615 bp    5.41 %
    CRE/SLACS            0            0 bp    0.00 %
     L2/CR1/Rex        731       357995 bp    0.03 %
     R1/LOA/Jockey       0            0 bp    0.00 %
     R2/R4/NeSL         13        11377 bp    0.00 %
     RTE/Bov-B        8085      1948581 bp    0.17 %
     L1/CIN4             0            0 bp    0.00 %
   LTR elements:     15153      6145065 bp    0.52 %
     BEL/Pao          2119       955773 bp    0.08 %
     Ty1/Copia         101        75372 bp    0.01 %
     Gypsy/DIRS1     11776      4815361 bp    0.41 %
       Retroviral        0            0 bp    0.00 %

DNA transposons     256292     35689117 bp    3.04 %
   hobo-Activator    19847      2059651 bp    0.18 %
   Tc1-IS630-Pogo    43269      6806311 bp    0.58 %
   En-Spm                0            0 bp    0.00 %
   MuDR-IS905            0            0 bp    0.00 %
   PiggyBac           7935      1060296 bp    0.09 %
   Tourist/Harbinger  9503       887332 bp    0.08 %
   Other (Mirage,        0            0 bp    0.00 %
    P-element, Transib)

Rolling-circles          0            0 bp    0.00 %

Unclassified:       174943     38299211 bp    3.27 %

Total interspersed repeats:   143875982 bp   12.27 %


Small RNA:             280        78768 bp    0.01 %

Satellites:           7383      1362194 bp    0.12 %
Simple repeats:     278809     12982714 bp    1.11 %
Low complexity:      44078      2622506 bp    0.22 %
==================================================

* most repeats fragmented by insertions or deletions
  have been counted as one element
  Runs of >=20 X/Ns in query were excluded in % calcs


The query species was assumed to be crassostrea gigas
RepeatMasker Combined Database: Dfam_Consensus-20170127, RepBase-20170127
        
run with rmblastn version 2.6.0+

RUN 3 (species – Crassostrea virginica)

Output folder:

Summary table (text):

Output table (GFF):

SUMMARY TABLE

==================================================
file name: jelly.out.fasta          
sequences:        696946
total length: 1253001795 bp  (1172226648 bp excl N/X-runs)
GC level:         36.51 %
bases masked:   39598953 bp ( 3.38 %)
==================================================
               number of      length   percentage
               elements*    occupied  of sequence
--------------------------------------------------
Retroelements        63882     10327611 bp    0.88 %
   SINEs:            63882     10327611 bp    0.88 %
   Penelope              0            0 bp    0.00 %
   LINEs:                0            0 bp    0.00 %
    CRE/SLACS            0            0 bp    0.00 %
     L2/CR1/Rex          0            0 bp    0.00 %
     R1/LOA/Jockey       0            0 bp    0.00 %
     R2/R4/NeSL          0            0 bp    0.00 %
     RTE/Bov-B           0            0 bp    0.00 %
     L1/CIN4             0            0 bp    0.00 %
   LTR elements:         0            0 bp    0.00 %
     BEL/Pao             0            0 bp    0.00 %
     Ty1/Copia           0            0 bp    0.00 %
     Gypsy/DIRS1         0            0 bp    0.00 %
       Retroviral        0            0 bp    0.00 %

DNA transposons       9433      2307292 bp    0.20 %
   hobo-Activator        0            0 bp    0.00 %
   Tc1-IS630-Pogo        0            0 bp    0.00 %
   En-Spm                0            0 bp    0.00 %
   MuDR-IS905            0            0 bp    0.00 %
   PiggyBac              0            0 bp    0.00 %
   Tourist/Harbinger     0            0 bp    0.00 %
   Other (Mirage,        0            0 bp    0.00 %
    P-element, Transib)

Rolling-circles          0            0 bp    0.00 %

Unclassified:        51558      9836468 bp    0.84 %

Total interspersed repeats:    22471371 bp    1.92 %


Small RNA:           64164     10406776 bp    0.89 %

Satellites:             10         5985 bp    0.00 %
Simple repeats:     298612     14185090 bp    1.21 %
Low complexity:      47510      2866522 bp    0.24 %
==================================================

* most repeats fragmented by insertions or deletions
  have been counted as one element
  Runs of >=20 X/Ns in query were excluded in % calcs


The query species was assumed to be crassostrea virginica
RepeatMasker Combined Database: Dfam_Consensus-20170127, RepBase-20170127
        
run with rmblastn version 2.6.0+

RUN 4 (species – Ostrea lurida)

Output folder:

Summary table (text):

Output table (GFF):

SUMMARY TABLE

==================================================
file name: jelly.out.fasta          
sequences:        696946
total length: 1253001795 bp  (1172226648 bp excl N/X-runs)
GC level:         36.51 %
bases masked:   17617763 bp ( 1.50 %)
==================================================
               number of      length   percentage
               elements*    occupied  of sequence
--------------------------------------------------
Retroelements            0            0 bp    0.00 %
   SINEs:                0            0 bp    0.00 %
   Penelope              0            0 bp    0.00 %
   LINEs:                0            0 bp    0.00 %
    CRE/SLACS            0            0 bp    0.00 %
     L2/CR1/Rex          0            0 bp    0.00 %
     R1/LOA/Jockey       0            0 bp    0.00 %
     R2/R4/NeSL          0            0 bp    0.00 %
     RTE/Bov-B           0            0 bp    0.00 %
     L1/CIN4             0            0 bp    0.00 %
   LTR elements:         0            0 bp    0.00 %
     BEL/Pao             0            0 bp    0.00 %
     Ty1/Copia           0            0 bp    0.00 %
     Gypsy/DIRS1         0            0 bp    0.00 %
       Retroviral        0            0 bp    0.00 %

DNA transposons          0            0 bp    0.00 %
   hobo-Activator        0            0 bp    0.00 %
   Tc1-IS630-Pogo        0            0 bp    0.00 %
   En-Spm                0            0 bp    0.00 %
   MuDR-IS905            0            0 bp    0.00 %
   PiggyBac              0            0 bp    0.00 %
   Tourist/Harbinger     0            0 bp    0.00 %
   Other (Mirage,        0            0 bp    0.00 %
    P-element, Transib)

Rolling-circles          0            0 bp    0.00 %

Unclassified:            3          189 bp    0.00 %

Total interspersed repeats:         189 bp    0.00 %


Small RNA:             282        79165 bp    0.01 %

Satellites:             10         5985 bp    0.00 %
Simple repeats:     313082     14662647 bp    1.25 %
Low complexity:      47785      2878201 bp    0.25 %
==================================================

* most repeats fragmented by insertions or deletions
  have been counted as one element
  Runs of >=20 X/Ns in query were excluded in % calcs


The query species was assumed to be ostrea lurida 
RepeatMasker Combined Database: Dfam_Consensus-20170127, RepBase-20170127
        
run with rmblastn version 2.6.0+