RNA Cleanup – Tanner Crab RNA Pools

Grace had previously pooled a set of crab RNA in preparation for RNAseq. Yesterday, we/she concentrated the samples and then quantified them. Unfortunately, Qubit results were not good (concentrations were far below the expected 20ng/uL) and the NanoDrop1000 results yielded awful looking curves.

In an attempt to figure out what was wrong, I decided to use the RNeasy Plus Mini Kit (Qiagen) on the three pools. I did this due to the poor spec curves seen in the NanoDrop1000 measurements. Additionally, all of the RNA pools had undissolved/insoluble bits floating around in them. My thinking was that excess contaminants/salts could be interfering with the Qubit assay. Removing these could/should enlighten us as to what the issue might be.

Followed the manufacturer’s protocol for the RNeasy MinElute Cleanup Kit (as the RNeasy Plus Mini Kit uses the same reagents/columns for RNA purification) for samples with <100uL.

Samples were quantified on the RobertsLab NanoDrop1000 (ThermoFisher) and the Qubit 3.0 (ThermoFisher) using the RNA high sensitivity (HS) Kit. Used 1uL of each sample.

Results:

Qubit (Google Sheet): 20180719_qubit_RNA_crab_pools

NanoDrop:

The NanoDrop did not detect any RNA in the samples.

The Qubit did not detect any RNA in Crab Pool 1. The other two samples had similar concentrations (~7ng/uL). This would mean a total of ~84ng of RNA was present in each of those two samples.

All pools were expected to have well over 1000ng of RNA.

Will have to think about what should be done, but I would lean towards running some “test” samples through the RNeasy Cleanup kit to see if that helps get us more accurate Qubit readings. I don’t know, though…

Mox – Olympia oyster genome annotation progress (using Maker 2.31.10)

TL;DR – It appears to be continuing where it left off!

I decided to spend some time to figure out what was actually happening, as it’s clear that the annotation process is going to need some additional time to run and may span an additional monthly maintenance shutdown.

This is great, because, otherwise, this will take an eternity to actually complete (particularly because we’d have to move the job to run on one of our lab’s computers – which pale in comparison to the specs of our Mox nodes).

However, it’s a bit shocking that this is taking this long, even on a Mox node!

I started annotating the Olympia oyster genome on 20180529. Since then, the job has been interrupted twice by monthly Mox maintenance (which happens on the 2nd Tuesday of each month). Additionally, when this happens, the SLURM output file is overwritten, making it difficult to assess whether or not Maker continues where it left off or if it’s starting over from scratch.

Anyway, here’s how I deduced that the program is continuing where it left off.

  1. I figured out that it produces a generic feature format (GFF) file for each contig.

  2. Decided to search for the first contig GFF and look at its last modified date. This would tell me if it was newly generated (i.e. on the date that the job was restarted after the maintenance shutdown) or if it was old. Additionally, if there were more than one of these files, then I’d also know that Maker was just starting at the beginning and writing data to a different location.

    This shows:

    1. Only one copy of Contig0.gff exists.

    2. Last modified date is 20180530.

  3. Check the slurm output file for info.

    This reveals this important piece of info:

    MAKER WARNING: The file 20180529_oly_annotation_01.maker.output/20180529_oly_annotation_01_datastore/AC/68/Contig215522//theVoid.Contig215522/0/Contig215522.0.all.rb.out
    did not finish on the last run

All of these taken together lead me to confidently conclude that Maker is not restarting from the beginning and is, indeed, continuing where it left off. WHEW!

Just for kicks, I also ran a count of GFF files to see where this stands so far:

Wow! 622,010 GFFs!!!
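
The checks above (locating the first contig GFF, reading its last-modified date, and counting GFF files) can be sketched roughly as follows. This is a minimal sketch and not necessarily the exact commands used for this post: the function names are made up for illustration, and `stat -c` assumes GNU coreutils (standard on Linux).

```shell
#!/bin/bash
# Hypothetical helpers for checking whether Maker resumed after a restart.
# Assumes GNU find/stat.

# List every copy of a contig's GFF under a directory, with its last-modified time.
# More than one hit would suggest Maker started over and wrote to a new location.
gff_mtimes() {
  local search_dir="$1" gff_name="$2"
  find "$search_dir" -type f -name "$gff_name" -exec stat -c '%y %n' {} +
}

# Count all GFF files produced so far.
count_gffs() {
  local search_dir="$1"
  find "$search_dir" -type f -name '*.gff' | wc -l
}

# Example calls (run from the job's working directory):
#   gff_mtimes 20180529_oly_annotation_01.maker.output Contig0.gff
#   count_gffs 20180529_oly_annotation_01.maker.output
```

A single hit with a modification date that predates the restart is the pattern that points to Maker resuming rather than starting over.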

Finally, for posterity, here’s the SLURM script I used to submit this job, back in May! I’ll have all of the corresponding genome files, proteome files, transcriptome files, etc. on one of our servers once the job completes.


#!/bin/bash
## Job Name
#SBATCH --job-name=20180529_oly_maker_genome_annotation
## Allocation Definition
#SBATCH --account=srlab
#SBATCH --partition=srlab
## Resources
## Nodes
#SBATCH --nodes=1
## Walltime (days-hours:minutes:seconds format)
#SBATCH --time=30-00:00:00
## Memory per node
#SBATCH --mem=500G
##turn on e-mail notification
#SBATCH --mail-type=ALL
#SBATCH --mail-user=samwhite@uw.edu
## Specify the working directory for this job
#SBATCH --workdir=/gscratch/srlab/sam/outputs/20180529_oly_maker_genome_annotation

## Establish variables for more readable code

### Path to Maker executable
maker=/gscratch/srlab/programs/maker-2.31.10/bin/maker

### Path to Olympia oyster genome FastA file
oly_genome=/gscratch/srlab/sam/data/O_lurida/oly_genome_assemblies/jelly.out.fasta

### Path to Olympia oyster transcriptome FastA file
oly_transcriptome=/gscratch/srlab/sam/data/O_lurida/oly_transcriptome_assemblies/Olurida_transcriptome_v3.fasta

### Path to Crassostrea gigas NCBI protein FastA
gigas_proteome=/gscratch/srlab/sam/data/C_gigas/gigas_ncbi_protein/GCA_000297895.1_oyster_v9_protein.faa

### Path to Crassostrea virginica NCBI protein FastA
virginica_proteome=/gscratch/srlab/sam/data/C_virginica/virginica_ncbi_protein/GCF_002022765.2_C_virginica-3.0_protein.faa

## Create Maker control files needed for running Maker
$maker -CTL

## Store path to options control file
maker_opts_file=./maker_opts.ctl

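## Create combined proteome FastA file
### Store the filename in a variable; the sed edit of the protein= line below expands it
gigas_virginica_ncbi_proteomes=gigas_virginica_ncbi_proteomes.fasta
### Use ">" (overwrite) so that re-running the script does not append duplicate sequences
cat "$gigas_proteome" "$virginica_proteome" > "$gigas_virginica_ncbi_proteomes"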

## Edit options file

### Set paths to O.lurida genome and transcriptome.
### Set path to combined C. gigas and C.virginica proteomes.
## The use of the % symbol sets the delimiter sed uses for arguments.
## Normally, the delimiter that most examples use is a slash "/".
## But, we need to expand the variables into a full path with slashes, which screws up sed.
## Thus, the use of % symbol instead (it could be any character that is NOT present in the expanded variable; doesn't have to be "%").
sed -i "/^genome=/ s% %$oly_genome %" "$maker_opts_file"
sed -i "/^est=/ s% %$oly_transcriptome %" "$maker_opts_file"
sed -i "/^protein=/ s% %$gigas_virginica_ncbi_proteomes %" "$maker_opts_file"

## Run Maker
### Set basename of files and specify number of CPUs to use
$maker \
-base 20180529_oly_annotation_01 \
-cpus 24
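
The sed edits in the script above can be tried on a throwaway file before touching a real maker_opts.ctl. The sketch below is illustrative only: the helper name `set_ctl_path`, the mock control-file lines, and the `/path/to/genome.fasta` path are all assumptions, and `sed -i` is used in its GNU form.

```shell
#!/bin/bash
# Demonstrate the "%"-delimited sed edit on a mock control file.
set_ctl_path() {
  # $1 = option name (e.g. genome), $2 = path to insert, $3 = control file
  # "%" is the sed delimiter, so the slashes in the path are left alone.
  # Only the first space on the matching line is replaced.
  sed -i "/^$1=/ s% %$2 %" "$3"
}

# Mock file that mimics three option lines (contents are illustrative only)
demo_ctl=$(mktemp)
printf '%s\n' \
  'genome= #genome sequence (fasta file)' \
  'est= #set of ESTs or assembled mRNA-seq in fasta format' \
  'protein= #protein sequence file in fasta format' > "$demo_ctl"

set_ctl_path genome /path/to/genome.fasta "$demo_ctl"

# Prints: genome=/path/to/genome.fasta #genome sequence (fasta file)
grep '^genome=' "$demo_ctl"
```

Note that if the variable holding the path is empty or undefined, the same sed command runs without error but leaves the option blank, so it is worth grepping the edited lines before launching a long job.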

Mox – Password-less SSH!

The high performance computing (HPC) cluster (called Mox) at Univ. of Washington (UW) frustratingly requires a password when SSH-ing, even when SSH keys are in use. I have a lengthy, unintelligible password that I use for my UW account, so having to type this in any time I want to initiate a new SSH session on Mox is a painful process.

Today, I finally got fed up with how much time I was wasting (granted, it’s minor in the grand scheme of my day) just logging in to Mox, so I spent some time figuring out how to automate password entry for a new SSH session with Mox.

I tried to handle this using the program sshpass, but I couldn’t get it to read my password from a file – it would just hang in limbo after executing the command.

In the end, I came across a bash script that does this perfectly. Steps to implement this on Ubuntu 16.04 LTS:

  1. Install expect:
    sudo apt install expect
  2. Create the following script (taken from this StackExchange solution: https://unix.stackexchange.com/questions/31071/shell-script-for-logging-into-a-ssh-server):
    
    #!/usr/bin/expect
    
    spawn ssh mox
    expect "Password:"
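    send "YOUR_UW_PASSWORD\r"
    interact
    

    NOTES:

    • I have an ~/.ssh/config file that allows me to use “mox” as an alias for my full SSH command

    • Replace YOUR_UW_PASSWORD with your own UW password (characters such as $, [ and \ need to be escaped with a backslash inside the double quotes).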

  3. Change access to script (set read, write, execute for user only):

    chmod u=rwx,go-rwx mox.sh
  4. Run script from home directory (saved in home directory):
    ./mox.sh

Boom! No more having to track down the password, copy, and paste!

Ubuntu – Fix “No Video Signal” Issue on Emu/Roadrunner

Both Apple Xserves (Emu/Roadrunner) running Ubuntu (16.04LTS) experienced the same issue – the monitor would indicate “No Video Signal”, would go dark, and wasn’t responsive to keyboard/mouse movements. However, you could ssh into both machines w/o issue.

Although having these machines be “headless” (i.e. with no display) is usually fine, it’s not ideal for a couple of reasons:

  1. Difficult to use for other lab members who aren’t as familiar with SSH – specifically if they would want to use a Jupyter Notebook remotely (this would require setting up a tunnel to their own computer).

  2. Can’t use Remmina Remote Desktop until a user has physically logged in from the Ubuntu login screen at least once, in order to launch Remmina.

The second aspect was the major impetus for me finally dealing with this. Managing long-running Jupyter Notebooks is much easier via remote desktop than via an SSH tunnel, since the tunnel limits my access to the Jupyter Notebook to the computer that has the tunnel set up.

Well, this led me down a horrible rabbit hole of Linux stuff that I won’t get fully into (particularly, since I didn’t understand most of it and can’t remember all the crazy stuff I read/tried).

However, here’s the gist:

  1. Needed to edit /etc/default/grub

  2. After editing, needed to update grub config file: sudo update-grub

Despite the fact that both machines are (or, should be) identical, I did not get the same results. The edits I made to the /etc/default/grub file on Emu worked immediately. The edits were:

  1. Add nomodeset to the following line, shown here after editing (this seemed to be the most common suggestion for fixing the “No Video Signal” issue):

GRUB_CMDLINE_LINUX_DEFAULT="quiet splash nomodeset"

  2. Comment out this line (it was triggering an error/warning about writing the config file when running the update-grub command):

#GRUB_HIDDEN_TIMEOUT=0
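
For anyone who would rather script the two edits than make them by hand, here is a minimal sketch. It is not what was actually run on Emu: the function name `patch_grub_defaults` is made up, it assumes GNU sed, and it is meant to be tried on a copy of the file before being pointed at /etc/default/grub.

```shell
#!/bin/bash
# Apply the two grub edits described above to the file given as $1.
patch_grub_defaults() {
  local grub_file="$1"
  # Append nomodeset inside the quoted value, unless it is already there.
  if ! grep -q '^GRUB_CMDLINE_LINUX_DEFAULT=.*nomodeset' "$grub_file"; then
    sed -i 's/^\(GRUB_CMDLINE_LINUX_DEFAULT="[^"]*\)"/\1 nomodeset"/' "$grub_file"
  fi
  # Comment out GRUB_HIDDEN_TIMEOUT (the line that triggered the update-grub warning).
  sed -i 's/^GRUB_HIDDEN_TIMEOUT=/#GRUB_HIDDEN_TIMEOUT=/' "$grub_file"
}

# Suggested dry run on a copy:
#   cp /etc/default/grub /tmp/grub.test
#   patch_grub_defaults /tmp/grub.test
#   diff /etc/default/grub /tmp/grub.test
# After editing the real file (with sudo), regenerate the config:
#   sudo update-grub
```

The function can be run more than once without adding a second nomodeset or a second comment marker.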

For some reason, Roadrunner did not take kindly to those changes and it took a long time to resolve, ending with changing permissions on ~/.Xauthority back to their original permissions (they got altered when I ran some command – sudo startx or something) to get out of a login loop.

Regardless, both are fixed, both can be used when physically sitting at the computer, and both can be accessed remotely using Remmina!

Transposable Element Mapping – Olympia Oyster Genome Assembly, Olurida_v081, using RepeatMasker 4.07

I previously performed this analysis using a different version of our Ostrea lurida genome assembly. Steven asked that I repeat the analysis with a modified version of the genome assembly (Olurida_v081) – only has contigs >1000bp in length.

Genome used: Olurida_v081

I ran RepeatMasker (v4.07) with RepBase-20170127 and RMBlast 2.6.0 four times:

  1. Default settings (i.e. no species selected – RepeatMasker assumes human).

  2. Species = Crassostrea gigas (Pacific oyster)

  3. Species = Crassostrea virginica (Eastern oyster)

  4. Species = Ostrea lurida (Olympia oyster)

The idea was to get a sense of how the analyses would differ with species specifications. However, it’s likely that the only species setting that will make any difference will be Run #2 (Crassostrea gigas).

The reason I say this is that RepeatMasker has a built in tool to query which species are available in the RepBase database (e.g.):

RepeatMasker-4.0.7/util/queryRepeatDatabase.pl -species "crassostrea virginica" -stat

Here’s a very brief overview of what that yields:

  • Crassostrea gigas: 792 specific repeats

  • Crassostrea virginica: 4 Crassostrea virginica specific repeats

  • Ostrea lurida: 0 Ostrea lurida specific repeats

All runs were performed on roadrunner.

All commands were documented in a Jupyter Notebook (GitHub):

NOTE: RepeatMasker writes the desired output files (*.out, *.cat.gz, and *.gff) to the same directory that the genome is located in! If you conduct multiple runs with the same genome in the same directory, it will overwrite those files, as they are named using the genome assembly filename.
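
One way to sidestep the overwriting problem is RepeatMasker's -dir option, which sends output to a directory of your choosing. The sketch below only prints the four commands rather than running them; the function name, run directory names, and CPU count are placeholders, and the actual commands used are in the Jupyter Notebook, not here.

```shell
#!/bin/bash
# Print one RepeatMasker command per run, each with its own output directory,
# so that later runs do not overwrite earlier results.
print_rm_cmds() {
  local genome="$1" cpus="$2"
  local species out_dir run=1
  # Run 1: default settings (no -species)
  out_dir="run0${run}_default"
  echo "mkdir -p $out_dir && RepeatMasker -pa $cpus -gff -dir $out_dir $genome"
  # Runs 2-4: one per species setting
  for species in "crassostrea gigas" "crassostrea virginica" "ostrea lurida"; do
    run=$((run + 1))
    out_dir="run0${run}_${species// /_}"
    echo "mkdir -p $out_dir && RepeatMasker -species \"$species\" -pa $cpus -gff -dir $out_dir $genome"
  done
}

# Print the commands for review; pipe the output to bash to actually run them.
print_rm_cmds Olurida_v081.fa 8
```

Keeping each run in its own directory also makes it easier to compare the summary tables afterwards.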


RESULTS:
RUN 1 (default settings – human genome)

Output folder:

Summary table (text):

Output table (GFF):

SUMMARY TABLE

==================================================
file name: Olurida_v081.fa          
sequences:        159429
total length: 1140787867 bp  (1077373535 bp excl N/X-runs)
GC level:         36.58 %
bases masked:   17954347 bp ( 1.67 %)
==================================================
               number of      length   percentage
               elements*    occupied  of sequence
--------------------------------------------------
SINEs:            16599       978030 bp    0.09 %
      ALUs            1          292 bp    0.00 %
      MIRs          937        72873 bp    0.01 %

LINEs:             3279       752631 bp    0.07 %
      LINE1         172        10882 bp    0.00 %
      LINE2         646        67827 bp    0.01 %
      L3/CR1        659        60327 bp    0.01 %

LTR elements:       569       127808 bp    0.01 %
      ERVL           32         1949 bp    0.00 %
      ERVL-MaLRs     10          490 bp    0.00 %
      ERV_classI    165        17699 bp    0.00 %
      ERV_classII    26         1590 bp    0.00 %

DNA elements:      1911       161957 bp    0.02 %
     hAT-Charlie     74         4216 bp    0.00 %
     TcMar-Tigger   584        24985 bp    0.00 %

Unclassified:        78         9834 bp    0.00 %

Total interspersed repeats:  2030260 bp    0.19 %


Small RNA:         5592       409456 bp    0.04 %

Satellites:         117        21278 bp    0.00 %
Simple repeats:  270784     12935570 bp    1.20 %
Low complexity:   42130      2568284 bp    0.24 %
==================================================

* most repeats fragmented by insertions or deletions
  have been counted as one element
  Runs of >=20 X/Ns in query were excluded in % calcs


The query species was assumed to be homo sapiens  
RepeatMasker Combined Database: Dfam_Consensus-20170127, RepBase-20170127
        
run with rmblastn version 2.6.0+

RUN 2 (species – Crassostrea gigas)

Output folder:

Summary table (text):

Output table (GFF):

SUMMARY TABLE

==================================================
file name: Olurida_v081.fa          
sequences:        159429
total length: 1140787867 bp  (1077373535 bp excl N/X-runs)
GC level:         36.58 %
bases masked:  152816516 bp ( 14.18 %)
==================================================
               number of      length   percentage
               elements*    occupied  of sequence
--------------------------------------------------
Retroelements       193250     67253771 bp    6.24 %
   SINEs:             2087       284274 bp    0.03 %
   Penelope         158576     56080082 bp    5.21 %
   LINEs:           179430     61300904 bp    5.69 %
    CRE/SLACS            0            0 bp    0.00 %
     L2/CR1/Rex        675       348273 bp    0.03 %
     R1/LOA/Jockey       0            0 bp    0.00 %
     R2/R4/NeSL          7        10781 bp    0.00 %
     RTE/Bov-B        7051      1827344 bp    0.17 %
     L1/CIN4             0            0 bp    0.00 %
   LTR elements:     11733      5668593 bp    0.53 %
     BEL/Pao          1517       871288 bp    0.08 %
     Ty1/Copia          78        72481 bp    0.01 %
     Gypsy/DIRS1      9151      4445789 bp    0.41 %
       Retroviral        0            0 bp    0.00 %

DNA transposons     233691     33727339 bp    3.13 %
   hobo-Activator    17578      1886743 bp    0.18 %
   Tc1-IS630-Pogo    39184      6403235 bp    0.59 %
   En-Spm                0            0 bp    0.00 %
   MuDR-IS905            0            0 bp    0.00 %
   PiggyBac           7261      1003937 bp    0.09 %
   Tourist/Harbinger  8635       823434 bp    0.08 %
   Other (Mirage,        0            0 bp    0.00 %
    P-element, Transib)

Rolling-circles          0            0 bp    0.00 %

Unclassified:       157855     36675484 bp    3.40 %

Total interspersed repeats:   137656594 bp   12.78 %


Small RNA:             222        72690 bp    0.01 %

Satellites:           6260      1238331 bp    0.11 %
Simple repeats:     241081     11662466 bp    1.08 %
Low complexity:      38915      2347827 bp    0.22 %
==================================================

* most repeats fragmented by insertions or deletions
  have been counted as one element
  Runs of >=20 X/Ns in query were excluded in % calcs


The query species was assumed to be crassostrea gigas
RepeatMasker Combined Database: Dfam_Consensus-20170127, RepBase-20170127
        
run with rmblastn version 2.6.0+

RUN 3 (species – Crassostrea virginica)

Output folder:

Summary table (text):

Output table (GFF):

SUMMARY TABLE

==================================================
file name: Olurida_v081.fa          
sequences:        159429
total length: 1140787867 bp  (1077373535 bp excl N/X-runs)
GC level:         36.58 %
bases masked:   36996910 bp ( 3.43 %)
==================================================
               number of      length   percentage
               elements*    occupied  of sequence
--------------------------------------------------
Retroelements        59806      9886111 bp    0.92 %
   SINEs:            59806      9886111 bp    0.92 %
   Penelope              0            0 bp    0.00 %
   LINEs:                0            0 bp    0.00 %
    CRE/SLACS            0            0 bp    0.00 %
     L2/CR1/Rex          0            0 bp    0.00 %
     R1/LOA/Jockey       0            0 bp    0.00 %
     R2/R4/NeSL          0            0 bp    0.00 %
     RTE/Bov-B           0            0 bp    0.00 %
     L1/CIN4             0            0 bp    0.00 %
   LTR elements:         0            0 bp    0.00 %
     BEL/Pao             0            0 bp    0.00 %
     Ty1/Copia           0            0 bp    0.00 %
     Gypsy/DIRS1         0            0 bp    0.00 %
       Retroviral        0            0 bp    0.00 %

DNA transposons       8720      2230426 bp    0.21 %
   hobo-Activator        0            0 bp    0.00 %
   Tc1-IS630-Pogo        0            0 bp    0.00 %
   En-Spm                0            0 bp    0.00 %
   MuDR-IS905            0            0 bp    0.00 %
   PiggyBac              0            0 bp    0.00 %
   Tourist/Harbinger     0            0 bp    0.00 %
   Other (Mirage,        0            0 bp    0.00 %
    P-element, Transib)

Rolling-circles          0            0 bp    0.00 %

Unclassified:        47005      9434652 bp    0.88 %

Total interspersed repeats:    21551189 bp    2.00 %


Small RNA:           60030      9959172 bp    0.92 %

Satellites:              8         5100 bp    0.00 %
Simple repeats:     259134     12795379 bp    1.19 %
Low complexity:      42184      2581162 bp    0.24 %
==================================================

* most repeats fragmented by insertions or deletions
  have been counted as one element
  Runs of >=20 X/Ns in query were excluded in % calcs


The query species was assumed to be crassostrea virginica
RepeatMasker Combined Database: Dfam_Consensus-20170127, RepBase-20170127
        
run with rmblastn version 2.6.0+

RUN 4 (species – Ostrea lurida)

Output folder:

Summary table (text):

Output table (GFF):

SUMMARY TABLE

==================================================
file name: Olurida_v081.fa          
sequences:        159429
total length: 1140787867 bp  (1077373535 bp excl N/X-runs)
GC level:         36.58 %
bases masked:   15918797 bp ( 1.48 %)
==================================================
               number of      length   percentage
               elements*    occupied  of sequence
--------------------------------------------------
Retroelements            0            0 bp    0.00 %
   SINEs:                0            0 bp    0.00 %
   Penelope              0            0 bp    0.00 %
   LINEs:                0            0 bp    0.00 %
    CRE/SLACS            0            0 bp    0.00 %
     L2/CR1/Rex          0            0 bp    0.00 %
     R1/LOA/Jockey       0            0 bp    0.00 %
     R2/R4/NeSL          0            0 bp    0.00 %
     RTE/Bov-B           0            0 bp    0.00 %
     L1/CIN4             0            0 bp    0.00 %
   LTR elements:         0            0 bp    0.00 %
     BEL/Pao             0            0 bp    0.00 %
     Ty1/Copia           0            0 bp    0.00 %
     Gypsy/DIRS1         0            0 bp    0.00 %
       Retroviral        0            0 bp    0.00 %

DNA transposons          0            0 bp    0.00 %
   hobo-Activator        0            0 bp    0.00 %
   Tc1-IS630-Pogo        0            0 bp    0.00 %
   En-Spm                0            0 bp    0.00 %
   MuDR-IS905            0            0 bp    0.00 %
   PiggyBac              0            0 bp    0.00 %
   Tourist/Harbinger     0            0 bp    0.00 %
   Other (Mirage,        0            0 bp    0.00 %
    P-element, Transib)

Rolling-circles          0            0 bp    0.00 %

Unclassified:            3          189 bp    0.00 %

Total interspersed repeats:         189 bp    0.00 %


Small RNA:             224        73061 bp    0.01 %

Satellites:              8         5100 bp    0.00 %
Simple repeats:     273098     13256460 bp    1.23 %
Low complexity:      42443      2592212 bp    0.24 %
==================================================

* most repeats fragmented by insertions or deletions
  have been counted as one element
  Runs of >=20 X/Ns in query were excluded in % calcs


The query species was assumed to be ostrea lurida 
RepeatMasker Combined Database: Dfam_Consensus-20170127, RepBase-20170127
        
run with rmblastn version 2.6.0+
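
To compare the four runs side by side, the headline numbers can be pulled out of the summary tables with a little awk. This is a minimal sketch under a few assumptions: the function name is made up, and it expects each summary table to be available as a text file in the layout shown above (RepeatMasker writes these with a .tbl extension).

```shell
#!/bin/bash
# Pull the "bases masked" and "Total interspersed repeats" lines out of one or
# more RepeatMasker summary tables and print tab-separated columns:
# file, bases masked (bp), bases masked (%), interspersed repeats (bp)
summarize_rm_tables() {
  awk '
    function print_row() { printf "%s\t%s\t%s\t%s\n", file, masked, pct, interspersed }
    FNR == 1 && NR > 1 { print_row() }   # starting a new file: emit the previous row
    FNR == 1           { file = FILENAME; masked = "NA"; pct = "NA"; interspersed = "NA" }
    /^bases masked:/   { masked = $3; pct = $6 }
    /^Total interspersed repeats:/ { interspersed = $4 }
    END                { if (NR > 0) print_row() }
  ' "$@"
}

# Example call (the paths are placeholders):
#   summarize_rm_tables run01_default/*.tbl run02_crassostrea_gigas/*.tbl
```

For the runs above, this would put the jump from ~1.5% masked (default and O.lurida settings) to ~14% masked (C.gigas setting) in a single small table.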



Library Construction – Geoduck Water Filter Metagenome with Nextera DNA Flex Kit (Illumina)

Made Illumina libraries with geoduck metagenome water filter DNA I previously isolated on:

We used a free Nextera DNA Flex Kit (Illumina) that we won in a contest held by Illumina!

Followed the manufacturer’s protocol for input DNA quantities <10ng with the following changes/notes:

  • PCR steps performed in 200uL thin-walled PCR tubes.

  • Magnetic separations were performed in 1.7mL snap cap tubes.

  • Thermalcycler: PTC-200 (MJ Research)

  • Magnet: DynaMag 2 (Invitrogen)

See the Library Calcs sheet (link below) for original sample names and subsequent library sample names.

IMPORTANT!

The sheet also contains the indexes used for each library. This info will be necessary for sequencing facility.

Library Calcs (Google Sheet):

Links to the Illumina manuals are below:

After library construction was completed, individual libraries were quantified on the Roberts Lab Qubit 3.0 (Invitrogen) with the Qubit 1x dsDNA HS Assay Kit.

2uL of each sample was used for each assay.

Library quality was assessed using the Seeb Lab 2100 Bioanalyzer (Agilent) with a High Sensitivity DNA Kit, using 1uL of each sample.

Libraries were stored in the small -20C in FTR213:


Results:

Qubit Raw Data (Google Sheet):

Bioanalyzer File (XAD):

All libraries have DNA in them, so that’s good!

All libraries except one (Library Geoduck MG #04, which is bad) look OK (i.e. not great). Compared to the example on Pg. 12 in the manual, these libraries all have some extra high molecular weight stuff.

When selecting the range listed in the Nextera Kit manual, the average fragment size is ~530bp – the expected size should be ~600bp.

Spoke with Steven about Library Geoduck MG #04 and we’ve opted to just leave it out.

All other samples were pooled into a single sample according to the manufacturer’s protocol.

This pooled sample was stored in the same -20C box as above, in position I4.

BS-seq Mapping – Olympia oyster bisulfite sequencing: Bismark Continued

Previously took the analysis just through the mapping, but didn’t realize Steven wanted me to fully process the data.

So, as an exercise, I followed through with deduplication and sorting of the BAM files.

Then, ran a quick analysis using MethylKit in R. The analysis simply copied what Steven had done with another data set and I haven’t examined it very thoroughly, so I’m not well-versed in what it’s doing and/or why.

Jupyter Notebook (GitHub):

R Studio Project (download the folder, load project in R Studio, and then run the script in the scripts subdirectory to run the analysis):

Will take the full data sets through this whole pipeline.

Transposable Element Mapping – Crassostrea virginica NCBI Genome Assembly using RepeatMasker 4.07

Genome used: NCBI GCA_002022765.4_C_virginica-3.0

I ran RepeatMasker (v4.07) with RepBase-20170127 and RMBlast 2.6.0 with species set to Crassostrea virginica.

All commands were documented in a Jupyter Notebook (GitHub):


RESULTS:

Output folder:

Output table (GFF):

Summary table (text):

==================================================
file name: GCF_002022765.2_C_virginica-3.0_genomic.fasta
sequences:            11
total length:  684741128 bp  (684675328 bp excl N/X-runs)
GC level:         34.83 %
bases masked:   46637065 bp ( 6.81 %)
==================================================
               number of      length   percentage
               elements*    occupied  of sequence
--------------------------------------------------
Retroelements        43139      8952068 bp    1.31 %
   SINEs:            43139      8952068 bp    1.31 %
   Penelope              0            0 bp    0.00 %
   LINEs:                0            0 bp    0.00 %
    CRE/SLACS            0            0 bp    0.00 %
     L2/CR1/Rex          0            0 bp    0.00 %
     R1/LOA/Jockey       0            0 bp    0.00 %
     R2/R4/NeSL          0            0 bp    0.00 %
     RTE/Bov-B           0            0 bp    0.00 %
     L1/CIN4             0            0 bp    0.00 %
   LTR elements:         0            0 bp    0.00 %
     BEL/Pao             0            0 bp    0.00 %
     Ty1/Copia           0            0 bp    0.00 %
     Gypsy/DIRS1         0            0 bp    0.00 %
       Retroviral        0            0 bp    0.00 %

DNA transposons       3538      1564942 bp    0.23 %
   hobo-Activator        0            0 bp    0.00 %
   Tc1-IS630-Pogo        0            0 bp    0.00 %
   En-Spm                0            0 bp    0.00 %
   MuDR-IS905            0            0 bp    0.00 %
   PiggyBac              0            0 bp    0.00 %
   Tourist/Harbinger     0            0 bp    0.00 %
   Other (Mirage,        0            0 bp    0.00 %
    P-element, Transib)

Rolling-circles          0            0 bp    0.00 %

Unclassified:        65151     23982146 bp    3.50 %

Total interspersed repeats:    34499156 bp    5.04 %


Small RNA:           43353      8992879 bp    1.31 %

Satellites:              1          222 bp    0.00 %
Simple repeats:     232627     10544162 bp    1.54 %
Low complexity:      29762      1561018 bp    0.23 %
==================================================

* most repeats fragmented by insertions or deletions
  have been counted as one element
  Runs of >=20 X/Ns in query were excluded in % calcs


The query species was assumed to be crassostrea virginica
RepeatMasker Combined Database: Dfam_Consensus-20170127, RepBase-20170127
        
run with rmblastn version 2.6.0+

Transposable Element Mapping – Olympia Oyster Genome Assembly using RepeatMasker 4.07

Steven wanted transposable elements (TEs) in the Olympia oyster genome identified.

After some minor struggles, I was able to get RepeatMasker installed on both of our Apple Xserves (emu & roadrunner; running Ubuntu 16.04LTS).

Genome used: pbjelly_sjw_01

I ran RepeatMasker (v4.07) with RepBase-20170127 and RMBlast 2.6.0 four times:

  1. Default settings (i.e. no species selected – RepeatMasker assumes human).

  2. Species = Crassostrea gigas (Pacific oyster)

  3. Species = Crassostrea virginica (Eastern oyster)

  4. Species = Ostrea lurida (Olympia oyster)

The idea was to get a sense of how the analyses would differ with species specifications. However, it’s likely that the only species setting that will make any difference will be Run #2 (Crassostrea gigas).

The reason I say this is that RepeatMasker has a built in tool to query which species are available in the RepBase database (e.g.):

RepeatMasker-4.0.7/util/queryRepeatDatabase.pl -species "crassostrea virginica" -stat

Here’s a very brief overview of what that yields:

  • Crassostrea gigas: 792 specific repeats

  • Crassostrea virginica: 4 Crassostrea virginica specific repeats

  • Ostrea lurida: 0 Ostrea lurida specific repeats

All runs were performed on roadrunner.

All commands were documented in a Jupyter Notebook (GitHub):

NOTE: RepeatMasker writes the desired output files (*.out, *.cat.gz, and *.gff) to the same directory that the genome is located in! If you conduct multiple runs with the same genome in the same directory, it will overwrite those files, as they are named using the genome assembly filename.


RESULTS:
RUN 1 (default settings – human genome)

Output folder:

Summary table (text):

Output table (GFF):

SUMMARY TABLE

==================================================
file name: jelly.out.fasta          
sequences:        696946
total length: 1253001795 bp  (1172226648 bp excl N/X-runs)
GC level:         36.51 %
bases masked:   20002806 bp ( 1.71 %)
==================================================
               number of      length   percentage
               elements*    occupied  of sequence
--------------------------------------------------
SINEs:            17794      1061170 bp    0.09 %
      ALUs          363        31340 bp    0.00 %
      MIRs         1166        92129 bp    0.01 %

LINEs:             4456       888114 bp    0.08 %
      LINE1         976       103929 bp    0.01 %
      LINE2         813        82891 bp    0.01 %
      L3/CR1        699        63627 bp    0.01 %

LTR elements:      1187       199118 bp    0.02 %
      ERVL          155        15828 bp    0.00 %
      ERVL-MaLRs    200        20737 bp    0.00 %
      ERV_classI    379        42833 bp    0.00 %
      ERV_classII    66         6896 bp    0.00 %

DNA elements:      2290       196866 bp    0.02 %
     hAT-Charlie    190        15468 bp    0.00 %
     TcMar-Tigger   732        37473 bp    0.00 %

Unclassified:       101        12946 bp    0.00 %

Total interspersed repeats:  2358214 bp    0.20 %


Small RNA:         5954       433422 bp    0.04 %

Satellites:         366        55705 bp    0.00 %
Simple repeats:  310641     14322152 bp    1.22 %
Low complexity:   47381      2844279 bp    0.24 %
==================================================

* most repeats fragmented by insertions or deletions
  have been counted as one element
  Runs of >=20 X/Ns in query were excluded in % calcs


The query species was assumed to be homo sapiens  
RepeatMasker Combined Database: Dfam_Consensus-20170127, RepBase-20170127
        
run with rmblastn version 2.6.0+

RUN 2 (species – Crassostrea gigas)

Output folder:

Summary table (text):

Output table (GFF):

SUMMARY TABLE

==================================================
file name: jelly.out.fasta          
sequences:        696946
total length: 1253001795 bp  (1172226648 bp excl N/X-runs)
GC level:         36.51 %
bases masked:  160759267 bp ( 13.71 %)
==================================================
               number of      length   percentage
               elements*    occupied  of sequence
--------------------------------------------------
Retroelements       213132     69887654 bp    5.96 %
   SINEs:             2374       311974 bp    0.03 %
   Penelope         171792     57862186 bp    4.94 %
   LINEs:           195605     63430615 bp    5.41 %
    CRE/SLACS            0            0 bp    0.00 %
     L2/CR1/Rex        731       357995 bp    0.03 %
     R1/LOA/Jockey       0            0 bp    0.00 %
     R2/R4/NeSL         13        11377 bp    0.00 %
     RTE/Bov-B        8085      1948581 bp    0.17 %
     L1/CIN4             0            0 bp    0.00 %
   LTR elements:     15153      6145065 bp    0.52 %
     BEL/Pao          2119       955773 bp    0.08 %
     Ty1/Copia         101        75372 bp    0.01 %
     Gypsy/DIRS1     11776      4815361 bp    0.41 %
       Retroviral        0            0 bp    0.00 %

DNA transposons     256292     35689117 bp    3.04 %
   hobo-Activator    19847      2059651 bp    0.18 %
   Tc1-IS630-Pogo    43269      6806311 bp    0.58 %
   En-Spm                0            0 bp    0.00 %
   MuDR-IS905            0            0 bp    0.00 %
   PiggyBac           7935      1060296 bp    0.09 %
   Tourist/Harbinger  9503       887332 bp    0.08 %
   Other (Mirage,        0            0 bp    0.00 %
    P-element, Transib)

Rolling-circles          0            0 bp    0.00 %

Unclassified:       174943     38299211 bp    3.27 %

Total interspersed repeats:   143875982 bp   12.27 %


Small RNA:             280        78768 bp    0.01 %

Satellites:           7383      1362194 bp    0.12 %
Simple repeats:     278809     12982714 bp    1.11 %
Low complexity:      44078      2622506 bp    0.22 %
==================================================

* most repeats fragmented by insertions or deletions
  have been counted as one element
  Runs of >=20 X/Ns in query were excluded in % calcs


The query species was assumed to be crassostrea gigas
RepeatMasker Combined Database: Dfam_Consensus-20170127, RepBase-20170127
        
run with rmblastn version 2.6.0+
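
As a quick sanity check on how the percentages in the table are derived, here is a minimal sketch that recomputes the "bases masked" percentage for RUN 2. The numbers are copied from the table above; the variable names are my own. It assumes, based on the footnote "Runs of >=20 X/Ns in query were excluded in % calcs", that the denominator is the length excluding N/X-runs rather than the full assembly length.

```python
# Values copied from the RUN 2 summary table above.
total_length = 1_253_001_795    # bp, including N/X-runs
length_excl_n = 1_172_226_648   # bp, excluding runs of >=20 N/X
bases_masked = 160_759_267      # bp

# Percentage relative to the N/X-excluded length (assumed to be what the table reports).
pct_excl = 100 * bases_masked / length_excl_n

# Percentage relative to the full length, shown only for comparison.
pct_total = 100 * bases_masked / total_length

print(f"masked, excluding N/X-runs: {pct_excl:.2f} %")
print(f"masked, of full length:     {pct_total:.2f} %")
```

The first value matches the 13.71 % reported in the table, which supports the assumption about the denominator. The second is lower because the N/X-runs inflate the total length.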

RUN 3 (species – Crassostrea virginica)

Output folder:

Summary table (text):

Output table (GFF):

SUMMARY TABLE

==================================================
file name: jelly.out.fasta          
sequences:        696946
total length: 1253001795 bp  (1172226648 bp excl N/X-runs)
GC level:         36.51 %
bases masked:   39598953 bp ( 3.38 %)
==================================================
               number of      length   percentage
               elements*    occupied  of sequence
--------------------------------------------------
Retroelements        63882     10327611 bp    0.88 %
   SINEs:            63882     10327611 bp    0.88 %
   Penelope              0            0 bp    0.00 %
   LINEs:                0            0 bp    0.00 %
    CRE/SLACS            0            0 bp    0.00 %
     L2/CR1/Rex          0            0 bp    0.00 %
     R1/LOA/Jockey       0            0 bp    0.00 %
     R2/R4/NeSL          0            0 bp    0.00 %
     RTE/Bov-B           0            0 bp    0.00 %
     L1/CIN4             0            0 bp    0.00 %
   LTR elements:         0            0 bp    0.00 %
     BEL/Pao             0            0 bp    0.00 %
     Ty1/Copia           0            0 bp    0.00 %
     Gypsy/DIRS1         0            0 bp    0.00 %
       Retroviral        0            0 bp    0.00 %

DNA transposons       9433      2307292 bp    0.20 %
   hobo-Activator        0            0 bp    0.00 %
   Tc1-IS630-Pogo        0            0 bp    0.00 %
   En-Spm                0            0 bp    0.00 %
   MuDR-IS905            0            0 bp    0.00 %
   PiggyBac              0            0 bp    0.00 %
   Tourist/Harbinger     0            0 bp    0.00 %
   Other (Mirage,        0            0 bp    0.00 %
    P-element, Transib)

Rolling-circles          0            0 bp    0.00 %

Unclassified:        51558      9836468 bp    0.84 %

Total interspersed repeats:    22471371 bp    1.92 %


Small RNA:           64164     10406776 bp    0.89 %

Satellites:             10         5985 bp    0.00 %
Simple repeats:     298612     14185090 bp    1.21 %
Low complexity:      47510      2866522 bp    0.24 %
==================================================

* most repeats fragmented by insertions or deletions
  have been counted as one element
  Runs of >=20 X/Ns in query were excluded in % calcs


The query species was assumed to be crassostrea virginica
RepeatMasker Combined Database: Dfam_Consensus-20170127, RepBase-20170127
        
run with rmblastn version 2.6.0+

RUN 4 (species – Ostrea lurida)

Output folder:

Summary table (text):

Output table (GFF):

SUMMARY TABLE

==================================================
file name: jelly.out.fasta          
sequences:        696946
total length: 1253001795 bp  (1172226648 bp excl N/X-runs)
GC level:         36.51 %
bases masked:   17617763 bp ( 1.50 %)
==================================================
               number of      length   percentage
               elements*    occupied  of sequence
--------------------------------------------------
Retroelements            0            0 bp    0.00 %
   SINEs:                0            0 bp    0.00 %
   Penelope              0            0 bp    0.00 %
   LINEs:                0            0 bp    0.00 %
    CRE/SLACS            0            0 bp    0.00 %
     L2/CR1/Rex          0            0 bp    0.00 %
     R1/LOA/Jockey       0            0 bp    0.00 %
     R2/R4/NeSL          0            0 bp    0.00 %
     RTE/Bov-B           0            0 bp    0.00 %
     L1/CIN4             0            0 bp    0.00 %
   LTR elements:         0            0 bp    0.00 %
     BEL/Pao             0            0 bp    0.00 %
     Ty1/Copia           0            0 bp    0.00 %
     Gypsy/DIRS1         0            0 bp    0.00 %
       Retroviral        0            0 bp    0.00 %

DNA transposons          0            0 bp    0.00 %
   hobo-Activator        0            0 bp    0.00 %
   Tc1-IS630-Pogo        0            0 bp    0.00 %
   En-Spm                0            0 bp    0.00 %
   MuDR-IS905            0            0 bp    0.00 %
   PiggyBac              0            0 bp    0.00 %
   Tourist/Harbinger     0            0 bp    0.00 %
   Other (Mirage,        0            0 bp    0.00 %
    P-element, Transib)

Rolling-circles          0            0 bp    0.00 %

Unclassified:            3          189 bp    0.00 %

Total interspersed repeats:         189 bp    0.00 %


Small RNA:             282        79165 bp    0.01 %

Satellites:             10         5985 bp    0.00 %
Simple repeats:     313082     14662647 bp    1.25 %
Low complexity:      47785      2878201 bp    0.25 %
==================================================

* most repeats fragmented by insertions or deletions
  have been counted as one element
  Runs of >=20 X/Ns in query were excluded in % calcs


The query species was assumed to be ostrea lurida 
RepeatMasker Combined Database: Dfam_Consensus-20170127, RepBase-20170127
        
run with rmblastn version 2.6.0+
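
To compare runs without reading each table by eye, the headline numbers can be pulled out with a small parser. This is only a sketch: `parse_summary` is a hypothetical helper of mine, not part of RepeatMasker, and the regular expressions assume the line layout shown in the tables above. The excerpt below is copied from the RUN 4 table.

```python
import re


def parse_summary(text):
    """Pull a few headline numbers (in bp) out of a RepeatMasker summary table."""

    def grab(pattern):
        # Return the first captured integer, or None if the line is absent.
        m = re.search(pattern, text, flags=re.MULTILINE)
        return int(m.group(1)) if m else None

    return {
        "excl_n": grab(r"^total length:\s+\d+ bp\s+\((\d+) bp excl N/X-runs\)"),
        "masked": grab(r"^bases masked:\s+(\d+) bp"),
        "interspersed": grab(r"^Total interspersed repeats:\s+(\d+) bp"),
    }


# Headline lines copied from the RUN 4 (Ostrea lurida) table above.
run4 = """\
total length: 1253001795 bp  (1172226648 bp excl N/X-runs)
bases masked:   17617763 bp ( 1.50 %)
Total interspersed repeats:         189 bp    0.00 %
"""

stats = parse_summary(run4)
pct = 100 * stats["masked"] / stats["excl_n"]
print(stats)
print(f"bases masked: {pct:.2f} %")
```

Running the same function over the full text of each run's summary table would give a compact side-by-side view of bases masked and total interspersed repeats per species setting.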