Tag Archives: jupyter notebook

Transposable Element Mapping – Olympia Oyster Genome Assembly, Olurida_v081, using RepeatMasker 4.07

I previously performed this analysis using a different version of our Ostrea lurida genome assembly. Steven asked that I repeat the analysis with a modified version of the genome assembly (Olurida_v081), which retains only contigs longer than 1000 bp.
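
How Olurida_v081 was actually produced is not shown in this post, so the following is only a sketch of how a contig length filter of this kind could be written with awk. The function name, its arguments, and the file names in the usage comment are hypothetical.

```shell
# Keep only FASTA records whose sequence is longer than a minimum length.
# Illustrative sketch; not the command used to build Olurida_v081.
filter_fasta_min_len() {
  # $1 = minimum length in bp (records must be strictly longer)
  # $2 = input FASTA file (multi-line records are supported)
  awk -v min="$1" '
    function emit() {
      if (header != "" && length(seq) > min) {
        print header
        print seq          # sequence is written on a single line
      }
    }
    /^>/ { emit(); header = $0; seq = ""; next }
         { seq = seq $0 }
    END  { emit() }
  ' "$2"
}

# Hypothetical usage:
# filter_fasta_min_len 1000 assembly.fa > assembly_gt1000bp.fa
```

Note that this sketch unwraps each sequence onto one line, which RepeatMasker and most other tools accept.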

Genome used: Olurida_v081

I ran RepeatMasker (v4.07) with RepBase-20170127 and RMBlast 2.6.0 four times:

  1. Default settings (i.e. no species selected; RepeatMasker falls back to its human repeat library).

  2. Species = Crassostrea gigas (Pacific oyster)

  3. Species = Crassostrea virginica (Eastern oyster)

  4. Species = Ostrea lurida (Olympia oyster)

The idea was to get a sense of how the analyses would differ with species specifications. However, it’s likely that the only species setting that will make any difference will be Run #2 (Crassostrea gigas).

The reason I say this is that RepeatMasker has a built-in tool to query which species are available in the RepBase database (e.g.):

RepeatMasker-4.0.7/util/queryRepeatDatabase.pl -species "crassostrea virginica" -stat

Here’s a very brief overview of what that yields:

  • Crassostrea gigas: 792 species-specific repeats
  • Crassostrea virginica: 4 species-specific repeats
  • Ostrea lurida: 0 species-specific repeats

All runs were performed on roadrunner.

All commands were documented in a Jupyter Notebook (GitHub):

NOTE: RepeatMasker writes the desired output files (*.out, *.cat.gz, and *.gff) to the same directory that the genome is located in! If you conduct multiple runs with the same genome in the same directory, it will overwrite those files, as they are named using the genome assembly filename.
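
One way to sidestep the overwriting problem is RepeatMasker's -dir option, which sends output to a directory of your choosing. The sketch below only prints one command per run for review; it is not the set of commands from the notebook, and the run directory names are hypothetical.

```shell
# Print (not run) one RepeatMasker command per species setting, each with
# its own -dir output directory so runs cannot overwrite one another.
# Directory names and the genome path are hypothetical.
repeatmasker_cmd() {
  # $1 = genome FASTA, $2 = output directory, $3 = species ("" = default)
  if [ -n "$3" ]; then
    printf 'mkdir -p %s && RepeatMasker -gff -species "%s" -dir %s %s\n' "$2" "$3" "$2" "$1"
  else
    printf 'mkdir -p %s && RepeatMasker -gff -dir %s %s\n' "$2" "$2" "$1"
  fi
}

repeatmasker_cmd Olurida_v081.fa run01_default ""
repeatmasker_cmd Olurida_v081.fa run02_cgigas "crassostrea gigas"
repeatmasker_cmd Olurida_v081.fa run03_cvirginica "crassostrea virginica"
repeatmasker_cmd Olurida_v081.fa run04_olurida "ostrea lurida"
```

Once the printed commands look right, they can be pasted into a terminal or piped to sh.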


RESULTS:
RUN 1 (default settings – human genome)

Output folder:

Summary table (text):

Output table (GFF):

SUMMARY TABLE

==================================================
file name: Olurida_v081.fa          
sequences:        159429
total length: 1140787867 bp  (1077373535 bp excl N/X-runs)
GC level:         36.58 %
bases masked:   17954347 bp ( 1.67 %)
==================================================
               number of      length   percentage
               elements*    occupied  of sequence
--------------------------------------------------
SINEs:            16599       978030 bp    0.09 %
      ALUs            1          292 bp    0.00 %
      MIRs          937        72873 bp    0.01 %

LINEs:             3279       752631 bp    0.07 %
      LINE1         172        10882 bp    0.00 %
      LINE2         646        67827 bp    0.01 %
      L3/CR1        659        60327 bp    0.01 %

LTR elements:       569       127808 bp    0.01 %
      ERVL           32         1949 bp    0.00 %
      ERVL-MaLRs     10          490 bp    0.00 %
      ERV_classI    165        17699 bp    0.00 %
      ERV_classII    26         1590 bp    0.00 %

DNA elements:      1911       161957 bp    0.02 %
     hAT-Charlie     74         4216 bp    0.00 %
     TcMar-Tigger   584        24985 bp    0.00 %

Unclassified:        78         9834 bp    0.00 %

Total interspersed repeats:  2030260 bp    0.19 %


Small RNA:         5592       409456 bp    0.04 %

Satellites:         117        21278 bp    0.00 %
Simple repeats:  270784     12935570 bp    1.20 %
Low complexity:   42130      2568284 bp    0.24 %
==================================================

* most repeats fragmented by insertions or deletions
  have been counted as one element
  Runs of >=20 X/Ns in query were excluded in % calcs


The query species was assumed to be homo sapiens  
RepeatMasker Combined Database: Dfam_Consensus-20170127, RepBase-20170127
        
run with rmblastn version 2.6.0+

RUN 2 (species – Crassostrea gigas)

Output folder:

Summary table (text):

Output table (GFF):

SUMMARY TABLE

==================================================
file name: Olurida_v081.fa          
sequences:        159429
total length: 1140787867 bp  (1077373535 bp excl N/X-runs)
GC level:         36.58 %
bases masked:  152816516 bp ( 14.18 %)
==================================================
               number of      length   percentage
               elements*    occupied  of sequence
--------------------------------------------------
Retroelements       193250     67253771 bp    6.24 %
   SINEs:             2087       284274 bp    0.03 %
   Penelope         158576     56080082 bp    5.21 %
   LINEs:           179430     61300904 bp    5.69 %
    CRE/SLACS            0            0 bp    0.00 %
     L2/CR1/Rex        675       348273 bp    0.03 %
     R1/LOA/Jockey       0            0 bp    0.00 %
     R2/R4/NeSL          7        10781 bp    0.00 %
     RTE/Bov-B        7051      1827344 bp    0.17 %
     L1/CIN4             0            0 bp    0.00 %
   LTR elements:     11733      5668593 bp    0.53 %
     BEL/Pao          1517       871288 bp    0.08 %
     Ty1/Copia          78        72481 bp    0.01 %
     Gypsy/DIRS1      9151      4445789 bp    0.41 %
       Retroviral        0            0 bp    0.00 %

DNA transposons     233691     33727339 bp    3.13 %
   hobo-Activator    17578      1886743 bp    0.18 %
   Tc1-IS630-Pogo    39184      6403235 bp    0.59 %
   En-Spm                0            0 bp    0.00 %
   MuDR-IS905            0            0 bp    0.00 %
   PiggyBac           7261      1003937 bp    0.09 %
   Tourist/Harbinger  8635       823434 bp    0.08 %
   Other (Mirage,        0            0 bp    0.00 %
    P-element, Transib)

Rolling-circles          0            0 bp    0.00 %

Unclassified:       157855     36675484 bp    3.40 %

Total interspersed repeats:   137656594 bp   12.78 %


Small RNA:             222        72690 bp    0.01 %

Satellites:           6260      1238331 bp    0.11 %
Simple repeats:     241081     11662466 bp    1.08 %
Low complexity:      38915      2347827 bp    0.22 %
==================================================

* most repeats fragmented by insertions or deletions
  have been counted as one element
  Runs of >=20 X/Ns in query were excluded in % calcs


The query species was assumed to be crassostrea gigas
RepeatMasker Combined Database: Dfam_Consensus-20170127, RepBase-20170127
        
run with rmblastn version 2.6.0+

RUN 3 (species – Crassostrea virginica)

Output folder:

Summary table (text):

Output table (GFF):

SUMMARY TABLE

==================================================
file name: Olurida_v081.fa          
sequences:        159429
total length: 1140787867 bp  (1077373535 bp excl N/X-runs)
GC level:         36.58 %
bases masked:   36996910 bp ( 3.43 %)
==================================================
               number of      length   percentage
               elements*    occupied  of sequence
--------------------------------------------------
Retroelements        59806      9886111 bp    0.92 %
   SINEs:            59806      9886111 bp    0.92 %
   Penelope              0            0 bp    0.00 %
   LINEs:                0            0 bp    0.00 %
    CRE/SLACS            0            0 bp    0.00 %
     L2/CR1/Rex          0            0 bp    0.00 %
     R1/LOA/Jockey       0            0 bp    0.00 %
     R2/R4/NeSL          0            0 bp    0.00 %
     RTE/Bov-B           0            0 bp    0.00 %
     L1/CIN4             0            0 bp    0.00 %
   LTR elements:         0            0 bp    0.00 %
     BEL/Pao             0            0 bp    0.00 %
     Ty1/Copia           0            0 bp    0.00 %
     Gypsy/DIRS1         0            0 bp    0.00 %
       Retroviral        0            0 bp    0.00 %

DNA transposons       8720      2230426 bp    0.21 %
   hobo-Activator        0            0 bp    0.00 %
   Tc1-IS630-Pogo        0            0 bp    0.00 %
   En-Spm                0            0 bp    0.00 %
   MuDR-IS905            0            0 bp    0.00 %
   PiggyBac              0            0 bp    0.00 %
   Tourist/Harbinger     0            0 bp    0.00 %
   Other (Mirage,        0            0 bp    0.00 %
    P-element, Transib)

Rolling-circles          0            0 bp    0.00 %

Unclassified:        47005      9434652 bp    0.88 %

Total interspersed repeats:    21551189 bp    2.00 %


Small RNA:           60030      9959172 bp    0.92 %

Satellites:              8         5100 bp    0.00 %
Simple repeats:     259134     12795379 bp    1.19 %
Low complexity:      42184      2581162 bp    0.24 %
==================================================

* most repeats fragmented by insertions or deletions
  have been counted as one element
  Runs of >=20 X/Ns in query were excluded in % calcs


The query species was assumed to be crassostrea virginica
RepeatMasker Combined Database: Dfam_Consensus-20170127, RepBase-20170127
        
run with rmblastn version 2.6.0+

RUN 4 (species – Ostrea lurida)

Output folder:

Summary table (text):

Output table (GFF):

SUMMARY TABLE

==================================================
file name: Olurida_v081.fa          
sequences:        159429
total length: 1140787867 bp  (1077373535 bp excl N/X-runs)
GC level:         36.58 %
bases masked:   15918797 bp ( 1.48 %)
==================================================
               number of      length   percentage
               elements*    occupied  of sequence
--------------------------------------------------
Retroelements            0            0 bp    0.00 %
   SINEs:                0            0 bp    0.00 %
   Penelope              0            0 bp    0.00 %
   LINEs:                0            0 bp    0.00 %
    CRE/SLACS            0            0 bp    0.00 %
     L2/CR1/Rex          0            0 bp    0.00 %
     R1/LOA/Jockey       0            0 bp    0.00 %
     R2/R4/NeSL          0            0 bp    0.00 %
     RTE/Bov-B           0            0 bp    0.00 %
     L1/CIN4             0            0 bp    0.00 %
   LTR elements:         0            0 bp    0.00 %
     BEL/Pao             0            0 bp    0.00 %
     Ty1/Copia           0            0 bp    0.00 %
     Gypsy/DIRS1         0            0 bp    0.00 %
       Retroviral        0            0 bp    0.00 %

DNA transposons          0            0 bp    0.00 %
   hobo-Activator        0            0 bp    0.00 %
   Tc1-IS630-Pogo        0            0 bp    0.00 %
   En-Spm                0            0 bp    0.00 %
   MuDR-IS905            0            0 bp    0.00 %
   PiggyBac              0            0 bp    0.00 %
   Tourist/Harbinger     0            0 bp    0.00 %
   Other (Mirage,        0            0 bp    0.00 %
    P-element, Transib)

Rolling-circles          0            0 bp    0.00 %

Unclassified:            3          189 bp    0.00 %

Total interspersed repeats:         189 bp    0.00 %


Small RNA:             224        73061 bp    0.01 %

Satellites:              8         5100 bp    0.00 %
Simple repeats:     273098     13256460 bp    1.23 %
Low complexity:      42443      2592212 bp    0.24 %
==================================================

* most repeats fragmented by insertions or deletions
  have been counted as one element
  Runs of >=20 X/Ns in query were excluded in % calcs


The query species was assumed to be ostrea lurida 
RepeatMasker Combined Database: Dfam_Consensus-20170127, RepBase-20170127
        
run with rmblastn version 2.6.0+
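
To compare the four runs at a glance, the "bases masked" figure can be pulled out of each summary table (the *.tbl file RepeatMasker writes). The sketch below is one way to do that with awk, assuming the table layout shown above; it also recomputes the percentage against the length excluding N/X-runs, which is the denominator the tables use. The function name and the directory names in the usage comment are hypothetical.

```shell
# Print "<bases masked><TAB><percent masked>" for one RepeatMasker
# summary table. The percentage is recomputed from the length that
# excludes N/X-runs, matching the footnote in the table.
masked_summary() {
  # $1 = path to a RepeatMasker *.tbl file
  awk '
    /^total length:/ { gsub(/\(/, "", $5); effective = $5 }
    /^bases masked:/ { printf "%s\t%.2f\n", $3, 100 * $3 / effective }
  ' "$1"
}

# Hypothetical usage, one directory per run:
# for run in run01_default run02_cgigas run03_cvirginica run04_olurida; do
#   printf '%s\t' "$run"; masked_summary "$run/Olurida_v081.fa.tbl"
# done
```

For Run 1, for example, 17954347 / 1077373535 gives the 1.67 % reported in the table.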




Transposable Element Mapping – Crassostrea virginica NCBI Genome Assembly using RepeatMasker 4.07

Genome used: NCBI GCA_002022765.4_C_virginica-3.0

I ran RepeatMasker (v4.07) with RepBase-20170127 and RMBlast 2.6.0 with species set to Crassostrea virginica.

All commands were documented in a Jupyter Notebook (GitHub):


RESULTS:

Output folder:

Output table (GFF):

Summary table (text):

==================================================
file name: GCF_002022765.2_C_virginica-3.0_genomic.fasta
sequences:            11
total length:  684741128 bp  (684675328 bp excl N/X-runs)
GC level:         34.83 %
bases masked:   46637065 bp ( 6.81 %)
==================================================
               number of      length   percentage
               elements*    occupied  of sequence
--------------------------------------------------
Retroelements        43139      8952068 bp    1.31 %
   SINEs:            43139      8952068 bp    1.31 %
   Penelope              0            0 bp    0.00 %
   LINEs:                0            0 bp    0.00 %
    CRE/SLACS            0            0 bp    0.00 %
     L2/CR1/Rex          0            0 bp    0.00 %
     R1/LOA/Jockey       0            0 bp    0.00 %
     R2/R4/NeSL          0            0 bp    0.00 %
     RTE/Bov-B           0            0 bp    0.00 %
     L1/CIN4             0            0 bp    0.00 %
   LTR elements:         0            0 bp    0.00 %
     BEL/Pao             0            0 bp    0.00 %
     Ty1/Copia           0            0 bp    0.00 %
     Gypsy/DIRS1         0            0 bp    0.00 %
       Retroviral        0            0 bp    0.00 %

DNA transposons       3538      1564942 bp    0.23 %
   hobo-Activator        0            0 bp    0.00 %
   Tc1-IS630-Pogo        0            0 bp    0.00 %
   En-Spm                0            0 bp    0.00 %
   MuDR-IS905            0            0 bp    0.00 %
   PiggyBac              0            0 bp    0.00 %
   Tourist/Harbinger     0            0 bp    0.00 %
   Other (Mirage,        0            0 bp    0.00 %
    P-element, Transib)

Rolling-circles          0            0 bp    0.00 %

Unclassified:        65151     23982146 bp    3.50 %

Total interspersed repeats:    34499156 bp    5.04 %


Small RNA:           43353      8992879 bp    1.31 %

Satellites:              1          222 bp    0.00 %
Simple repeats:     232627     10544162 bp    1.54 %
Low complexity:      29762      1561018 bp    0.23 %
==================================================

* most repeats fragmented by insertions or deletions
  have been counted as one element
  Runs of >=20 X/Ns in query were excluded in % calcs


The query species was assumed to be crassostrea virginica
RepeatMasker Combined Database: Dfam_Consensus-20170127, RepBase-20170127
        
run with rmblastn version 2.6.0+


Transposable Element Mapping – Olympia Oyster Genome Assembly using RepeatMasker 4.07

Steven wanted transposable elements (TEs) in the Olympia oyster genome identified.

After some minor struggles, I was able to get RepeatMasker installed on both of our Apple Xserves (emu & roadrunner; running Ubuntu 16.04 LTS).

Genome used: pbjelly_sjw_01

I ran RepeatMasker (v4.07) with RepBase-20170127 and RMBlast 2.6.0 four times:

  1. Default settings (i.e. no species selected; RepeatMasker falls back to its human repeat library).

  2. Species = Crassostrea gigas (Pacific oyster)

  3. Species = Crassostrea virginica (Eastern oyster)

  4. Species = Ostrea lurida (Olympia oyster)

The idea was to get a sense of how the analyses would differ with species specifications. However, it’s likely that the only species setting that will make any difference will be Run #2 (Crassostrea gigas).

The reason I say this is that RepeatMasker has a built-in tool to query which species are available in the RepBase database (e.g.):

RepeatMasker-4.0.7/util/queryRepeatDatabase.pl -species "crassostrea virginica" -stat

Here’s a very brief overview of what that yields:

  • Crassostrea gigas: 792 species-specific repeats
  • Crassostrea virginica: 4 species-specific repeats
  • Ostrea lurida: 0 species-specific repeats

All runs were performed on roadrunner.

All commands were documented in a Jupyter Notebook (GitHub):

NOTE: RepeatMasker writes the desired output files (*.out, *.cat.gz, and *.gff) to the same directory that the genome is located in! If you conduct multiple runs with the same genome in the same directory, it will overwrite those files, as they are named using the genome assembly filename.


RESULTS:
RUN 1 (default settings – human genome)

Output folder:

Summary table (text):

Output table (GFF):

SUMMARY TABLE

==================================================
file name: jelly.out.fasta          
sequences:        696946
total length: 1253001795 bp  (1172226648 bp excl N/X-runs)
GC level:         36.51 %
bases masked:   20002806 bp ( 1.71 %)
==================================================
               number of      length   percentage
               elements*    occupied  of sequence
--------------------------------------------------
SINEs:            17794      1061170 bp    0.09 %
      ALUs          363        31340 bp    0.00 %
      MIRs         1166        92129 bp    0.01 %

LINEs:             4456       888114 bp    0.08 %
      LINE1         976       103929 bp    0.01 %
      LINE2         813        82891 bp    0.01 %
      L3/CR1        699        63627 bp    0.01 %

LTR elements:      1187       199118 bp    0.02 %
      ERVL          155        15828 bp    0.00 %
      ERVL-MaLRs    200        20737 bp    0.00 %
      ERV_classI    379        42833 bp    0.00 %
      ERV_classII    66         6896 bp    0.00 %

DNA elements:      2290       196866 bp    0.02 %
     hAT-Charlie    190        15468 bp    0.00 %
     TcMar-Tigger   732        37473 bp    0.00 %

Unclassified:       101        12946 bp    0.00 %

Total interspersed repeats:  2358214 bp    0.20 %


Small RNA:         5954       433422 bp    0.04 %

Satellites:         366        55705 bp    0.00 %
Simple repeats:  310641     14322152 bp    1.22 %
Low complexity:   47381      2844279 bp    0.24 %
==================================================

* most repeats fragmented by insertions or deletions
  have been counted as one element
  Runs of >=20 X/Ns in query were excluded in % calcs


The query species was assumed to be homo sapiens  
RepeatMasker Combined Database: Dfam_Consensus-20170127, RepBase-20170127
        
run with rmblastn version 2.6.0+

RUN 2 (species – Crassostrea gigas)

Output folder:

Summary table (text):

Output table (GFF):

SUMMARY TABLE

==================================================
file name: jelly.out.fasta          
sequences:        696946
total length: 1253001795 bp  (1172226648 bp excl N/X-runs)
GC level:         36.51 %
bases masked:  160759267 bp ( 13.71 %)
==================================================
               number of      length   percentage
               elements*    occupied  of sequence
--------------------------------------------------
Retroelements       213132     69887654 bp    5.96 %
   SINEs:             2374       311974 bp    0.03 %
   Penelope         171792     57862186 bp    4.94 %
   LINEs:           195605     63430615 bp    5.41 %
    CRE/SLACS            0            0 bp    0.00 %
     L2/CR1/Rex        731       357995 bp    0.03 %
     R1/LOA/Jockey       0            0 bp    0.00 %
     R2/R4/NeSL         13        11377 bp    0.00 %
     RTE/Bov-B        8085      1948581 bp    0.17 %
     L1/CIN4             0            0 bp    0.00 %
   LTR elements:     15153      6145065 bp    0.52 %
     BEL/Pao          2119       955773 bp    0.08 %
     Ty1/Copia         101        75372 bp    0.01 %
     Gypsy/DIRS1     11776      4815361 bp    0.41 %
       Retroviral        0            0 bp    0.00 %

DNA transposons     256292     35689117 bp    3.04 %
   hobo-Activator    19847      2059651 bp    0.18 %
   Tc1-IS630-Pogo    43269      6806311 bp    0.58 %
   En-Spm                0            0 bp    0.00 %
   MuDR-IS905            0            0 bp    0.00 %
   PiggyBac           7935      1060296 bp    0.09 %
   Tourist/Harbinger  9503       887332 bp    0.08 %
   Other (Mirage,        0            0 bp    0.00 %
    P-element, Transib)

Rolling-circles          0            0 bp    0.00 %

Unclassified:       174943     38299211 bp    3.27 %

Total interspersed repeats:   143875982 bp   12.27 %


Small RNA:             280        78768 bp    0.01 %

Satellites:           7383      1362194 bp    0.12 %
Simple repeats:     278809     12982714 bp    1.11 %
Low complexity:      44078      2622506 bp    0.22 %
==================================================

* most repeats fragmented by insertions or deletions
  have been counted as one element
  Runs of >=20 X/Ns in query were excluded in % calcs


The query species was assumed to be crassostrea gigas
RepeatMasker Combined Database: Dfam_Consensus-20170127, RepBase-20170127
        
run with rmblastn version 2.6.0+

RUN 3 (species – Crassostrea virginica)

Output folder:

Summary table (text):

Output table (GFF):

SUMMARY TABLE

==================================================
file name: jelly.out.fasta          
sequences:        696946
total length: 1253001795 bp  (1172226648 bp excl N/X-runs)
GC level:         36.51 %
bases masked:   39598953 bp ( 3.38 %)
==================================================
               number of      length   percentage
               elements*    occupied  of sequence
--------------------------------------------------
Retroelements        63882     10327611 bp    0.88 %
   SINEs:            63882     10327611 bp    0.88 %
   Penelope              0            0 bp    0.00 %
   LINEs:                0            0 bp    0.00 %
    CRE/SLACS            0            0 bp    0.00 %
     L2/CR1/Rex          0            0 bp    0.00 %
     R1/LOA/Jockey       0            0 bp    0.00 %
     R2/R4/NeSL          0            0 bp    0.00 %
     RTE/Bov-B           0            0 bp    0.00 %
     L1/CIN4             0            0 bp    0.00 %
   LTR elements:         0            0 bp    0.00 %
     BEL/Pao             0            0 bp    0.00 %
     Ty1/Copia           0            0 bp    0.00 %
     Gypsy/DIRS1         0            0 bp    0.00 %
       Retroviral        0            0 bp    0.00 %

DNA transposons       9433      2307292 bp    0.20 %
   hobo-Activator        0            0 bp    0.00 %
   Tc1-IS630-Pogo        0            0 bp    0.00 %
   En-Spm                0            0 bp    0.00 %
   MuDR-IS905            0            0 bp    0.00 %
   PiggyBac              0            0 bp    0.00 %
   Tourist/Harbinger     0            0 bp    0.00 %
   Other (Mirage,        0            0 bp    0.00 %
    P-element, Transib)

Rolling-circles          0            0 bp    0.00 %

Unclassified:        51558      9836468 bp    0.84 %

Total interspersed repeats:    22471371 bp    1.92 %


Small RNA:           64164     10406776 bp    0.89 %

Satellites:             10         5985 bp    0.00 %
Simple repeats:     298612     14185090 bp    1.21 %
Low complexity:      47510      2866522 bp    0.24 %
==================================================

* most repeats fragmented by insertions or deletions
  have been counted as one element
  Runs of >=20 X/Ns in query were excluded in % calcs


The query species was assumed to be crassostrea virginica
RepeatMasker Combined Database: Dfam_Consensus-20170127, RepBase-20170127
        
run with rmblastn version 2.6.0+

RUN 4 (species – Ostrea lurida)

Output folder:

Summary table (text):

Output table (GFF):

SUMMARY TABLE

==================================================
file name: jelly.out.fasta          
sequences:        696946
total length: 1253001795 bp  (1172226648 bp excl N/X-runs)
GC level:         36.51 %
bases masked:   17617763 bp ( 1.50 %)
==================================================
               number of      length   percentage
               elements*    occupied  of sequence
--------------------------------------------------
Retroelements            0            0 bp    0.00 %
   SINEs:                0            0 bp    0.00 %
   Penelope              0            0 bp    0.00 %
   LINEs:                0            0 bp    0.00 %
    CRE/SLACS            0            0 bp    0.00 %
     L2/CR1/Rex          0            0 bp    0.00 %
     R1/LOA/Jockey       0            0 bp    0.00 %
     R2/R4/NeSL          0            0 bp    0.00 %
     RTE/Bov-B           0            0 bp    0.00 %
     L1/CIN4             0            0 bp    0.00 %
   LTR elements:         0            0 bp    0.00 %
     BEL/Pao             0            0 bp    0.00 %
     Ty1/Copia           0            0 bp    0.00 %
     Gypsy/DIRS1         0            0 bp    0.00 %
       Retroviral        0            0 bp    0.00 %

DNA transposons          0            0 bp    0.00 %
   hobo-Activator        0            0 bp    0.00 %
   Tc1-IS630-Pogo        0            0 bp    0.00 %
   En-Spm                0            0 bp    0.00 %
   MuDR-IS905            0            0 bp    0.00 %
   PiggyBac              0            0 bp    0.00 %
   Tourist/Harbinger     0            0 bp    0.00 %
   Other (Mirage,        0            0 bp    0.00 %
    P-element, Transib)

Rolling-circles          0            0 bp    0.00 %

Unclassified:            3          189 bp    0.00 %

Total interspersed repeats:         189 bp    0.00 %


Small RNA:             282        79165 bp    0.01 %

Satellites:             10         5985 bp    0.00 %
Simple repeats:     313082     14662647 bp    1.25 %
Low complexity:      47785      2878201 bp    0.25 %
==================================================

* most repeats fragmented by insertions or deletions
  have been counted as one element
  Runs of >=20 X/Ns in query were excluded in % calcs


The query species was assumed to be ostrea lurida 
RepeatMasker Combined Database: Dfam_Consensus-20170127, RepBase-20170127
        
run with rmblastn version 2.6.0+


TrimGalore/FastQC/MultiQC – TrimGalore! RRBS Geoduck BS-seq FASTQ data (directional)

Earlier this week, I ran TrimGalore!, but a copy/paste mistake left the --non_directional trimming option set, which is wrong for these directional libraries. I re-ran TrimGalore! with the correct settings.

Steven requested that I trim the Geoduck RRBS libraries that we have, in preparation to run them through Bismark.

These libraries were originally created by Hollie Putnam using the TruSeq DNA Methylation Kit (Illumina):

All analysis is documented in a Jupyter Notebook; see link below.

Overview of process:

  1. Run TrimGalore! with --paired and --rrbs settings.

  2. Run FastQC and MultiQC on trimmed files.

  3. Copy all data to owl (see Results below for link).

  4. Confirm data integrity via MD5 checksums.
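
As a rough illustration of step 1, the sketch below prints a trim_galore command for each read pair in a directory. It is not the notebook's code: the _R1_/_R2_ file naming, the function name, and the output directory are assumptions, and the commands are only printed so they can be checked before running.

```shell
# Print a trim_galore command (paired-end, RRBS mode) for every R1/R2
# pair in a directory. Assumes Illumina-style names containing _R1_ and
# _R2_; adjust the pattern for other naming schemes.
trim_galore_cmds() {
  # $1 = directory holding FastQ files, $2 = output directory
  for r1 in "$1"/*_R1_*.fastq.gz; do
    [ -e "$r1" ] || continue                 # no matching files
    base=${r1##*/}                           # strip the directory part
    r2="$1/$(printf '%s' "$base" | sed 's/_R1_/_R2_/')"
    printf 'trim_galore --paired --rrbs --output_dir %s %s %s\n' "$2" "$r1" "$r2"
  done
}

# Hypothetical usage:
# trim_galore_cmds raw_fastq trimmed
```

Leaving out --non_directional here matches the directional library setting this post describes.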

Jupyter Notebook:


Results:
TrimGalore! output folder:
FastQC output folder:
MultiQC output folder:
MultiQC report (HTML):

FastQC – RRBS Geoduck BS-seq FASTQ data

Earlier today I finished trimming Hollie’s RRBS BS-seq FastQ data.

However, the original files were never analyzed with FastQC, so I ran it on the original files.

These libraries were originally created by Hollie Putnam using the TruSeq DNA Methylation Kit (Illumina):

FastQC was run, followed by MultiQC. Analysis was run on roadrunner.

All analysis is documented in a Jupyter Notebook; see link below.

Jupyter Notebook:

Results:
FastQC output folder:
MultiQC output folder:
MultiQC report (HTML):

TrimGalore/FastQC/MultiQC – TrimGalore! RRBS Geoduck BS-seq FASTQ data


20180516 – UPDATE!!

THIS WAS RUN WITH AN INCORRECT TRIMGALORE! SETTING: --non_directional

WILL RE-RUN


Steven requested that I trim the Geoduck RRBS libraries that we have, in preparation to run them through Bismark.

These libraries were originally created by Hollie Putnam using the TruSeq DNA Methylation Kit (Illumina):

All analysis is documented in a Jupyter Notebook; see link below.

Overview of process:

  1. Copy EPI* FastQ files from owl/P_generosa to roadrunner.

  2. Confirm data integrity via MD5 checksums.

  3. Run TrimGalore! with --paired, --rrbs, and --non_directional settings.

  4. Run FastQC and MultiQC on trimmed files.

  5. Copy all data to owl (see Results below for link).

  6. Confirm data integrity via MD5 checksums.
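
The MD5 checks in steps 2 and 6 could look something like the sketch below. It assumes GNU md5sum (as shipped with Ubuntu); the function names and the checksums.md5 manifest name are hypothetical, not taken from the notebook.

```shell
# Record MD5 checksums before a transfer, then verify them afterwards.
# Assumes GNU coreutils md5sum; the manifest name checksums.md5 is a
# hypothetical choice.
write_md5() {
  # $1 = directory of FastQ files; writes checksums.md5 inside it
  ( cd "$1" && md5sum -- *.fastq.gz > checksums.md5 )
}

verify_md5() {
  # $1 = directory holding the files and checksums.md5
  # Exit status is non-zero if any file is missing or altered.
  ( cd "$1" && md5sum --check --quiet checksums.md5 )
}

# Hypothetical usage: write the manifest at the source, copy the whole
# directory (manifest included), then verify at the destination.
# write_md5 /path/to/source_dir
# verify_md5 /path/to/destination_dir && echo "all files intact"
```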

Jupyter Notebook:


Results:
TrimGalore! output folder:
FastQC output folder:
MultiQC output folder:
MultiQC report (HTML):

Read Mapping – Mapping Illumina Data to Geoduck Genome Assemblies with Bowtie2

We have an upcoming meeting with Illumina to discuss how the geoduck genome project is coming along and to decide how we want to proceed.

So, we wanted to get a quick idea of how good our geoduck assemblies are by performing some quick alignments using Bowtie2.

Used the following assemblies as references:

  • sn_ph_01 : SuperNova assembly of 10x Genomics data

  • sparse_03 : SparseAssembler assembly of BGI and Illumina project data

  • pga_02 : Hi-C assembly of Phase Genomics data

The analysis is documented in a Jupyter Notebook.

Jupyter Notebook (GitHub):

NOTE: Due to the large amount of stdout from the first genome index command, the notebook does not render well on GitHub. I recommend downloading the notebook and opening it in a locally installed version of Jupyter.

Here’s a brief overview of the process:

  1. Generate Bowtie2 indexes for each of the genome assemblies.
  2. Map 1,000,000 reads from the following Illumina NovaSeq FastQ files:

Results:

Bowtie2 Genome Indexes:

Bowtie2 sn_ph_01 alignment folder:

Bowtie2 sparse_03 alignment folder:

Bowtie2 pga_02 alignment folder:


MAPPING SUMMARY TABLE

All mapping data was pulled from the respective *.err file in the Bowtie2 alignment folders.

sequence_ID   Assembler               Alignment Rate (%)
sn_ph_01      SuperNova (10x)         79.89
sparse_03     SparseAssembler         85.83
pga_02        Hi-C (Phase Genomics)   79.90
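
Bowtie2 prints its alignment summary to stderr, ending with a line of the form "79.89% overall alignment rate". A small sketch for pulling that number out of a saved *.err file follows; the function name and the paths in the usage comment are hypothetical.

```shell
# Print the overall alignment rate (without the % sign) from a file
# holding Bowtie2's stderr summary.
bowtie2_alignment_rate() {
  # $1 = file containing Bowtie2 stderr output
  awk '/overall alignment rate/ { sub(/%$/, "", $1); print $1 }' "$1"
}

# Hypothetical usage across alignment folders:
# for err in */*.err; do
#   printf '%s\t%s\n' "$err" "$(bowtie2_alignment_rate "$err")"
# done
```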

Mapping efficiency is similar for all assemblies (roughly 80% to 86%). After speaking with Steven, we’ve decided we’ll begin exploring genome annotation pipelines.


BS-seq Mapping – Olympia oyster bisulfite sequencing: TrimGalore > FastQC > Bismark

Steven asked me to evaluate our methylation sequencing data sets for Olympia oyster.

According to our Olympia oyster genome wiki, we have the following two sets of BS-seq data:

All computing was conducted on our Apple Xserve: emu.

All steps were documented in this Jupyter Notebook (GitHub): 20180503_emu_oly_methylation_mapping.ipynb

NOTE: The Jupyter Notebook linked above is too large to render on GitHub. To view it, download it to a computer that can run Jupyter Notebooks.

Here’s a brief overview of what was done.

Samples were trimmed with TrimGalore and then evaluated with FastQC. MultiQC was used to generate a nice visual summary report of all samples.

The Olympia oyster genome assembly, pbjelly_sjw_01, was used as the reference genome and was prepared for use in Bismark:


/home/shared/Bismark-0.19.1/bismark_genome_preparation \
--path_to_bowtie /home/shared/bowtie2-2.3.4.1-linux-x86_64/ \
--verbose /home/sam/data/oly_methylseq/oly_genome/ \
2> 20180507_bismark_genome_prep.err

Bismark was run on trimmed samples with the following command:


/home/shared/Bismark-0.19.1/bismark \
--path_to_bowtie /home/shared/bowtie2-2.3.4.1-linux-x86_64/ \
--genome /home/sam/data/oly_methylseq/oly_genome/ \
-u 1000000 \
-p 16 \
--non_directional \
/home/sam/analyses/20180503_oly_methylseq_trimgalore/1_ATCACG_L001_R1_001_trimmed.fq.gz \
/home/sam/analyses/20180503_oly_methylseq_trimgalore/2_CGATGT_L001_R1_001_trimmed.fq.gz \
/home/sam/analyses/20180503_oly_methylseq_trimgalore/3_TTAGGC_L001_R1_001_trimmed.fq.gz \
/home/sam/analyses/20180503_oly_methylseq_trimgalore/4_TGACCA_L001_R1_001_trimmed.fq.gz \
/home/sam/analyses/20180503_oly_methylseq_trimgalore/5_ACAGTG_L001_R1_001_trimmed.fq.gz \
/home/sam/analyses/20180503_oly_methylseq_trimgalore/6_GCCAAT_L001_R1_001_trimmed.fq.gz \
/home/sam/analyses/20180503_oly_methylseq_trimgalore/7_CAGATC_L001_R1_001_trimmed.fq.gz \
/home/sam/analyses/20180503_oly_methylseq_trimgalore/8_ACTTGA_L001_R1_001_trimmed.fq.gz \
/home/sam/analyses/20180503_oly_methylseq_trimgalore/zr1394_10_s456_trimmed.fq.gz \
/home/sam/analyses/20180503_oly_methylseq_trimgalore/zr1394_11_s456_trimmed.fq.gz \
/home/sam/analyses/20180503_oly_methylseq_trimgalore/zr1394_12_s456_trimmed.fq.gz \
/home/sam/analyses/20180503_oly_methylseq_trimgalore/zr1394_13_s456_trimmed.fq.gz \
/home/sam/analyses/20180503_oly_methylseq_trimgalore/zr1394_14_s456_trimmed.fq.gz \
/home/sam/analyses/20180503_oly_methylseq_trimgalore/zr1394_15_s456_trimmed.fq.gz \
/home/sam/analyses/20180503_oly_methylseq_trimgalore/zr1394_16_s456_trimmed.fq.gz \
/home/sam/analyses/20180503_oly_methylseq_trimgalore/zr1394_17_s456_trimmed.fq.gz \
/home/sam/analyses/20180503_oly_methylseq_trimgalore/zr1394_18_s456_trimmed.fq.gz \
/home/sam/analyses/20180503_oly_methylseq_trimgalore/zr1394_1_s456_trimmed.fq.gz \
/home/sam/analyses/20180503_oly_methylseq_trimgalore/zr1394_2_s456_trimmed.fq.gz \
/home/sam/analyses/20180503_oly_methylseq_trimgalore/zr1394_3_s456_trimmed.fq.gz \
/home/sam/analyses/20180503_oly_methylseq_trimgalore/zr1394_4_s456_trimmed.fq.gz \
/home/sam/analyses/20180503_oly_methylseq_trimgalore/zr1394_5_s456_trimmed.fq.gz \
/home/sam/analyses/20180503_oly_methylseq_trimgalore/zr1394_6_s456_trimmed.fq.gz \
/home/sam/analyses/20180503_oly_methylseq_trimgalore/zr1394_7_s456_trimmed.fq.gz \
/home/sam/analyses/20180503_oly_methylseq_trimgalore/zr1394_8_s456_trimmed.fq.gz \
/home/sam/analyses/20180503_oly_methylseq_trimgalore/zr1394_9_s456_trimmed.fq.gz \
2> 20180507_bismark_02.err
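
The read list above is long enough that a typo would be easy to miss. As a rough sketch (not what was actually run), the same command could be assembled from a glob. The helper name print_bismark_cmd is hypothetical, it assumes every read file in the directory ends in _trimmed.fq.gz, and it only prints the command so it can be inspected before running:

```shell
# Sketch only: print (do not run) a Bismark command whose read list
# comes from a glob. print_bismark_cmd is a hypothetical helper name.
print_bismark_cmd() {
  dir="$1"
  set --                       # positional parameters will hold the read files
  for f in "$dir"/*_trimmed.fq.gz; do
    # An unmatched glob stays literal, so keep only paths that exist.
    if [ -e "$f" ]; then
      set -- "$@" "$f"
    fi
  done
  if [ "$#" -eq 0 ]; then
    echo "No *_trimmed.fq.gz files found in $dir" >&2
    return 1
  fi
  # Options copied from the command above; reads are appended at the end.
  set -- /home/shared/Bismark-0.19.1/bismark \
    --path_to_bowtie /home/shared/bowtie2-2.3.4.1-linux-x86_64/ \
    --genome /home/sam/data/oly_methylseq/oly_genome/ \
    -u 1000000 -p 16 --non_directional \
    "$@"
  printf '%s\n' "$*"
}
```

Calling print_bismark_cmd /home/sam/analyses/20180503_oly_methylseq_trimgalore would print a single-line command that can be checked and then pasted into the shell. The order of the files follows the shell's glob sorting, which depends on locale.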

Results:

TrimGalore output folder:

FastQC output folder:

MultiQC output folder:

MultiQC Report (HTML):

Bismark genome folder: 20180503_oly_genome_pbjelly_sjw_01_bismark/

Bismark output folder:


Whole genome BS-seq (2015)

Prep overview
  • Library prep: Roberts Lab
  • Sequencing: Genewiz

Bismark Report                                            Mapping Percentage
1_ATCACG_L001_R1_001_trimmed_bismark_bt2_SE_report.txt    40.3%
2_CGATGT_L001_R1_001_trimmed_bismark_bt2_SE_report.txt    39.9%
3_TTAGGC_L001_R1_001_trimmed_bismark_bt2_SE_report.txt    40.2%
4_TGACCA_L001_R1_001_trimmed_bismark_bt2_SE_report.txt    40.4%
5_ACAGTG_L001_R1_001_trimmed_bismark_bt2_SE_report.txt    39.9%
6_GCCAAT_L001_R1_001_trimmed_bismark_bt2_SE_report.txt    39.6%
7_CAGATC_L001_R1_001_trimmed_bismark_bt2_SE_report.txt    39.9%
8_ACTTGA_L001_R1_001_trimmed_bismark_bt2_SE_report.txt    39.7%

MBD BS-seq (2015)

Prep overview
  • MBD: Roberts Lab
  • Library prep: ZymoResearch
  • Sequencing: ZymoResearch

Bismark Report                                      Mapping Percentage
zr1394_1_s456_trimmed_bismark_bt2_SE_report.txt     33.0%
zr1394_2_s456_trimmed_bismark_bt2_SE_report.txt     34.1%
zr1394_3_s456_trimmed_bismark_bt2_SE_report.txt     32.5%
zr1394_4_s456_trimmed_bismark_bt2_SE_report.txt     32.8%
zr1394_5_s456_trimmed_bismark_bt2_SE_report.txt     35.2%
zr1394_6_s456_trimmed_bismark_bt2_SE_report.txt     35.5%
zr1394_7_s456_trimmed_bismark_bt2_SE_report.txt     32.8%
zr1394_8_s456_trimmed_bismark_bt2_SE_report.txt     33.0%
zr1394_9_s456_trimmed_bismark_bt2_SE_report.txt     34.7%
zr1394_10_s456_trimmed_bismark_bt2_SE_report.txt    34.9%
zr1394_11_s456_trimmed_bismark_bt2_SE_report.txt    30.5%
zr1394_12_s456_trimmed_bismark_bt2_SE_report.txt    35.8%
zr1394_13_s456_trimmed_bismark_bt2_SE_report.txt    32.5%
zr1394_14_s456_trimmed_bismark_bt2_SE_report.txt    30.8%
zr1394_15_s456_trimmed_bismark_bt2_SE_report.txt    31.3%
zr1394_16_s456_trimmed_bismark_bt2_SE_report.txt    30.7%
zr1394_17_s456_trimmed_bismark_bt2_SE_report.txt    32.4%
zr1394_18_s456_trimmed_bismark_bt2_SE_report.txt    34.9%
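
Mapping percentages like the ones listed here can be pulled out of the report files rather than copied by hand. The following is a minimal sketch, assuming each Bismark report contains a line that starts with "Mapping efficiency:" and ends with the percentage; summarize_mapping is a hypothetical helper name, not part of Bismark:

```shell
# Sketch only: print "report<TAB>percentage" for each report file,
# followed by the mean across all reports given on the command line.
# Assumes each report has a line starting with "Mapping efficiency:".
summarize_mapping() {
  awk '
    /^Mapping efficiency:/ {
      pct = $NF                 # last whitespace-separated field, e.g. "40.3%"
      gsub(/%/, "", pct)        # drop the percent sign
      printf "%s\t%s\n", FILENAME, pct
      sum += pct
      n++
    }
    END {
      if (n > 0) printf "mean\t%.1f\n", sum / n
    }
  ' "$@"
}
```

Run from the Bismark output folder, summarize_mapping *_SE_report.txt would list one line per sample plus a mean, which makes it easier to compare the two library types at a glance.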

Assembly Comparisons – Oly Assemblies Using Quast

I ran Quast to compare all of our current Olympia oyster genome assemblies.

See Jupyter Notebook in Results section for Quast execution.

Results:

Output folder: http://owl.fish.washington.edu/Athaliana/quast_results/results_2018_01_16_10_08_35/

Heatmapped table of results: http://owl.fish.washington.edu/Athaliana/quast_results/results_2018_01_16_10_08_35/report.html

Very enlightening!

After all the difficulties with PB Jelly, it has produced the greatest number of large contigs. However, it also has the highest count and rate of N’s of all the assemblies produced to date.

BEST OF:

Contigs (>=50,000bp): pbjelly_sjw_01 (894)
Largest Contig: redundans_sjw_02 (322,397bp)
Total Length: pbjelly_sjw_01 (1,180,563,613bp)
Total Length (>=50,000bp): pbjelly_sjw_01 (57,741,906bp)
N50: redundans_sjw_03 (17,679bp)

Jupyter Notebook (GitHub): 20180116_swoose_oly_assembly_comparisons_quast.ipynb


Genome Assembly – Olympia Oyster Illumina & PacBio Using PB Jelly w/BGI Scaffold Assembly

After another attempt to fix PB Jelly, I ran it again.

We’ll see how it goes this time…

Re-ran this using the BGI Illumina scaffolds FASTA.

Here’s a brief rundown of how this was run:

See the Jupyter Notebook for full details of run (see Results section below).

Results:

Output folder: http://owl.fish.washington.edu/Athaliana/20171130_oly_pbjelly/

Output FASTA file: http://owl.fish.washington.edu/Athaliana/20171130_oly_pbjelly/jelly.out.fasta

Quast assessment of output FASTA:

Assembly                      jelly.out
# contigs (>= 0 bp)           696946
# contigs (>= 1000 bp)        159429
# contigs (>= 5000 bp)        68750
# contigs (>= 10000 bp)       35320
# contigs (>= 25000 bp)       7048
# contigs (>= 50000 bp)       894
Total length (>= 0 bp)        1253001795
Total length (>= 1000 bp)     1140787867
Total length (>= 5000 bp)     932263178
Total length (>= 10000 bp)    691523275
Total length (>= 25000 bp)    261425921
Total length (>= 50000 bp)    57741906
# contigs                     213264
Largest contig                194507
Total length                  1180563613
GC (%)                        36.57
N50                           12433
N75                           5983
L50                           26241
L75                           60202
# N’s per 100 kbp             6580.58
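
For anyone wanting to sanity-check numbers like N50, L50, and N’s per 100 kbp without rerunning Quast, here is a rough sketch of how those three metrics can be computed from a FASTA file. It is not Quast's code; assembly_stats and its optional minimum-length argument are hypothetical, and it uses the usual definition of N50 (the contig length at which the cumulative length of contigs, sorted longest first, reaches at least half of the total):

```shell
# Sketch only: contig count, total length, N50, L50 and N's per 100 kbp
# for a FASTA file. Usage: assembly_stats assembly.fasta [min_contig_length]
assembly_stats() {
  fasta="$1"
  min_len="${2:-0}"
  awk -v min="$min_len" '
    # Print "length<TAB>N count" for the contig just finished, if long enough.
    function emit() {
      if (len > 0 && len >= min) print len "\t" ns
      len = 0
      ns = 0
    }
    /^>/ { emit(); next }
    {
      len += length($0)
      line = $0
      ns += gsub(/[Nn]/, "", line)   # gsub returns the number of N/n removed
    }
    END { emit() }
  ' "$fasta" | sort -k1,1nr | awk -F'\t' '
    { lens[NR] = $1; total += $1; ns += $2 }
    END {
      if (NR == 0) exit 1
      # Walk contigs from longest to shortest until half the total is covered.
      for (i = 1; i <= NR; i++) {
        cum += lens[i]
        if (cum >= total / 2) { n50 = lens[i]; l50 = i; break }
      }
      printf "contigs\t%d\ntotal_length\t%d\nN50\t%d\nL50\t%d\nNs_per_100kbp\t%.2f\n",
             NR, total, n50, l50, ns * 100000 / total
    }
  '
}
```

The minimum-length argument matters when comparing against Quast, since Quast applies its own contig length cutoff to the summary rows; the values will only line up if the same cutoff is used.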

Have added this assembly to our Olympia oyster genome assemblies table.

This took an insanely long time to complete (nearly six weeks)! After some internet searching, I’ve found a potential solution to this and have initiated another PB Jelly run to see if it will run faster. Regardless, it’ll be interesting to see how the results of two independent PB Jelly runs compare.

Jupyter Notebook (GitHub): 20171130_emu_pbjelly.ipynb
