Tag Archives: igv

Data Analysis – Subset Olympia Oyster GBS Data from BGI as Single Population Using PyRAD

Attempting to get some sort of analysis of the Ostrea lurida GBS data from BGI, particularly since the last run at it using Stacks crashed (literally) and burned (not literally).

Katherine Silliman at UIC recommended using PyRAD.

I’ve taken the example Jupyter notebook from the PyRAD website and passed a subset of the 96 individuals through it.

In this instance, the subset of individuals were all analyzed as a single population. I have another Jupyter notebook running on a different computer that will separate the three populations that are present in this subset.

Overall, I don’t fully understand the results. However, this seems to be the quickest assessment of the data (from the *.snps file generated):

28 individuals, 36424 loci, 72251 snps

Additionally, I did run into an issue when I tried to visualize the data (using the *.vcf file generated) in IGV (see screen cap below). I’ve posted the issue to the pyrad GitHub repo in hopes of getting it resolved.


One last thing. This might be obvious to most, but I discovered that trying to do all this computation over the network (via a mounted server share) is significantly slower than performing these operations on th efiles when they’re stored locally. Somewhere in the notebook you’ll notice that I copy all of the working directory from the server (Owl) to the local machine (Hummingbird). Things proceeded very quickly after doing that. Didn’t realize this would have so much impact on speed!!

Jupyter Notebook: 20160418_pyrad_oly_PE-GBS.ipynb

NBviewer: 20160418_pyrad_oly_PE-GBS


Alternative splicing and germline methylation

One of the premises related to germline methylation in oysters is that genes with limited methylation will have more alternatively spliced products. One of our most robust identification of alternatively spliced genes is from the heat shock experiment, however methylation was assessed with array to ID DMRs. In order to gauge the relationship of germline methylation and alternative splicing I looked at all of our data and made the calls.

I simply looked at the 42 areas determined to be alternatively spliced in the three individuals and qualitatively made the call. For example..


Of the 42 genes, only 11 were determined have limited methylation.

scaffold1391:350297-393525 - High
scaffold1501:189-2280 -  Low-Medium
scaffold1546:22946-41272 - High
scaffold157:93056-102166 - High
scaffold157:288396-298950 - High 
scaffold1583:636568-663717 - High
scaffold1630:57333-61806 - Low
scaffold1643:190932-200778 - Low 
scaffold1670:360106-365501 - High
scaffold1750:71251-77856 - High
scaffold1009:677703-719650 - High
scaffold1009:990592-1008075 - High 
scaffold193:111771-117728 - High 
scaffold198:1032767-1055090 - Medium-High
scaffold198:1084454-1102022 - High
scaffold211:954418-990421 - Medium-High
scaffold351:641567-648889 - Low 
scaffold102:1297353-1322657 - Low-High
scaffold383:99975-117650 - Low-High
scaffold38366:25577-54928 - Low 
scaffold1024:1037898-1043721 - Medium-High
scaffold395:85069-105025 - High
scaffold399:120007-128313 - High
scaffold41228:55316-64219 - Low 
scaffold41858:126164-135361 - High
scaffold42366:124115-157800- High
scaffold42892:55315-57225 - Low 
scaffold42904:154963-177485 - High 
scaffold43208:19552-46201 - High 
scaffold43940:65971-85144 - High
scaffold452:65648-77493 - Low 
scaffold527:16020-64639 - Low 
scaffold588:178668-183438 - High
scaffold616:53220-63262 - Low 
scaffold828:110697-114691 - High
scaffold942:369690-377892 - High
scaffold1160:333463-390372 - Low-High
scaffold1213:118721-121110 - Low 
scaffold128:428337-438047 - Low 
scaffold1282:23471-48121 - Low 
scaffold13:4222-16204 - Low
scaffold1322:265134-304245 - High

The caveat is (there always seems to be one) is this list of alternative spliced genes is based on consistency across individuals, something we would predict NOT to see if this was truly a stochastic phenomenon.


A closer look at DMRs

A detailed look at DMRs that hold true across oysters exposed to heat shock.

Updated July 6, 2015 – added three more DMRs (at bottom of post)

Going down the list, scaffold418_576986 is a feature that overlaps gene EKC36328, Bromodomain-containing protein 8. Specifically the location is in the intron between exon 18 and 19 (total of 20 exons). This gene is differentially expressed, that is expressed at an elevated level following heat shock.

“The precise function of the domain is unclear, but it may be involved in protein-protein interactions and may play a role in assembly or activity of multi-component complexes involved in transcriptional activation [PMID: 7580139].”


Another DMR that is consistent across oysters is located within the intron of Homeobox protein LOX2. Homeobox are transcription factors often associated with developmental processes.


Significant hypomethylation is also present within the intron of Tenascin, a glycoprotein expressed in the extracellular matrix during stress.


We also found a DMR upstream of E3 ubiquitin-protein ligase UHRF1. Interestingly this is a protein that bridges DNA methylation and chromatin modification.

“Specifically recognizes and binds hemimethylated DNA at replication forks via its YDG domain and recruits DNMT1 methyltransferase to ensure faithful propagation of the DNA methylation patterns through DNA replication. In addition to its role in maintenance of DNA methylation, also plays a key role in chromatin modification: through its tudor-like regions and PHD-type zinc fingers, specifically recognizes and binds histone H3 trimethylated at ‘Lys-9′ (H3K9me3) and unmethylated at ‘Arg-2′ (H3R2me0), respectively, and recruits chromatin proteins. Enriched in pericentric heterochromatin where it recruits different chromatin modifiers required for this chromatin replication. Also localizes to euchromatic regions where it negatively regulates transcription possibly by impacting DNA methylation and histone modifications. Has E3 ubiquitin-protein ligase activity by mediating the ubiquitination of target proteins such as histone H3 and PML. It is still unclear how E3 ubiquitin-protein ligase activity is related to its role in chromatin in vivo.” ~http://www.uniprot.org/uniprot/Q96T88

While not classified as a differentially expressed gene, there does appear to be a trend towards increased expression upon heat stress. This occurrence would follow the traditional model where decreased methylation in the promoter region is associated with increased expression.


There were three features identified that are in fact within an intron of Methylcrotonoyl-CoA carboxylase subunit alpha, mitochondrial, and enzyme involved in leucine and isovaleric acid catabolism.


In an another example of a DMR associated with a differentially expressed gene, a DMR that span an intron and exon within Myosin heavy chain, striated muscle. In this case the gene is expressed at a lower level upon heat stress. It is also worth pointing out this gene has very limited methylation overall based on other studies we have done.


Another interesting DMR was found in Methylated-DNA–protein-cysteine methyltransferase.


Within the intron of a Nacrein-like protein is a hypomethylated DMR. This is a negative regulator of calcification in shells of mollusks.


Collagen alpha-1(IV) chain is another gene that contains a hypomethylated DMR. This protein is the major structural component of basement membranes.


The only DMR that is hypermethylated is odd in the fact that annotation was dropped one the data was integrated into Ensembl. This could be related to the fact that the closest blast hit to this gene model is Insertion element IS1 protein insA and transposable element in prokaryotes.



Side track with tracks

Today working on our paper looking at heat stress and DNA methylation I dived deeper into the array data in the search for what should be called a DMR.

As a refresher we have tracks from the core that have 1.8+ fold difference (sig) and complementary tracks where there are three adjacents (3plusAdjacent). I made tracks where I merged the latter into a single feature when within 100bp of each other.

In order to see if there is any consistency across oysters..

#concatenated tracks
> /Users/sr320/git-repos/paper-Temp-stress/ipynb/analyses/mergHYPOcat.bed

#then using bedtools merge features (though first had to sort)
!bedtools sort -i /Users/sr320/git-repos/paper-Temp-stress/ipynb/analyses/mergHYPOcat.bed 
> /Users/sr320/git-repos/paper-Temp-stress/ipynb/analyses/mergHYPOcatsort.bed
!bedtools merge -c 2 -o count 
-i /Users/sr320/git-repos/paper-Temp-stress/ipynb/analyses/mergHYPOcatsort.bed | sort -nrk4

and so on for the hypermethylated region.

end of the AM, left with a new track

scaffold481 576986  578532  -3
scaffold247 141885  142442  -3
scaffold1518    212680  213736  -3
scaffold853 46186   46496   -2
scaffold406 419330  419384  -2
scaffold406 419005  419060  -2
scaffold406 418360  418767  -2
scaffold394 555813  556224  -2
scaffold247 144031  144583  -2
scaffold242 75918   76344   -2
scaffold142 656144  656735  -2
scaffold12  243960  244376  -2
scaffold257 1235165 1235481 +2

Jupyter Notebook

Could also do this on a less conservative approach by acting on (sig) tracks in bedtools


First steps at an aggregated view of all DNA methylation data (updated)

Seems like I have gotten close (see here) but do not have a canonical IGV session that has all of our DNA methylation data. The goal here is to generate such a product (and publish, so I do not lose it).

All data is publicly available at


see also data on Figshare


July 2, 2015 – added Heat Shock experiment alternative splice track
June 26, 2015 – add link to Figshare version
June 26, 2015 – updated Archive.zip
June 26, 2015 – added numerous array tracks from heat stress array experiment including 3+ tracks.
June 26, 2015 – added new track from heat stress – Heat-multi-individual-dmr.bed
June 22, 2015 – updated Archive.zip
June 22, 2015 – updated MBD-seq track gills (no bisulfite treatment) to use unique mapping (see also [this](MBD-seq track gills (no bisulfite treatment))
June 22, 2015 – Updated EE2 linkout to go to Github
June 22, 2015 – Corrected error in labelling EE2 experiment tracks
June 15, 2015 – added MBD-seq track gills (no bisulfite treatment)
June 15, 2015 – added larval pesticide treatment tracks (bisulfite treatment)
June 15, 2015 – new IGV screenshot
June 15, 2015 – added HS-Cuffdiff_geneexp.sig.gtf (differentially expressed genes from heat-shock)




FileID Description Links
Crassostrea_gigas.GCA_000297895.1.26.gtf gtf ftp
MBD-Gill-meth MBD enriched DNA library alignment paper, info
BiGill_CpG_methylation gill methylation 5x (MBD-BS, hi output) paper
BiGill_exon_clc_rpkm Corresponding exon-specific gene expression paper
BiGo_CpG_methylation male gamete methylation 5x (hi output) paper
M1 male gamete methylation 5x preprint
M3 male gamete methylation 5x preprint
T1D3 72hpf larvae from M1 methylation 5x preprint
T1D5 120hpf larvae from M1 methylation 5x preprint
T3D3 72hpf larvae from M3 methylation 5x preprint
T3D5 120hpf larvae from M3 methylation 5x preprint
Heat-multi-individual-dmr.bed Heat Stress (13 locations) common signal notebook
2M_3plusmerge_Hyper.bed merging adj probes to single interval notebook
2M_3plusmerge_Hypo.bed merging adj probes to single interval notebook
4M_3plusmerge_Hyper.bed merging adj probes to single interval notebook
4M_3plusmerge_Hypo.bed merging adj probes to single interval notebook
6M_3plusmerge_Hyper.bed merging adj probes to single interval notebook
6M_3plusmerge_Hypo.bed merging adj probes to single interval notebook
2M_Hyper_3plusAdjactentProbes.gff 3+ adjacent probes notebook
2M_Hypo_3plusAdjactentProbes.gff 3+ adjacent probes notebook
4M_Hyper_3plusAdjactentProbes.gff 3+ adjacent probes notebook
4M_Hypo_3plusAdjactentProbes.gff 3+ adjacent probes notebook
6M_Hyper_3plusAdjactentProbes.gff 3+ adjacent probes notebook
6M_Hypo_3plusAdjactentProbes.gff 3+ adjacent probes notebook
2M_sig Heat stress DMRs (array), ind.#2 notebook, draft
4M_sig Heat stress DMRs (array), ind.#4 notebook, draft
6M_sig Heat stress DMRs (array), ind.#6 notebook, draft
HS-Cuffdiff_geneexp.sig.gtf Heat stress differentially expressed genes notebook
HS-Cuffdiff_altsplice.bed Heat stress alternatively spliced genes notebook
2M.bedgraph.tdf RNA-seq from ind.#2 above – pretreament notebook, draft
4M.bedgraph.tdf RNA-seq from ind.#4 above – pretreament notebook, draft
6M.bedgraph.tdf RNA-seq from ind.#6 above – pretreament notebook, draft
2M-HS.bedgraph.tdf RNA-seq from ind.#2 above – post-heatshock notebook, draft
4M-HS.bedgraph.tdf RNA-seq from ind.#4 above – post-heatshock notebook, draft
6M-HS.bedgraph.tdf RNA-seq from ind.#6 above – post-heatshock notebook, draft
mgaveryDMRs_112212.gff EE2 exposure DMRs (array) paper
A01.smoothed EE2 exposure array data – input versus input paper
A02.smoothed EE2 exposure array data – EE2 vs control paper
A03.smoothed EE2 exposure array data – EE2 vs control (dyeswap) paper
YE_mixHYPER.bed DMRs in pesticide exposed larvae (hypermethylated)
YE_mixHYPO.bed DMRs in pesticide exposed larvae (hypomethylated)
YE_mix_22smCG3x larvae (mix pesticide exposed) methylation
YE_control_22smCG3x larvae (control) methylation


anyone should be able to render this in IGV with this session file:


This work was supported in part by the National Science Foundation (NSF) under Grant Number 1158119 awarded to SR Roberts


Wayback to just-MBD

Prior to bisulfite sequencing we did do a couple of MBD enrichment libraries to describe DNA methylation in oysters. Results even were snuck into this perspective.


While I am sure there are genome tracks around, I am ending up #doingitagain.

In short I took the raw Solid reads, align to Crassostrea_gigas.GCA_000297895.1.26.dna.genome in CLC, exported bam, converted to bedgraph, converted to tdf.

In long:
The raw files

1) Imported into CLC v8.0.1

          Discard read names = Yes
          Discard quality scores = No
          Original resource = /Users/sr320/data-genomic/tentacle/solid0078_20110412_FRAG_BC_WHITE_WHITE_F3_SB_METH/solid0078_20110412_FRAG_BC_WHITE_WHITE_F3_QV_SB_MOTH.qual
          Original resource = /Users/sr320/data-genomic/tentacle/solid0078_20110412_FRAG_BC_WHITE_WHITE_F3_SB_METH/solid0078_20110412_FRAG_BC_WHITE_WHITE_F3_SB_MOTH.csfasta

(yes the core called them MOTH)

2) Reads were mapped


3) Exported as BAM.

4) Converted to bedgraph

-ibam /Users/sr320/data-genomic/tentacle/solid0078_moth.bam 
-g /Volumes/web/halfshell/qdod3/Cg.GCA_000297895.1.25.dna_sm.toplevel.genome 
> /Users/sr320/data-genomic/tentacle/MBD-meth.bedgraph          

5) Converted to toTDF


Rinse and repeat with unmethylated fraction (UNMOTH) and import tdf into IGV!