
BS-seq Mapping – Olympia oyster bisulfite sequencing: Bismark Continued

Previously took the analysis just through the mapping, but didn’t realize Steven wanted me to fully process the data.

So, as an exercise, I followed through with deduplication and sorting of the BAM files.

Then, I ran a quick analysis using methylKit in R. The analysis simply copies what Steven had done with another data set, and I haven’t examined it very thoroughly, so I am not well-versed in what it’s doing and/or why.

Jupyter Notebook (GitHub):

R Studio Project (download the folder, load project in R Studio, and then run the script in the scripts subdirectory to run the analysis):

Will take the full data sets through this whole pipeline.


Computing – An Exercise in Futility

Trying to continue my Oly GBS analysis from the other day and follow along with Katherine Silliman’s notebook.

However, I hit a major snag: I can’t seem to run R in my Jupyter notebook! This is a major pain, since the notebook needs to be able to switch between languages; that’s the beauty of using these notebooks.

Below, is a documentation of my torments today.

Currently, I’m creating a new Docker image that uses the Debian Apt repository to install R version 3.1.1. Going through Apt instead of installing from source (as I had previously done in the Dockerfile) should install all the necessary dependencies for R and resolve some of the error messages I’m seeing.
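The Apt-based approach might look roughly like this in the Dockerfile. This is a sketch only: Debian jessie’s packaged R happens to be 3.1.1, but the base image and package names here are my assumptions, not the actual Dockerfile.

```dockerfile
# Sketch: install R from the Debian package repository instead of from source,
# so Apt pulls in the dependencies automatically. Base image is an assumption.
FROM debian:jessie
RUN apt-get update && \
    apt-get install -y --no-install-recommends r-base r-base-dev && \
    rm -rf /var/lib/apt/lists/*
```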

Otherwise, the last resort will be to use R outside of the notebook and document that process separately.

Anyway, this is the kind of stuff that is immensely time consuming and frustrating that most people don’t realize goes on with all of this computing stuff…

Notebook: 20161129_docker_R_magics_failure.ipynb


Docker – Improving Roberts Lab Reproducibility

In an attempt to further our lab’s ability to maximize reproducibility, I’ve been working on developing an all-encompassing Docker image. Docker is a type of virtual machine (i.e. a self-contained computer that runs within your computer). For the Roberts Lab, the advantage of using Docker is that images can be customized to run a specific suite of software, and these images can then be used by anyone in the lab, regardless of operating system (assuming they can run Docker on their particular operating system). In turn, if everyone is using the same Docker image (i.e. the same virtual machine with all the same software), then we should be able to reproduce data analyses more reliably, since there won’t be differences between the software versions people are using. Additionally, using Docker greatly simplifies the setup of new computers with the requisite software.

I’ve put together a Dockerfile (a Dockerfile is a text file/script that Docker uses to retrieve software and build a computer image with those specific instructions) which will automatically build a Docker image (i.e. virtual computer) that contains all of the normal bioinformatics software our lab uses. This has been a side project while I wait for Stacks analysis to complete (or, fail, depending on the day) and it’s finally usable! The image that is built from this Dockerfile will even let the user run R Studio and/or Jupyter Notebooks in their browser (I’m excited about this part)!

Here’s the current list of software that will be installed:

bedtools 2.25.0
bismark 0.15.0
blast 2.3.0+
bowtie2 2.2.8
bsmap 2.90
cufflinks 2.1.1
fastqc 0.11.5
fastx_toolkit 0.0.13
R 3.2.5
RStudio Server 0.99
pyrad 3.0.66
samtools 0.1.19
stacks 1.40
tophat 2.1.1
trimmomatic 0.36

In order to set this up, you need to install Docker and download the Dockerfile (Dockerfile.bio) I’ve created.

I’ve written a bit of a user guide (specific to this Dockerfile) here to get people started: docker.md

The user guide explains a bit how all of this works and tries to progress from a “basic” this-is-how-to-get-started-with-Docker to an “advanced” description of how to map ports, mount local volumes in your containers, and how to start/attach previously used containers.

The next major goal I have with this Docker project is to get the R kernel installed for Jupyter Notebooks. Currently, the Jupyter Notebook installation is restricted to the default Python 2 kernel.

Additionally, I’d like to improve the usability of the Docker image by setting up aliases in the image. Meaning, a user who wants to use the bowtie program could just type "bowtie". Currently, the user has to type "bowtie2_2.2.8" (although, with this being in the system PATH and tab-completion available, it’s not that big of a deal), which is a bit ugly.
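One low-tech way to get the friendlier names would be symlinks baked into the image. The stub below just demonstrates the mechanism; the paths and the fake versioned binary are assumptions for illustration, not the real image layout.

```shell
# Hypothetical sketch: give a versioned binary an unversioned name via a symlink.
mkdir -p /tmp/demo/bin
printf '#!/bin/sh\necho "bowtie2 version 2.2.8"\n' > /tmp/demo/bin/bowtie2_2.2.8
chmod +x /tmp/demo/bin/bowtie2_2.2.8
ln -sf /tmp/demo/bin/bowtie2_2.2.8 /tmp/demo/bin/bowtie2
/tmp/demo/bin/bowtie2   # the unversioned name runs the versioned binary
```

In the real Dockerfile this would be one `RUN ln -s …` line per tool, pointing at wherever each versioned binary actually lives.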

For some next level stuff, I’d also like to set up all Roberts Lab computers to automatically launch the Docker image when the user opens a terminal. This would greatly simplify things for new lab members. They wouldn’t have to deal with going through the various Docker commands to start a Docker container. Instead, their terminal would just put them directly into the container and the user would be none the wiser. They’d be reproducibly conducting data analysis without even having to think about it.
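The auto-launch idea could be sketched as a few lines appended to each machine’s ~/.bashrc. This is a config-fragment sketch; the image name and the guard variable are assumptions.

```shell
# Hypothetical ~/.bashrc fragment: drop the user into the lab container on
# terminal start, but only when Docker is present and we're not already inside.
if [ -z "$IN_LAB_CONTAINER" ] && command -v docker >/dev/null 2>&1; then
  exec docker run -it -e IN_LAB_CONTAINER=1 robertslab/bioinformatics /bin/bash
fi
```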


Fidalgo offspring at two locations

We carried out whole genome BS-Seq on siblings outplanted at two sites: Fidalgo Bay (home) and Oyster Bay. Four individuals from each locale were examined.

A running description of the data is available @ https://github.com/RobertsLab/project-olympia.oyster-genomic/wiki/Whole-genome-BSseq-December-2015.

To analyze this, I need to look back at a genome. We did some PacBio sequencing a while ago:
– http://nbviewer.jupyter.org/github/sr320/ipython_nb/blob/master/OlyO_PacBio.ipynb

In recap, the fastq file had 47,475 reads: http://owl.fish.washington.edu/halfshell/OlyO_Pat_PacBio_1.fastq

3058 of these reads were >10k bp: http://eagle.fish.washington.edu/cnidarian/OlyO_Pat_PacBio_10k.fa

Those 3058 reads were nicely assembled into 553 contigs: http://eagle.fish.washington.edu/cnidarian/OlyO_Pat_PacBio_10k_contigs.fa

Step forward a bit, and all 47,475 reads were assembled into 5,362 contigs, known as OlyO_Pat_v02.fa: http://owl.fish.washington.edu/halfshell/OlyO_Pat_v02.fa

The latter (v02) was used to map the 8 libraries, roughly getting about 8% of reads mapping.

About 15-fold average coverage.

And with a little filtering:

Note that the awk script filtered for 10x coverage! This could be altered.
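The awk script itself isn’t shown in the post; a hypothetical reconstruction against Bismark’s coverage format (chrom, start, end, %methylation, count methylated, count unmethylated) might look like this, with toy data standing in for the real .cov file:

```shell
# Toy two-locus Bismark-style coverage file (columns 5 and 6 are
# methylated/unmethylated read counts).
printf 'scf1\t10\t10\t80\t8\t2\nscf1\t25\t25\t50\t2\t2\n' > /tmp/demo.cov
# Keep only loci with at least 10x total coverage (the threshold could be altered).
awk '($5 + $6) >= 10' /tmp/demo.cov > /tmp/demo_10x.cov
cat /tmp/demo_10x.cov   # only the first locus (8 + 2 = 10 reads) survives
```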


With BGI Draft Genome

Following the same workflow with the BGIv1 scaffolds >10k bp, about 16% of reads map.

About 3-fold coverage.

Again, making sure there is 10x coverage at a given CG locus, we get:

Much weaker if we allow only 3x coverage at a given CG locus.

and the bit of R code



file.list ‘mkfmt_2_CGATGT.txt’,


hc PCA<-PCASamples(meth)


MarineSPEED: broke git, lost some work

Lost a lot of time today trying to recover files that were lost due to incorrect use of the bfg tool (used for removing big files from git history). I finally decided to restart the process, with another 200 species left to process. Lesson learned: don’t mess with git when you’re in a hurry.

I added a small performance improvement by adding some memoization in accessing raster layer values by cell indexes:


library(R.cache)  # provides setCacheRootPath() and addMemoization()
setCacheRootPath(path="./.Rcache") # put .Rcache directory in the current working dir

get_values_layers <- function(layers, cells) {
  extract(layers, cells)  # body was lost in the export; a plausible reconstruction
}
get_values_layers <- addMemoization(get_values_layers)  # cache results on disk

Next step is to prepare files for downloading.

Last weekend I encountered the following 3 interesting papers:

And also make sure to try out the zoon package if you haven’t already.


MarineSPEED slow data preparation and improving the UI

Distribution records

Only the records for 190 species got processed last night. With another 300+ species to go, performance had to be improved. The bottleneck is looking up the environmental data, so I tried to improve this part of the process.

Old version:

 add_environment <- function(row, data, layers) {
   print("add environment")
   environment <- extract(layers, data[, xycols])
   data <- cbind(data, environment)
   write.csv2(data, paste0("3_environment/", row$ScientificName, " ", row$AphiaID, ".csv"), row.names = FALSE)
 }

New version:

 add_environment <- function(row, data, layers) {
   print("add environment")
   cells <- cellFromXY(layers, data[, lonlat])  # raster cell index per record
   unique_cells <- unique(cells)                # look up each cell only once
   matched_cells <- match(cells, unique_cells)  # map records back to their unique cell
   environment <- extract(layers, unique_cells)
   data <- cbind(data, environment[matched_cells, ])
   write.csv2(data, paste0("3_environment/", row$ScientificName, " ", row$AphiaID, ".csv"), row.names = FALSE)
 }

But it didn’t improve the performance, and might even have slowed everything down; it’s hard to test because there are huge caching effects.
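The trick in the new version is to call extract() only once per unique cell and then expand the results back per record with match(). A tiny self-contained illustration of that lookup pattern (toy values, no raster involved):

```r
# Minimal illustration of the unique/match expansion used in the new version
cells        <- c(7, 3, 7, 9, 3)            # cell index per record (with duplicates)
unique_cells <- unique(cells)               # 7 3 9 -> look these up only once
matched      <- match(cells, unique_cells)  # 1 2 1 3 2 -> position of each record's cell
values       <- c(0.1, 0.2, 0.3)            # pretend extract() returned these
values[matched]                             # 0.1 0.2 0.1 0.3 0.2
```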

I’ve finished writing the species info files with traits from WoRMS. Not sure it will be useful or relevant information to include in modeling, but we’ll see.

Shiny UI

I’ve also been working on the MarineSPEED UI. There were some bugs, which are now fixed, and I’ve added a Leaflet map with species points, but this is rather slow with large numbers of records (45,000 records). So I’ve filtered the records down to fewer than 2,000 points per species, which still gives a good idea of where the points are without slowing down the map.

Useful leaflet help of the day: http://stackoverflow.com/questions/32107667/clearshapes-not-working-leaflet-for-r/32118413

Next week

  • Create a poster for the VLIZ Marine Scientist conference
  • MarineSPEED viewer:
    • add some overview information
    • create 5-fold/10-fold random cross validation sets of the data (filtered with environmental data)
    • add download links for the data
    • deploy the viewer and link marinespeed.org to it
    • add a short about page

Some progress on MarineSPEED.org


Redownloaded all OBIS data.
Prepared lab records in correct format.
Prepared code for (see gist):

  • combining all records: GBIF (hasGeoSpatialIssue == “false”), OBIS (qc=1,2,3,4,5,6), Reef Life Survey, lab data
  • filter duplicates (exact same coordinates)
  • filter rounded coordinates
  • add environmental data with my experimental sdmpredictors package
  • move points without environmental data up to 1 km
  • filter points without environmental data
  • filter duplicate points within the same 25km² Behrmann equal area grid
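The exact-duplicate step in the list above can be sketched in a few lines; the column names and toy data are assumptions, not the actual gist:

```r
# Hypothetical sketch of the exact-duplicate coordinate filter
records <- data.frame(species = c("sp1", "sp1", "sp2"),
                      lon = c(1.5, 1.5, 2.0),
                      lat = c(3.0, 3.0, 4.0))
# drop records whose species + coordinates match an earlier record exactly
dedup <- records[!duplicated(records[, c("species", "lon", "lat")]), ]
nrow(dedup)  # 2: the second sp1 record had the exact same coordinates
```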

Today I learned

  • read.csv accepts gzipped file paths
  • GeoTIFFs don’t support the Mollweide projection, which gives strange results when you have an existing QGIS project. The solution is to set the project’s projection to the one stored in the GeoTIFF.

Good to know

With the R package robis you can download species records with their qc values and the occurrence method has a qc parameter to only download records that pass specific qc flags.

See help(qc) for more information on the different flags.

In case you forgot to filter during the request, you can still filter using the filter_qc function in the gist of the day. The code would have been a lot shorter if there were bitwise AND and OR operators for 64-bit integers.
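For flags that fit in 32 bits, base R’s bitwAnd() does work; a hedged sketch of such a check, under the assumption that qc is a bitmask where bit i−1 set means flag i passed:

```r
# Sketch: test whether a record passes qc flags 1..6 using 32-bit bitwAnd
required <- sum(2^(0:5))                    # bits for flags 1..6 -> mask value 63
passes_qc <- function(qc) bitwAnd(qc, required) == required
passes_qc(63)   # TRUE: all six flag bits set
passes_qc(55)   # FALSE: flag 4 (bit 3) missing
```

Flags above 31 would overflow R’s 32-bit integers, which is presumably why the gist’s filter_qc needed a longer workaround.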

R: Quickly installing and opening multiple packages

Often when doing analyses in R you need to install and load many packages, and it can be tedious to type install.packages() and library() a thousand times. The code below illustrates an efficient way of installing and loading many packages at once. You only need to change vector x, and possibly comment out the second line of code if you have already installed the packages. Efficiency could be improved by not reinstalling packages that are already present, but I have not looked into this yet.

#Insert packages you want to install and open in vector x
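Only the first comment of the snippet survived the export; a minimal completion of the same idea follows. The package names are placeholders, and the install line is shown commented out (per the text, you would comment it out once the packages are installed).

```r
x <- c("stats", "utils")                      # placeholder package names
# install.packages(x)                         # second line: comment out if already installed
loaded <- sapply(x, require, character.only = TRUE)  # attach each package
all(loaded)  # TRUE once every package is attached
```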

Awkward default for the function t.test in R

I was checking the t.test function in R, and found out that by default it assumes unequal variances (the Welch correction). This de facto implies the equal-variance assumption is violated, unless you specify otherwise. The main consequence is that even when the two groups’ variances are essentially equal, the Welch adjustment is still applied, which makes the degrees of freedom deviate from whole numbers and might cause confusion for readers if copied directly into a paper (I have tried some things and it does not seem to affect the results much).

To correct for this, the default argument var.equal=F should be set to TRUE, i.e. var.equal=T. The code below shows the difference in degrees of freedom between the two (directly pasteable into R).

#Check that the t.test function is correct for the independent samples t-test, for two cases
set.seed(1); x1 <- rnorm(10); x2 <- rnorm(10) #example data; x1/x2 were not defined in the post
t1<-t.test(x1,x2,paired=F,var.equal=T) #Beware that default is var.equal=F
t2<-t.test(x1,x2,paired=F,var.equal=F)
#Compare degrees of freedom between var.equal=T and var.equal=F
c(t1$parameter, t2$parameter) #df = 18 exactly vs. a fractional Welch df