Tag Archives: jupyter notebook

FASTQC – Oly BGI GBS Raw Illumina Data

In getting things prepared for the manuscript we’re writing about the Olympia oyster genotype-by-sequencing data from BGI, I felt we needed to provide a FastQC analysis of the raw data (since these two files are what we submitted to the NCBI short read archive) to provide support for the Technical Validation section of the manuscript.

Below, is the Jupyter notebook I used to run the FastQC analysis on the two files. I’ve embedded for quick viewing, but it might be easier to view the notebook via the GitHub link.



Well, I realized that running FastQC on the raw data might not reveal anything all too helpful. The reason for this is that the adaptor and barcode sequences are still present on all the reads. This will lead to over-representation of these sequences in all of the samples, which, in turn, will skew FastQC’s intepretation of the read qualities. For comparison, I’ll run FastQC on the demultiplexed data provided by BGI and see what the FastQC report looks like on trimmed data.

However, I’ll need to discuss with Steven about whether or not providing the FastQC analysis is worthwhile as part of the “technical validation” aspect of the manuscript. I guess it can’t hurt to provide it, but I’m not entirely sure that the FastQC report provides any real information regarding the quality of the sequencing reads that we received…


Jupyter notebook (GitHub): 20170301_docker_fastqc_nondemultiplexed_bgi_oly_gbs.ipynb


Data Received – Jay’s Coral RADseq and Hollie’s Geoduck Epi-RADseq

Jay received notice from UC Berkeley that the sequencing data from his coral RADseq was ready. In addition, the sequencing contains some epiRADseq data from samples provided by Hollie Putnam. See his notebook for multiple links that describe library preparation (indexing and barcodes), sample pooling, and species breakdown.

For quickest reference, here’s Jay’s spreadsheet with virtually all the sample/index/barcode/pooling info (Google Sheet): ddRAD/EpiRAD_Jan_16

I’ve downloaded both the demultiplexed and non-demultiplexed data, verified data integrity by generating and comparing MD5 checksums, copied the files to each of the three species folders on owl/nightingales that were sequenced (Panopea generosa, Anthopleura elegantissima, Porites astreoides), generated and compared MD5 checksums for the files in their directories on owl/nightingales, and created/updated the readme files in each respective folder.


Data management is detailed in the Jupyter notebook below. The notebook is embedded in this post, but it may be easier to view on GitHub (linked below).

Readme files were updated outside of the notebook.

Jupyter notebook (GitHub): 20170227_docker_jay_ngs_data_retrieval.ipynb


Data Management – Geoduck RRBS Data Integrity Verification

Yesterday, I downloaded the Illumina FASTQ files provided by Genewiz for Hollie Putnam’s reduced representation bisulfite geoduck libraries. However, Genewiz had not provided a checksum file at the time.

I received the checksum file from Genewiz and have verified that the data is intact. Verification is described in the Jupyter notebook below.

Data files are located here: owl/web/nightingales/P_generosa

Jupyter notebook (GitHub): 20161230_docker_geoduck_RRBS_md5_checks.ipynb


Data Management – Integrity Check of Final BGI Olympia Oyster & Geoduck Data

After completing the downloads of these files from BGI, I needed to verify that the downloaded copies matched the originals. Below is a Jupyter Notebook detailing how I verified file integrity via MD5 checksums. It also highlights the importance of doing this check when working with large sequencing files (or, just large files in general), as a few of them had mis-matching MD5 checksums!

Although the notebook is embedded below, it might be easier viewing via the notebook link (hosted on GitHub).

At the end of the day, I had to re-download some files, but all the MD5 checksums match and these data are ready for analysis:

Final Ostrea lurida genome files

Final Panopea generosa genome files

Jupyter Notebook: 20161214_docker_BGI_data_integrity_check.ipynb


Data Managment – Trim Output Cells from Jupyter Notebook

Last week I downloaded the final BGI data files and assemblies for Olympia oyster and geoduck genome sequencing projects. However, the output from the download command made the Jupyter Notebook files too large to view and/or upload to GitHub. So, I had to trim the output cells from that notebook in order to render it usable/viewable.

The notebook below details how I did that and also examines the original version of that jumbo notebook to give some idea of what the command outputs were, for posterity.

Jupyter Notebook: 20161214_docker_notebook_trimming.ipynb


Data Management – Download Final BGI Genome & Assembly Files

We received info to download the final data and genome assembly files for geoduck and Olympia oyster from BGI.

In total, the downloads took a little over three days to complete!

The notebook detailing how the files were downloaded is below, but it should be noted that I had to strip the output cells because the output from the download command made the file too large to upload to GitHub, and the size of the notebook file would constantly crash the browser/computer that it was opened in. So, the notebook below is here for posterity.

Jupyter Notebook: 20161206_docker_BGI_genome_downloads.ipynb



Computing – An Excercise in Futility

Trying to continue my Oly GBS analsyis from the other day and follow along with Katherine Silliman’s notebook

However, I hit a major snag: I can’t seem to run R in my Jupyter notebook! This is a major pain, since the notebook needs to be able to switch between languages; that’s the beauty of using these notebooks.

Below, is a documentation of my torments today.

Currently, I’m creating a new Docker image that uses the Debian Apt repository to install R version 3.1.1. Going through Apt instead of installing from source (as I had been previously done in the Dockerfile) should install all the necessary dependencies for R and get resolve some of the error messages I’m seeing.

Otherwise, the last resort will be to use R outside of the notebook and document that process separately.

Anyway, this is the kind of stuff that is immensely time consuming and frustrating that most people don’t realize goes on with all of this computing stuff…

Notebook: 20161129_docker_R_magics_failure.ipynb


Data Analysis – Initial O.lurida Fst Determination from GBS Data

Finally running some analysis on the output from my PyRad analysison 20160727.

I’m following Katherine Silliman’s Jupyter notebook (2bRAD Subset Population Structure Analysis.ipynb) as a guide.

The initial analysis (which isn’t much) is in the Jupyter notebook below. The analysis will be continued on a later date.

Jupyter notebook: 20161117_docker_oly_vcf_analysis.ipynb

I’ve embedded the notebook below, but it’s much easier to view (there are many lengthy commands/filenames that wrap lines in the embedded version below) the actual file linked above.


Data Management – Tracking O.lurida FASTQ File Corruption

UPDATE 20170104 – These two corrupt files have been replaced with non-corrupt files.


Sean identified an issue with one of the original FASTQ files provided to use by BGI. Additionally, Steven had (unknowingly) identified the same corrupt file, as well as a second corrupt file in the set of FASTQ files. The issue is discussed here: https://github.com/sr320/LabDocs/issues/334

Steven noticed the two files when he ran the program FASTQC and two files generated no output (but no error message!).

The two files in question are:

  • 151118_I137_FCH3KNJBBXX_L5_wHAXPI023905-96_1.fq.gz
  • 151114_I191_FCH3Y35BCXX_L2_wHAMPI023991-66_2.fq.gz

This post is an attempt to document where things went wrong, but having glanced through this data a bit already, it won’t provide any answers.

I originally downloaded the data on 20160127 to my home folder on Owl (this is detailed in the Jupyter notebook in that post) and generated/compared MD5 checksum values. The values matched at that time.

So, let’s investigate a bit further…

Launch Docker container

docker run - p 8888:8888 -v /Users/sam/data/:/data -v /Users/sam/owl_home/:/owl_home -v /Users/sam/owl_web/:owl_web -v /Users/sam/gitrepos/LabDocs/jupyter_nbs/sam/:/jupyter_nbs -it 0ba43904567e

The command allows access to Jupyter Notebook over port 8888 and makes my Jupyter Notebook GitHub repo and my data files accessible to the Docker container.

Once the container was started, started Jupyter Notebook with the following command inside the Docker container:

jupyter notebook

This command is configured in the Docker container to launch a Jupyter Notebook without a browser on port 8888.

Jupyter notebook file: 20161117_docker_oly_genome_fastq_corruption.ipynb

I’ve embedded the notebook below, but it’s much easier to view (there are many lengthy commands/filenames that wrap lines in the embedded version below) the actual file linked above.


Data Management – Geoduck Small Insert Library Genome Assembly from BGI

Received another set of Panopea generosa genome assembly data from BGI back in May! I neglected to create MD5 checksums, as well as a readme file for this data set! Of course, I needed some of the info that the readme file should’ve had and it wasn’t there. So, here’s the skinny…

It’s data assembled from the small insert libraries they created for this project.

All data is stored here: http://owl.fish.washington.edu/P_generosa_genome_assemblies_BGI/20160512/

They’ve provided a Genome Survey (PDF) that has some info about the data they’ve assembled. In it, is the estimated genome size:

Geoduck genome size: 2972.9 Mb

Additionally, there’s a table breaking down the N50 distributions of scaffold and contig sizes.

Data management stuff was performed in a Jupyter (iPython) notebook; see below.

Jupyter Notebook: 20161025_Pgenerosa_Small_Library_Genome_Read_Counts.ipynb