Tag Archives: BGI

Data Management – Integrity Check of Final BGI Olympia Oyster & Geoduck Data

After completing the downloads of these files from BGI, I needed to verify that the downloaded copies matched the originals. Below is a Jupyter Notebook detailing how I verified file integrity via MD5 checksums. It also highlights the importance of doing this check when working with large sequencing files (or, just large files in general), as a few of them had mis-matching MD5 checksums!

Although the notebook is embedded below, it might be easier viewing via the notebook link (hosted on GitHub).

At the end of the day, I had to re-download some files, but all the MD5 checksums match and these data are ready for analysis:

Final Ostrea lurida genome files

Final Panopea generosa genome files

Jupyter Notebook: 20161214_docker_BGI_data_integrity_check.ipynb


Data Management – Download Final BGI Genome & Assembly Files

We received info to download the final data and genome assembly files for geoduck and Olympia oyster from BGI.

In total, the downloads took a little over three days to complete!

The notebook detailing how the files were downloaded is below, but it should be noted that I had to strip the output cells because the output from the download command made the file too large to upload to GitHub, and the size of the notebook file would constantly crash the browser/computer that it was opened in. So, the notebook below is here for posterity.

Jupyter Notebook: 20161206_docker_BGI_genome_downloads.ipynb



Data Management – Tracking O.lurida FASTQ File Corruption

UPDATE 20170104 – These two corrupt files have been replaced with non-corrupt files.


Sean identified an issue with one of the original FASTQ files provided to use by BGI. Additionally, Steven had (unknowingly) identified the same corrupt file, as well as a second corrupt file in the set of FASTQ files. The issue is discussed here: https://github.com/sr320/LabDocs/issues/334

Steven noticed the two files when he ran the program FASTQC and two files generated no output (but no error message!).

The two files in question are:

  • 151118_I137_FCH3KNJBBXX_L5_wHAXPI023905-96_1.fq.gz
  • 151114_I191_FCH3Y35BCXX_L2_wHAMPI023991-66_2.fq.gz

This post is an attempt to document where things went wrong, but having glanced through this data a bit already, it won’t provide any answers.

I originally downloaded the data on 20160127 to my home folder on Owl (this is detailed in the Jupyter notebook in that post) and generated/compared MD5 checksum values. The values matched at that time.

So, let’s investigate a bit further…

Launch Docker container

docker run - p 8888:8888 -v /Users/sam/data/:/data -v /Users/sam/owl_home/:/owl_home -v /Users/sam/owl_web/:owl_web -v /Users/sam/gitrepos/LabDocs/jupyter_nbs/sam/:/jupyter_nbs -it 0ba43904567e

The command allows access to Jupyter Notebook over port 8888 and makes my Jupyter Notebook GitHub repo and my data files accessible to the Docker container.

Once the container was started, started Jupyter Notebook with the following command inside the Docker container:

jupyter notebook

This command is configured in the Docker container to launch a Jupyter Notebook without a browser on port 8888.

Jupyter notebook file: 20161117_docker_oly_genome_fastq_corruption.ipynb

I’ve embedded the notebook below, but it’s much easier to view (there are many lengthy commands/filenames that wrap lines in the embedded version below) the actual file linked above.


Data Management – Geoduck Small Insert Library Genome Assembly from BGI

Received another set of Panopea generosa genome assembly data from BGI back in May! I neglected to create MD5 checksums, as well as a readme file for this data set! Of course, I needed some of the info that the readme file should’ve had and it wasn’t there. So, here’s the skinny…

It’s data assembled from the small insert libraries they created for this project.

All data is stored here: http://owl.fish.washington.edu/P_generosa_genome_assemblies_BGI/20160512/

They’ve provided a Genome Survey (PDF) that has some info about the data they’ve assembled. In it, is the estimated genome size:

Geoduck genome size: 2972.9 Mb

Additionally, there’s a table breaking down the N50 distributions of scaffold and contig sizes.

Data management stuff was performed in a Jupyter (iPython) notebook; see below.

Jupyter Notebook: 20161025_Pgenerosa_Small_Library_Genome_Read_Counts.ipynb


Data Analysis – Oly GBS Data Using Stacks 1.37

This analysis ran (or, more properly, was attempted) for a couple of weeks and failed a few times. The failures seemed to be linked to the external hard drive I was reading/writing data to. It continually locked up, leading to “Segmentation fault” errors.

We’ve replaced the external with a different one in hopes that it’ll be able to handle the workload. Will be attempting to re-run Stacks with the new external hard drive.

I’m posting the Jupyter notebook here for posterity.

Jupyter notebook: 20160428_Oly_GBS_STACKS.ipynb


Data Management – Olympia Oyster Small Insert Library Genome Assembly from BGI

Received another set of Ostrea lurida genome assembly data from BGI. In this case, it’s data assembled from the small insert libraries they created for this project.

All data is stored here: http://owl.fish.washington.edu/O_lurida_genome_assemblies_BGI/20160512/

They’ve provided a Genome Survey (PDF) that has some info about the data they’ve assembled. In it, is the estimated genome size:

Olympia oyster genome size: 1898.92 Mb

Additionally, there’s a table breaking down the N50 distributions of scaffold and contig sizes.

Data management stuff was performed in a Jupyter (iPython) notebook; see below.

Jupyter Notebook: 20160516_Oly_Small_Insert_Library_Genome_Read_Counts.ipynb



Data Analysis – Oly GBS Data from BGI Using Stacks

UPDATE (20160418) : I’m posting this more for posterity, as Stacks continually locked up at both the “ustacks” and “cstacks” stages. These processes would take days to run (on the full 96 samples) and then the processes would become “stuck” (viewed via the top command in OS X).

Have moved on to trying PyRAD in the meantime.

Need to get the GBS from BGI data analyzed.

Installed Stacks (and its dependencies on Hummingbird earlier today).

Below is the Jupyter (iPython) notebook I ran to perform this analysis.

Jupyter (iPython) Notebook: 20160406_Oly_GBS_STACKS.ipynb

Jupyter Notebook Viewer: 20160406_Oly_GBS_STACKS


Fidalgo offspring at two locations

We carried out whole genome BS-Seq on siblings outplanted out at two sites: Fidalgo Bay (home) and Oyster Bay. Four individuals from each locale were examined.

A running description of the data is available @ https://github.com/RobertsLab/project-olympia.oyster-genomic/wiki/Whole-genome-BSseq-December-2015.

I need to look back to a genome to analyze this. We did some PacBio sequencing a while ago.
– http://nbviewer.jupyter.org/github/sr320/ipython_nb/blob/master/OlyO_PacBio.ipynb

In recap, the fastq file had 47,475 reads: http://owl.fish.washington.edu/halfshell/OlyO_Pat_PacBio_1.fastq

3058 of these reads were >10k bp: http://eagle.fish.washington.edu/cnidarian/OlyO_Pat_PacBio_10k.fa

Those 3058 reads were nicely assembled into 553 contigs: http://eagle.fish.washington.edu/cnidarian/OlyO_Pat_PacBio_10k_contigs.fa

Step forward a bit and all 47475 reads were assembled into the 5362 contigs known as OlyO_Pat_v02.fa http://owl.fish.washington.edu/halfshell/OlyO_Pat_v02.fa

The latter (v02) was used to map the 8 libraries. Roughly getting about 8% mapping

About 15 fold average coverage

And with a little filtering

Note that awk script filtered for 10x coverage! this could be altered.

and R have an intriguing relationship

With BGI Draft Genome

Following the same workflow with the BGIv1 scaffolds >10k bp have about 16% or reads map.

3 fold coverage

again, making sure there is 10x coverage at a given CG loci
we get

Much weaker if we allow only 3x coverage at a given CG loci

and the bit of R code



file.list ‘mkfmt_2_CGATGT.txt’,


hc PCA<-PCASamples(meth)


Data Management – O. lurida genotype-by-sequencing (GBS) data from BGI

We received a hard drive from BGI on 20160223 (while I was out on paternity leave) containing the Ostrea lurida GBS data.

Briefly, three sets (i.e. populations) of Olympia oyster tissue was collected from oysters raised in Oyster Bay and were sent to BGI for DNA extraction and GBS. A total of 23 individuals from each of the following three populations were sequenced (a grand total of 96 samples):

  • 1HL – (Hood Canal, Long Spit)
  • 1NF – (North Sound, Fidalgo Bay)
  • 1SN – (South Sound, Oyster Bay)

An overview of this project can be viewed on our GitHub Olympia oyster wiki.

Data was copied from the HDD to the following location on Owl (our server): http://owl.fish.washington.edu/nightingales/O_lurida/20160223_gbs/

The data was generated from paired-end Illumina sequencing, so there are two FASTQ files for each individual.

The files were analyzed to create a MD5 checksum, perform read counts, and create a readme (markdown format) file. This was performed in a Jupyter/iPython notebook (see below).

IMPORTANT NOTE: The directory where this data is housed was renamed AFTER the Jupyter notebook was run. As such, the directory listed above will not be seen in the Jupyter notebook.

Jupyter notebook file: 20160314_Olurida_GBS_data_management.ipynb

Notebook Viewer: 20160314_Olurida_GBS_data_management.ipynb


Data Received – Initial Geoduck Genome Assembly from BGI

The initial assembly of the Ostrea lurida genome is available from BGI. Currently, we’ve stashed it here:


The data provided consisted of the following three files:

  • md5.txt
  • N50.txt
  • scaffold.fa.fill

md5.txt – Checksum file to verify integrity of files after downloading.

N50.txt – Contains some very limited stats on scaffolds provided.

scaffold.fa.fill – A FASTA file of scaffolds. Since these are scaffolds (and NOT contigs!), there are many regions containing NNNNNN’s that have been put in place for scaffold assembly based on paired-end spatial information. As such, the N50 information is not as useful as it would be if these were contigs.

Additional assemblies will be provided at some point. I’ve emailed BGI about what we should expect from this initial assembly and what subsequent assemblies should look like.