Category Archives: Olympia Oyster Genome Sequencing

Olympia Oyster Genome Sequencing

Data Management – Integrity Check of Final BGI Olympia Oyster & Geoduck Data

After completing the downloads of these files from BGI, I needed to verify that the downloaded copies matched the originals. Below is a Jupyter Notebook detailing how I verified file integrity via MD5 checksums. It also highlights the importance of doing this check when working with large sequencing files (or, just large files in general), as a few of them had mis-matching MD5 checksums!

Although the notebook is embedded below, it might be easier viewing via the notebook link (hosted on GitHub).

At the end of the day, I had to re-download some files, but all the MD5 checksums match and these data are ready for analysis:

Final Ostrea lurida genome files

Final Panopea generosa genome files

Jupyter Notebook: 20161214_docker_BGI_data_integrity_check.ipynb


DNA Isolation – Ostrea lurida DNA for PacBio Sequencing

In an attempt to improve upon the partial genome assembly we received from BGI, we will be sending DNA to the UW PacBio core facility for additional sequencing.

Isolated DNA from mantle tissue from the same Ostrea lurida individual used for the BGI sequencing efforts. Tissue was collected by Brent & Steven on 20150812.

Used the E.Z.N.A. Mollusc Kit (Omega) to isolate DNA from two separate 50mg pieces of mantle tissue according to the manufacturer’s protocol, with the following changes:

  • Samples were homogenized with plastic, disposable pestle in 350μL of ML1 Buffer
  • Incubated homogenate at 60C for 1.5hrs
  • No optional steps were used
  • Performed three rounds of 24:1 chloroform:IAA treatment
  • Eluted each in 50μL of Elution Buffer and pooled into a single sample

Quantified the DNA using the Qubit dsDNA BR Kit (Invitrogen). Used 1μL of DNA sample.

Concentration = 326ng/μL (Quant data is here [Google Sheet]: 20161214_gDNA_Olurida_qubit_quant

Yield is good and we have more than enough (~5μg is required for sequencing) to proceed with sequencing.

Evaluated gDNA quality (i.e. integrity) by running ~500ng (1.5μL) of sample on 0.8% agarose, low-TAE gel stained with ethidium bromide.

Used 5μL of O’GeneRuler DNA Ladder Mix (ThermoFisher).




Overall, the gel looks OK. A fair amount of smearing, but a strong, high molecular weight band is present. The intensity of the smearing is likely due to the fact that the gel is overloaded for this particular well size. If I had used a broader comb and/or loaded less DNA, the band would be more defined and the smearing would be less prominent.

Will submit sample to the UW PacBio facility tomorrow!


Data Management – Download Final BGI Genome & Assembly Files

We received info to download the final data and genome assembly files for geoduck and Olympia oyster from BGI.

In total, the downloads took a little over three days to complete!

The notebook detailing how the files were downloaded is below, but it should be noted that I had to strip the output cells because the output from the download command made the file too large to upload to GitHub, and the size of the notebook file would constantly crash the browser/computer that it was opened in. So, the notebook below is here for posterity.

Jupyter Notebook: 20161206_docker_BGI_genome_downloads.ipynb



Data Management – Tracking O.lurida FASTQ File Corruption

UPDATE 20170104 – These two corrupt files have been replaced with non-corrupt files.


Sean identified an issue with one of the original FASTQ files provided to use by BGI. Additionally, Steven had (unknowingly) identified the same corrupt file, as well as a second corrupt file in the set of FASTQ files. The issue is discussed here:

Steven noticed the two files when he ran the program FASTQC and two files generated no output (but no error message!).

The two files in question are:

  • 151118_I137_FCH3KNJBBXX_L5_wHAXPI023905-96_1.fq.gz
  • 151114_I191_FCH3Y35BCXX_L2_wHAMPI023991-66_2.fq.gz

This post is an attempt to document where things went wrong, but having glanced through this data a bit already, it won’t provide any answers.

I originally downloaded the data on 20160127 to my home folder on Owl (this is detailed in the Jupyter notebook in that post) and generated/compared MD5 checksum values. The values matched at that time.

So, let’s investigate a bit further…

Launch Docker container

docker run - p 8888:8888 -v /Users/sam/data/:/data -v /Users/sam/owl_home/:/owl_home -v /Users/sam/owl_web/:owl_web -v /Users/sam/gitrepos/LabDocs/jupyter_nbs/sam/:/jupyter_nbs -it 0ba43904567e

The command allows access to Jupyter Notebook over port 8888 and makes my Jupyter Notebook GitHub repo and my data files accessible to the Docker container.

Once the container was started, started Jupyter Notebook with the following command inside the Docker container:

jupyter notebook

This command is configured in the Docker container to launch a Jupyter Notebook without a browser on port 8888.

Jupyter notebook file: 20161117_docker_oly_genome_fastq_corruption.ipynb

I’ve embedded the notebook below, but it’s much easier to view (there are many lengthy commands/filenames that wrap lines in the embedded version below) the actual file linked above.


Data Management – Olympia Oyster Small Insert Library Genome Assembly from BGI

Received another set of Ostrea lurida genome assembly data from BGI. In this case, it’s data assembled from the small insert libraries they created for this project.

All data is stored here:

They’ve provided a Genome Survey (PDF) that has some info about the data they’ve assembled. In it, is the estimated genome size:

Olympia oyster genome size: 1898.92 Mb

Additionally, there’s a table breaking down the N50 distributions of scaffold and contig sizes.

Data management stuff was performed in a Jupyter (iPython) notebook; see below.

Jupyter Notebook: 20160516_Oly_Small_Insert_Library_Genome_Read_Counts.ipynb



SRA Submission – Genome sequencing of the Olympia oyster (Ostrea lurida)

Adding our Olympia oyster genome sequencing (sequencing done by BGI) to the NCBI Sequence Read Archive (SRS). The current status can be seen in the screen cap below. Release date is set for a year from now, but will likely bump it up. Need Steven to review the details of the submission (BioProject, Experiment descriptions, etc.) before I initiate the public release. Will update this post with the SRA number once we receive it.

Here’s the list of files uploaded to the SRA:


Mate pair sequencing files were uploaded together within a single “Run”.


Data Received – Initial Olympia oyster Genome Assembly from BGI

The initial assembly of the Ostrea lurida genome is available from BGI. Currently, we’ve stashed it here:

The data provided consisted of the following three files:

  • md5.txt
  • N50.txt
  • scaffold.fa.fill

md5.txt – Checksum file to verify integrity of files after downloading.

N50.txt – Contains some very limited stats on scaffolds provided.

scaffold.fa.fill – A FASTA file of scaffolds. Since these are scaffolds (and NOT contigs!), there are many regions containing NNNNNN’s that have been put in place for scaffold assembly based on paired-end spatial information. As such, the N50 information is not as useful as it would be if these were contigs.

Additional assemblies will be provided at some point. I’ve emailed BGI about what we should expect from this initial assembly and what subsequent assemblies should look like.


Data Received – Ostrea lurida genome sequencing files from BGI

Downloaded data from the BGI project portal to our server, Owl, using the Synology Download Station. Although the BGI portal is aesthetically nice, it’s set up poorly for bulk downloads and took a few tries to download all of the files.

Data integrity was assessed and read counts for each file were generated. The files were moved to their permanent storage location on Owl:

The file was updated to include project/file information.

The file manipulations were performed in a Jupyter notebook (see below).


Total reads generated for this project: 1,225,964,680

BGI provided us with the raw data files for us to play around with, but they are also currently in the process of performing the genome assembly.


Jupyter Notebook file: 20160126_Olurida_BGI_data_handling.ipynb

Notebook Viewer: 20160126_Olurida_BGI_data_handling.ipynb