That’s 20 directories and 127 files – for a single SMRT cell!
Granted, there is the familiar FASTQ file (filtered_subreads.fastq), which is what will likely be used for downstream analysis, but it’s hard to make a decision on how we manage this data under the guidelines of our current DMP. We might separate the data files from the numerous other files (which are, essentially, metadata), but we first need to decide which file type(s) (e.g. .h5 files, .fastq files) will serve as the data files people rely on for analysis. So, for the time being, this is how the data will be stored.
I’ll update the readme file to reflect the addition of the top level folders (e.g. ../20170323_pacbio/170210_PCB-CC_MS_EEE_20kb_P6v2_D01_1/).
After completing the downloads of these files from BGI, I needed to verify that the downloaded copies matched the originals. Below is a Jupyter Notebook detailing how I verified file integrity via MD5 checksums. It also highlights the importance of doing this check when working with large sequencing files (or just large files in general), as a few of them had mismatched MD5 checksums!
Although the notebook is embedded below, it might be easier viewing via the notebook link (hosted on GitHub).
At the end of the day, I had to re-download some files, but all the MD5 checksums match and these data are ready for analysis:
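In general, the verification in that notebook boils down to hashing each local file and comparing against the checksums BGI supplied. Here is a minimal sketch of that idea; the `md5sum` and `verify` helper names are my own (hypothetical), and it assumes the standard `md5.txt` format of one `<checksum>  <filename>` pair per line:

```python
import hashlib
from pathlib import Path

def md5sum(path, chunk_size=1024 * 1024):
    """Compute the MD5 checksum of a file, reading in chunks (large files!)."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify(md5_file):
    """Compare checksums listed in an md5.txt-style file against local copies.

    Each line is expected to look like: <checksum>  <filename>
    Returns the list of filenames whose checksums do NOT match,
    i.e. the files that need to be re-downloaded.
    """
    mismatches = []
    base = Path(md5_file).parent
    for line in Path(md5_file).read_text().splitlines():
        if not line.strip():
            continue
        expected, name = line.split(maxsplit=1)
        if md5sum(base / name.strip()) != expected:
            mismatches.append(name.strip())
    return mismatches
```

Any filenames returned by `verify()` are the ones to pull down from BGI again before trusting them for analysis.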
Yield is good and we have more than enough (~5μg is required for sequencing) to proceed with sequencing.
Evaluated gDNA quality (i.e. integrity) by running ~500ng (1.5μL) of sample on 0.8% agarose, low-TAE gel stained with ethidium bromide.
Used 5μL of O’GeneRuler DNA Ladder Mix (ThermoFisher).
Overall, the gel looks OK. There is a fair amount of smearing, but a strong, high-molecular-weight band is present. The intensity of the smearing is likely because the gel is overloaded for this particular well size; if I had used a wider comb and/or loaded less DNA, the band would be more defined and the smearing less prominent.
We received info to download the final data and genome assembly files for geoduck and Olympia oyster from BGI.
In total, the downloads took a little over three days to complete!
The notebook detailing how the files were downloaded is below. Note that I had to strip the output cells: the output from the download command made the file too large to upload to GitHub, and the notebook's size would constantly crash whatever browser/computer it was opened in. So, the notebook below is here for posterity.
Sean identified an issue with one of the original FASTQ files provided to us by BGI. Additionally, Steven had (unknowingly) identified the same corrupt file, as well as a second corrupt file in the set of FASTQ files. The issue is discussed here: https://github.com/sr320/LabDocs/issues/334
Steven noticed the two files when he ran FastQC and those two files generated no output (but no error message!).
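A quick way to catch this kind of silent corruption, before handing files to a downstream tool, is to simply try decompressing them end to end. A minimal sketch, assuming gzip-compressed FASTQ files (the `gzip_intact` helper name is my own):

```python
import gzip

def gzip_intact(path, chunk_size=1024 * 1024):
    """Return True if a gzipped file decompresses end-to-end without error.

    A truncated or corrupt download typically raises EOFError (stream ends
    before the end-of-stream marker) or an OSError (bad gzip data / CRC)
    partway through decompression.
    """
    try:
        with gzip.open(path, "rb") as handle:
            while handle.read(chunk_size):
                pass
        return True
    except (EOFError, OSError):
        return False
```

Running this over every downloaded file flags the corrupt ones explicitly, rather than relying on a tool quietly producing no output.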
Adding our Olympia oyster genome sequencing data (sequencing done by BGI) to the NCBI Sequence Read Archive (SRA). The current status can be seen in the screen cap below. The release date is set for a year from now, but we will likely move it up. Need Steven to review the details of the submission (BioProject, Experiment descriptions, etc.) before I initiate the public release. Will update this post with the SRA number once we receive it.
The data provided consisted of the following three files:

- md5.txt – Checksum file to verify integrity of files after downloading.
- N50.txt – Contains some very limited stats on the scaffolds provided.
- scaffold.fa.fill – A FASTA file of scaffolds. Since these are scaffolds (and NOT contigs!), there are many regions containing NNNNNN’s that have been put in place for scaffold assembly based on paired-end spatial information. As such, the N50 information is not as useful as it would be if these were contigs.
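To see how much the N-padding inflates things, it's straightforward to recompute N50 and count the N-gap runs directly from the FASTA. A minimal sketch (the helper names `fasta_records`, `n50`, and `gap_count` are my own, not from the BGI deliverables):

```python
import re

def fasta_records(path):
    """Yield (name, sequence) tuples from a FASTA file."""
    name, seq = None, []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                if name is not None:
                    yield name, "".join(seq)
                name, seq = line[1:].split()[0], []
            elif line:
                seq.append(line)
        if name is not None:
            yield name, "".join(seq)

def n50(lengths):
    """Smallest length L such that sequences >= L hold half the total bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length

def gap_count(seq):
    """Number of runs of N's (the padding inserted between joined contigs)."""
    return len(re.findall(r"[Nn]+", seq))
```

Comparing the scaffold N50 with and without the N-padding gives a rough sense of how optimistic the reported contiguity stats are.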
Additional assemblies will be provided at some point. I’ve emailed BGI about what we should expect from this initial assembly and what subsequent assemblies should look like.
Downloaded data from the BGI project portal to our server, Owl, using the Synology Download Station. Although the BGI portal is aesthetically nice, it’s set up poorly for bulk downloads and took a few tries to download all of the files.
Data integrity was assessed and read counts for each file were generated. The files were moved to their permanent storage location on Owl: http://owl.fish.washington.edu/nightingales/O_lurida
The readme.md file was updated to include project/file information.
The file manipulations were performed in a Jupyter notebook (see below).
Total reads generated for this project: 1,225,964,680
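The per-file read counts summed above come down to counting FASTQ records (four lines per read). A minimal sketch of that tally; the `count_reads` helper name is my own, and the gzip handling is an assumption about how the files arrive:

```python
import gzip

def count_reads(path):
    """Count reads in a FASTQ file (4 lines per record).

    Opens plain or gzip-compressed files based on the filename, and
    raises if the line count suggests a truncated/corrupt file.
    """
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        lines = sum(1 for _ in handle)
    if lines % 4 != 0:
        raise ValueError(f"{path}: line count {lines} is not a multiple of 4")
    return lines // 4
```

Summing `count_reads()` over each file in the project directory reproduces the project total.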
BGI provided us with the raw data files to explore, and they are also currently in the process of performing the genome assembly.