Tag Archives: checksum

Data Management – Integrity Check of Final BGI Olympia Oyster & Geoduck Data

After completing the downloads of these files from BGI, I needed to verify that the downloaded copies matched the originals. Below is a Jupyter Notebook detailing how I verified file integrity via MD5 checksums. It also highlights the importance of doing this check when working with large sequencing files (or, just large files in general), as a few of them had mis-matching MD5 checksums!

Although the notebook is embedded below, it might be easier viewing via the notebook link (hosted on GitHub).

At the end of the day, I had to re-download some files, but all the MD5 checksums match and these data are ready for analysis:

Final Ostrea lurida genome files

Final Panopea generosa genome files

Jupyter Notebook: 20161214_docker_BGI_data_integrity_check.ipynb

Share

Data Management – Tracking O.lurida FASTQ File Corruption

UPDATE 20170104 – These two corrupt files have been replaced with non-corrupt files.


 

Sean identified an issue with one of the original FASTQ files provided to use by BGI. Additionally, Steven had (unknowingly) identified the same corrupt file, as well as a second corrupt file in the set of FASTQ files. The issue is discussed here: https://github.com/sr320/LabDocs/issues/334

Steven noticed the two files when he ran the program FASTQC and two files generated no output (but no error message!).

The two files in question are:

  • 151118_I137_FCH3KNJBBXX_L5_wHAXPI023905-96_1.fq.gz
  • 151114_I191_FCH3Y35BCXX_L2_wHAMPI023991-66_2.fq.gz

This post is an attempt to document where things went wrong, but having glanced through this data a bit already, it won’t provide any answers.

I originally downloaded the data on 20160127 to my home folder on Owl (this is detailed in the Jupyter notebook in that post) and generated/compared MD5 checksum values. The values matched at that time.

So, let’s investigate a bit further…

Launch Docker container

docker run - p 8888:8888 -v /Users/sam/data/:/data -v /Users/sam/owl_home/:/owl_home -v /Users/sam/owl_web/:owl_web -v /Users/sam/gitrepos/LabDocs/jupyter_nbs/sam/:/jupyter_nbs -it 0ba43904567e

The command allows access to Jupyter Notebook over port 8888 and makes my Jupyter Notebook GitHub repo and my data files accessible to the Docker container.

Once the container was started, started Jupyter Notebook with the following command inside the Docker container:

jupyter notebook

This command is configured in the Docker container to launch a Jupyter Notebook without a browser on port 8888.

Jupyter notebook file: 20161117_docker_oly_genome_fastq_corruption.ipynb

I’ve embedded the notebook below, but it’s much easier to view (there are many lengthy commands/filenames that wrap lines in the embedded version below) the actual file linked above.

Share

Sequencing Data – C.gigas Larvae OA

Our sequencing data (Illumina HiSeq2500, 100SE) for this project has completed by Univ. of Oregon Genomics Core Facility (order number 2212).

Samples sequenced/pooled for this run:

Sample Treatment Barcode
400ppm 400ppm GCCAAT
1000ppm 1000ppm CTTGTA

 

All code listed below was run on OS X 10.9.5

Ran a bash script called “download.sh” to download all the files. The script contents were:

#!/bin/bash
curl -O http://gcf.uoregon.edu:8080/job/download/2212?fileName=lane2_NoIndex_L002_R1_001.fastq.gz
curl -O http://gcf.uoregon.edu:8080/job/download/2212?fileName=lane2_NoIndex_L002_R1_002.fastq.gz
curl -O http://gcf.uoregon.edu:8080/job/download/2212?fileName=lane2_NoIndex_L002_R1_003.fastq.gz
curl -O http://gcf.uoregon.edu:8080/job/download/2212?fileName=lane2_NoIndex_L002_R1_004.fastq.gz
curl -O http://gcf.uoregon.edu:8080/job/download/2212?fileName=lane2_NoIndex_L002_R1_005.fastq.gz
curl -O http://gcf.uoregon.edu:8080/job/download/2212?fileName=lane2_NoIndex_L002_R1_006.fastq.gz
curl -O http://gcf.uoregon.edu:8080/job/download/2212?fileName=lane2_NoIndex_L002_R1_007.fastq.gz
curl -O http://gcf.uoregon.edu:8080/job/download/2212?fileName=lane2_NoIndex_L002_R1_008.fastq.gz
curl -O http://gcf.uoregon.edu:8080/job/download/2212?fileName=lane2_NoIndex_L002_R1_009.fastq.gz
curl -O http://gcf.uoregon.edu:8080/job/download/2212?fileName=lane2_NoIndex_L002_R1_010.fastq.gz
curl -O http://gcf.uoregon.edu:8080/job/download/2212?fileName=lane2_NoIndex_L002_R1_011.fastq.gz
curl -O http://gcf.uoregon.edu:8080/job/download/2212?fileName=lane2_NoIndex_L002_R1_012.fastq.gz

 

Downloaded all 12 fastq.gz files to Owl/web/nightingales/C_gigas

Renamed all files by removing the beginning of each file name (2112?fileName=) and replacing that with 2212_:

$for file in 2212*lane2_NoIndex_L002_R1_0*; do mv "$file" "${file/#2212?fileName=/2212_}"; done

 

Created a directory readme.md (markdown) file to list & describe directory contents: readme.md

$ls *.gz >> readme.md

Note: In order for the readme file to appear in the web directory listing, the file cannot be all upper-case.

 

Create MD5 checksums for each the files: checkums.md5

$md5 2212* >> checksums.md5

Share

Sequencing Data – LSU C.virginica MBD BS-Seq

Our sequencing data (Illumina HiSeq2500, 100SE) for this project has completed by Univ. of Oregon Genomics Core Facility (order number 2112).

Samples sequenced/pooled for this run:

Sample Treatment Barcode
HB2 25,000ppm oil ATCACG
HB16 25,000ppm oil TTAGGC
HB30 25,000ppm oil TGACCA
NB3 No oil ACAGTG
NB6 No oil GCCAAT
NB11 No oil CAGATC

All code listed below was run on OS X 10.9.5

Downloaded all 15 fastq.gz files to Owl/web/nightingales/C_virginica:

$curl -O http://gcf.uoregon.edu:8080/job/download/2112?fileName=lane1_NoIndex_L001_R1_001.fastq.gz
$curl -O http://gcf.uoregon.edu:8080/job/download/2112?fileName=lane1_NoIndex_L001_R1_002.fastq.gz
$curl -O http://gcf.uoregon.edu:8080/job/download/2112?fileName=lane1_NoIndex_L001_R1_003.fastq.gz
$curl -O http://gcf.uoregon.edu:8080/job/download/2112?fileName=lane1_NoIndex_L001_R1_004.fastq.gz
$curl -O http://gcf.uoregon.edu:8080/job/download/2112?fileName=lane1_NoIndex_L001_R1_005.fastq.gz
$curl -O http://gcf.uoregon.edu:8080/job/download/2112?fileName=lane1_NoIndex_L001_R1_006.fastq.gz
$curl -O http://gcf.uoregon.edu:8080/job/download/2112?fileName=lane1_NoIndex_L001_R1_007.fastq.gz
$curl -O http://gcf.uoregon.edu:8080/job/download/2112?fileName=lane1_NoIndex_L001_R1_008.fastq.gz
$curl -O http://gcf.uoregon.edu:8080/job/download/2112?fileName=lane1_NoIndex_L001_R1_009.fastq.gz
$curl -O http://gcf.uoregon.edu:8080/job/download/2112?fileName=lane1_NoIndex_L001_R1_010.fastq.gz
$curl -O http://gcf.uoregon.edu:8080/job/download/2112?fileName=lane1_NoIndex_L001_R1_011.fastq.gz
$curl -O http://gcf.uoregon.edu:8080/job/download/2112?fileName=lane1_NoIndex_L001_R1_012.fastq.gz
$curl -O http://gcf.uoregon.edu:8080/job/download/2112?fileName=lane1_NoIndex_L001_R1_013.fastq.gz
$curl -O http://gcf.uoregon.edu:8080/job/download/2112?fileName=lane1_NoIndex_L001_R1_014.fastq.gz
$curl -O http://gcf.uoregon.edu:8080/job/download/2112?fileName=lane1_NoIndex_L001_R1_015.fastq.gz

 

Renamed all files by removing the beginning of each file name (2112?fileName=) and replacing that with 2112_:

$for file in 2112*lane1_NoIndex_L001_R1_0*; do mv "$file" "${file/#2112?fileName=/2112_}"; done

 

Created a directory readme.md (markdown) file to list & describe directory contents: readme.md

$ls *.gz >> readme.md

Note: In order for the readme file to appear in the web directory listing, the file cannot be all upper-case.

 

Created MD5 checksums for each fastq.gz file: checksums.md5

$md5 *.gz >> checksums.md5

Share