# Data Management – Integrity Check of Final BGI Olympia Oyster & Geoduck Data

After completing the downloads of these files from BGI, I needed to verify that the downloaded copies matched the originals. Below is a Jupyter Notebook detailing how I verified file integrity via MD5 checksums. It also highlights the importance of performing this check when working with large sequencing files (or just large files in general), as a few of them had mismatched MD5 checksums!
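The check itself can be sketched in a few lines; here is a minimal, self-contained version (the filename is hypothetical — BGI supplies its own checksum list alongside the data):

```shell
# Stand-in for a downloaded sequencing file (hypothetical name):
printf 'ACGT\n' > sample.fastq

# The sequencing center supplies a checksum list; regenerate one here to mimic it:
md5sum sample.fastq > md5.txt

# Verify the downloaded copy against the supplied checksums.
# Any file reported as FAILED needs to be re-downloaded and re-checked.
md5sum -c md5.txt
```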

Although the notebook is embedded below, it might be easier viewing via the notebook link (hosted on GitHub).

At the end of the day, I had to re-download some files, but all the MD5 checksums match and these data are ready for analysis:

Final Ostrea lurida genome files

Final Panopea generosa genome files

Jupyter Notebook: 20161214_docker_BGI_data_integrity_check.ipynb

# Data Management – Script for Compiling qPCR Data

Over the last few weeks, I’ve wrestled with tracking down data (primarily qPCR data) from the litany of projects the Friedman Lab has had over the last decade or so. I’ve also noticed that it’s increasingly difficult for me to track down my own data from individual projects where data is not generated continuously over time, but in chunks. Tracking down 6 different qPCR runs that were conducted over the course of a year is tedious.

In order to save myself, and others who might need/want to review my data, a lot of time in the future, I decided to write a script that will allow me to compile all of my qPCR data into a single, massive CSV file. Accessing this CSV file via spreadsheet (Excel/Calc/Sheets) or database (SQL) means I’ll always be able to quickly search for all the related data I need, since it will reside in a single file!

The script is written in bash, called qpcr_aggregation.sh, and is currently hosted here: https://github.com/kubu4/Scripts/blob/master/bash/qpcr_aggregation.sh

Basically, the script does the following:

• Replace the spaces in the BioRad filenames with underscores (simplifies parsing of the filename in downstream steps of the script)
• Replace the header row to accommodate two new fields: qPCR_filename and qPCR_date
• Add the qPCR_filename and qPCR_date values to each file
• Concatenate all the files into a single “master” CSV file
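A minimal sketch of those four steps, assuming a BioRad-style CSV export with the date embedded in the filename (the filenames and fields here are hypothetical — the real logic lives in qpcr_aggregation.sh):

```shell
# Hypothetical BioRad-style export with a space in the filename:
printf 'Well,Cq\nA1,21.3\n' > 'demo run_20161214.csv'

# 1. Replace spaces in filenames with underscores.
for f in *\ *.csv; do mv "$f" "${f// /_}"; done

# 2 & 3. Append qPCR_filename and qPCR_date to the header and to each data row.
for f in demo_run_*.csv; do
  date="${f##*_}"; date="${date%.csv}"   # parse the date out of the filename
  sed "1s/$/,qPCR_filename,qPCR_date/; 2,\$s/$/,${f},${date}/" "$f" > "tagged_${f}"
done

# 4. Concatenate everything into a single "master" CSV, keeping one header.
first=$(ls tagged_*.csv | head -n 1)
head -n 1 "$first" > master_qpcr.csv
tail -q -n +2 tagged_*.csv >> master_qpcr.csv
```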

Now, the easy part – exporting the data from hundreds of qPCR files…

# Data Analysis – Identification of duplicate files on Eagle

Recently, we’ve been bumping into our storage limit on Eagle (our Synology DS413):

Being fairly certain that a significant number of large datasets are duplicated throughout Eagle, I ran a Linux program called “fslint”. It searches for duplicate files based on a few parameters and is smart enough to compare files that share the same contents even when their filenames differ!

I decided to check for duplicate files in the Eagle/archive folder and the Eagle/web folder. Initially, I tried searching for duplicates across all of Eagle, but after a week of running I got tired of waiting for results and ran the analysis on those two directories independently. As such, there is a possibility that there are more duplicates (consuming even more space) across the remainder of Eagle that have not been identified. However, this is a good starting point.
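Conceptually, fslint’s duplicate search boils down to comparing file contents rather than names. A rough stand-in using checksums (the demo files here are hypothetical):

```shell
# Set up a tiny demo tree: two files with identical contents, one unique.
mkdir -p dup_demo
printf 'AAAA\n' > dup_demo/a.dat
printf 'AAAA\n' > dup_demo/copy_of_a.dat
printf 'BBBB\n' > dup_demo/b.dat

# Checksum every file, sort, and report only repeated checksums --
# identical contents get flagged even when the filenames differ.
find dup_demo -type f -exec md5sum {} + | sort | uniq -w 32 -D
```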

Here are the two output files from the fslint analysis:

To get a summary of the fslint output, I tallied the total number of duplicate files that were >100MB in size. This was performed in a Jupyter notebook (see below):
Notebook Viewer: 20160114_wasted_space_synologies.ipynb
Jupyter (IPython) Notebook File: 20160114_wasted_space_synologies.ipynb

Here are the cleaned output files from the fslint analysis:

Summary

There are duplicate files (>100MB in size) that are consuming at least 730GB!

Since the majority of these files exist in the Eagle/web folder, careful consideration will have to be taken in determining which duplicates (if any) can be deleted since it’s highly possible that there are notebooks that link to some of the files. Regardless, this analysis shows just how space is being consumed by the presence of large, duplicate files; something to consider for future data handling/storage/analysis with Eagle.

# Data Received – Oly 2bRAD Illumina Sequencing from Genewiz

The data were made available to us on 20151224 and took two days to download.

The full list of samples (and the individual samples/libraries/indexes) submitted to Genewiz for this project by Katherine Silliman & me can be seen here (Google Sheet): White_BS1511196_R2_barcodes

The data supplied were all of the Illumina output files (currently not entirely sure where/how we want to store all of this, but we’ll probably want to use them for attempting our own demultiplexing since there were a significant amount of reads that Genewiz was unable to demultiplex), in addition to demultiplexed FASTQ files. The FASTQ files were buried in inconvenient locations, and there are over 300 of them, so I used the power of the command line to find them and copy them to a single location: http://owl.fish.washington.edu/nightingales/O_lurida/2bRAD_Dec2015/

Find and copy all FASTQ files:

```shell
find /run/user/1000/gvfs/smb-share:server=owl.fish.washington.edu,share=home/ \
-name '*.fastq.*' \
-exec cp -n '{}' /run/user/1000/gvfs/smb-share:server=owl.fish.washington.edu,share=web/nightingales/O_lurida/2bRAD_Dec2015 \;
```

Code explanation:

find
• Command-line program used for searching for files.
/run/user/1000/gvfs/smb-share:server=owl.fish.washington.edu,share=home/
• Location of the files I wanted to search through. The path looks a little crazy because I was working remotely and had the server share mounted.
-name '*.fastq.*'
• The name argument tells the find command to look for filenames containing “.fastq”.
-exec cp -n '{}'
• The exec option tells the find command to execute a subsequent action upon finding a match. In this case, I’m using the copy command (cp) and telling the program not to overwrite (clobber, -n) any duplicate files.
/run/user/1000/gvfs/smb-share:server=owl.fish.washington.edu,share=web/nightingales/O_lurida/2bRAD_Dec2015 \;
• The location where I want the matched files copied. The escaped semicolon (\;) terminates the -exec action.

I created a readme file in the directory with these files: readme.md

I wanted to add some information about the project to the readme file, like total number of sequencing reads generated and the number of reads in each FASTQ file.

Here’s how to count the total of all reads generated in this project

```shell
totalreads=0; for i in *.gz; do linecount=$(gunzip -c "$i" | wc -l); readcount=$((linecount/4)); totalreads=$((readcount+totalreads)); done; echo $totalreads
```

Code explanation:

totalreads=0;
• Creates a variable called “totalreads” and initializes its value to 0.
for i in *.gz;
• Initiates a for loop to process any filenames that end with “.gz”. The FASTQ files have been compressed with gzip and end with the .gz extension.
do linecount=$(gunzip -c "$i" | wc -l);
• Creates a variable called “linecount” that stores the result of the following: unzip the file ($i) to stdout (-c) instead of actually uncompressing it, piped to the word count command with the line flag (wc -l) to count the number of lines in the file.
readcount=$((linecount/4));
• Divides the value stored in linecount by 4, because an entry for a single Illumina read comprises four lines. This value is stored in the “readcount” variable.
totalreads=$((readcount+totalreads));
• Adds the readcount for the current file to the value of totalreads.
done;
• Ends the for loop.
echo $totalreads
• Prints the value of totalreads to the screen.

Next, I wanted to generate a list of the FASTQ files and corresponding read counts, and append this information to the readme file:

```shell
for i in *.gz; do linecount=$(gunzip -c "$i" | wc -l); readcount=$((linecount/4)); printf "%s\t%s\n" "$i" "$readcount" >> readme.md; done
```

Code explanation:

Start sqlite:

```
$ sqlite3
```

Tell sqlite that the field separator will be commas (i.e. a CSV file):

```
sqlite> .separator ","
```

Import the CSV file and provide a name for the resulting database table:

```
sqlite> .import Black_Abalone.csv BlackAbs
```

Set output display mode to column for easier reading:

```
sqlite> .mode column
```

Set output display to include column headers:

```
sqlite> .headers on
```

To select all the samples that have scores of 0 in both PE and DG RLO fields (screen cap does not show entire output list):

To select all the samples that have scores of 1 in both PE and DG RLO fields:

To select all the samples that have scores of 2 in both PE and DG RLO fields:

Here is the full set of results in a table:

| RLO/RLOv 0 | RLO/RLOv 1 | RLO/RLOv 2 |
|------------|------------|------------|
| 06:5-03 | 06:5-35A | 06:5-31 |
| 06:5-04 | 06:50-08 | 06:5-32B |
| 06:5-08 | 06:50-10 | 06:6-46 |
| 06:5-09 | 06:6-32 | 06:6-49 |
| 06:5-10 | 06:6-39 | 08:3-05 |
| 06:5-11 | 06:6-42 | 08:3-07 |
| 06:5-14 | 06:6-44 | 08:3-15 |
| 06:5-16 | 06:6-52 | 08:3-16 |
| 06:5-18 | 06:6-54 | |
| 06:5-20 | 07:12-18 | |
| 06:5-21 | 08:3-08 | |
| 06:5-22 | 08:3-10 | |
| 06:5-24 | | |
| 06:5-30 | | |
| 06:50-04 | | |
| 06:50-05 | | |
| 06:50-11 | | |
| 06:50-12 | | |
| 06:50-13 | | |
| 06:50-15 | | |
| 06:50-16 | | |
| 06:6-01 | | |
| 06:6-02 | | |
| 06:6-03 | | |
| 06:6-05 | | |
| 06:6-08 | | |
| 06:6-11 | | |
| 06:6-12 | | |
| 06:6-13 | | |
| 06:6-15 | | |
| 06:6-16 | | |
| 06:6-17 | | |
| 06:6-18 | | |
| 06:6-20 | | |
| 06:6-21 | | |
| 06:6-22 | | |
| 06:6-23 | | |
| 06:6-24 | | |
| 06:6-25 | | |
| 06:6-26 | | |
| 06:6-27 | | |
| 06:6-28 | | |
| 07:12-01 | | |
| 07:12-02 | | |
| 07:12-03 | | |
| 07:12-04 | | |
| 07:12-05 | | |
| 07:12-06 | | |
| 07:12-07 | | |
| 07:12-09 | | |
| 07:12-10 | | |
| 07:12-13 | | |
| 07:12-19 | | |
| 08:3-01 | | |
| 08:3-02 | | |
| 08:3-03 | | |
| 08:3-04 | | |
| 08:3-13 | | |
| 08:4-01 | | |
| 08:4-02 | | |
| 08:4-03 | | |
| 08:4-04 | | |
| 08:4-05 | | |
| 08:4-06 | | |
| 08:4-07 | | |
| 08:4-08 | | |
| 08:4-09 | | |
| 08:4-10 | | |
| 08:4-11 | | |
| 08:4-12 | | |
| 08:4-13 | | |
| 08:4-14 | | |
| 08:4-15 | | |
| 08:4-16 | | |
| 08:4-17 | | |
| 08:4-18 | | |
| 08:4-19 | | |
| 08:4-20 | | |
| 08:4-21 | | |
| 08:4-22 | | |
| 08:4-23 | | |
| 08:4-24 | | |
| 08:4-25 | | |
| 08:5-06 | | |

Will select just 10 of those in the RLO/RLOv 0 column for use in qPCR.

I was able to track down the boxes where all of these DNAs were stored (see images below). Boxes that were not labeled with the accession numbers of the samples contained therein are now labeled. Boxes that contained samples belonging in other boxes had those samples transferred to the appropriate box. All boxes were located and returned to the big -20C in 240 on Lisa’s shelf.
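The selection queries themselves were only captured in the screen caps, but they were of this general shape. In this sketch the column names (PE_RLO, DG_RLO) and the demo CSV are hypothetical stand-ins; note that .import creates TEXT columns, hence the quoted comparison values:

```shell
# Tiny stand-in for Black_Abalone.csv (hypothetical column names):
printf 'Accession,PE_RLO,DG_RLO\n06:5-03,0,0\n06:5-35A,1,1\n06:5-31,2,2\n' > BlackAbs_demo.csv

sqlite3 blackabs_demo.db <<'SQL'
.mode csv
.import BlackAbs_demo.csv BlackAbs
.mode column
.headers on
-- Samples scoring 0 in both fields (values import as text, hence the quotes):
SELECT * FROM BlackAbs WHERE PE_RLO = '0' AND DG_RLO = '0';
SQL
```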
# From Ensenada and beyond

There have not been many posts recently, but that is not to say I have not been doing any science. Much of what I have been doing is numerous burst commits on the Panopea transcriptome paper/project. This can be found at https://github.com/sr320/paper-pano-go. You can see all the Jupyter notebooks in this sub-directory. I will highlight some of my proudest cell moments here.

Ok, the first one is so good I will just give you the whole thing, except that I will provide a little more commentary:

```
!wc -l analyses/Geoduck-transcriptome-v2.tab
154407 analyses/Geoduck-transcriptome-v2.tab

!head -5 analyses/Geoduck_v2_blastn-NT.out
comp190_c0_seq1 gi 315593157 gb CP002417.1 84.50 200 31 0 1 200 2271015 2271214 1e-47 198 Bacteria Variovorax paradoxus EPS, complete genome 595537 Variovorax paradoxus EPS Variovorax paradoxus EPS b-proteobacteria
comp1900_c0_seq1 gi 481319564 gb CP003293.1 100.00 271 0 0 1 271 1334370 1334640 1e-138 501 Bacteria Propionibacterium acnes HL096PA1, complete genome 1134454 Propionibacterium acnes HL096PA1 Propionibacterium acnes HL096PA1 high GC Gram+
comp2164_c0_seq1 gi 221728669 gb CP001392.1 98.47 261 4 0 1 261 721134 721394 2e-126 460 Bacteria Acidovorax ebreus TPSY, complete genome 535289 Acidovorax ebreus TPSY Acidovorax ebreus TPSY b-proteobacteria
comp2742_c0_seq1 gi 527256352 ref XM_005146392.1 85.65 230 33 0 16 245 2293 2522 7e-61 243 Eukaryota PREDICTED: Melopsittacus undulatus exostosin-like glycosyltransferase 3 (EXTL3), mRNA 13146 Melopsittacus undulatus budgerigar birds
comp3315_c0_seq1 gi 156627645 gb AC209228.1 79.13 206 36 6 3 202 76584 76380 1e-28 135 Eukaryota Populus trichocarpa clone POP075-L19, complete sequence 3694 Populus trichocarpa black cottonwood eudicots
```

```
# Let's subset the above table to non-Eukaryotes.
# Here I am letting awk know we are dealing with tabs,
# and I want to have all rows where column 17 is NOT Eukaryota.
!awk -F"\t" '$17 != "Eukaryota" {print $1,$17,$15}' analyses/Geoduck_v2_blastn-NT.out > analyses/Non-Eukaryota-Geoduck-v2

!sort analyses/Non-Eukaryota-Geoduck-v2 > analyses/Non-Eukaryota-Geoduck-v2.sorted
!sort analyses/Geoduck-transcriptome-v2.tab > analyses/Geoduck-transcriptome-v2.sorted

!head -2 analyses/Geoduck-transcriptome-v2.sorted
comp100000_c0_seq1 TGAATGTATGTTTGTGAACGTATGTATATGAATGTATGTATGTGAATGCATACCATCTGTATAAGTATAATCCGACCGGGAGATGTTTATTCACACAGTTTGGCATTATGACGTTTAACCTCTGAATTGGCGTCTCTTGCTACTGACCGCTTCACAGTGATGACCCATGTTGTCACTTCTAATCTATTTATGTATTGCTCTTTTATATTATACTATTAACGCTGTTAAAATACACTACCGCTAAACAAATAAGAGTTGTGGGTTTGAATCATTGGTGAGTGCAGGAACCTCAGCAGGTCATTAAGATTTACGTGTACGTCTATCCTAAACCTACATGTTTCAACTTTGTTGTTTTTCGGTTTCGTTCTCTGTACACAGCTCTTGAAACATACATAAAATACCATTTTATTAAAAAATATGTCTCTATTTAATGTTAAAACCTTTTTAAGAAAA
comp100001_c1_seq1 GCTTTACCAGTTGTTACAAACATTTTAATAGTTATAGTTAATATACACAACATTTTAAATTAACTAGTGTCGAGAACTTGAGTCGGACATAGAGAATTAAATGTTTGTTGAACTTTAGCCAAGCACTTTTATTCTATTACTTTTTGAAGTATTTAATACCTTAAAATAATGGAATACTCCTGTAGAGTCCTTGAAGCCATCAACAATTTACCAACCTCCAAATAAAATATGAATATATTTTACATGATGAATTTACATAATGGATATATCATTGATATCTGCCAAGTTAACTTCACCTACCATTTTTAAGCTTACTTTGACCATGTTAGTTGGTATTGTGTATATAACGAGTGGGAGGACATTCATACCTGGCATTTGTTTGGTCAAACTGACACAAGATTTATGTTTATTTCAAACCTATATATAAAACAAGTCTCAATGAATATCTTCCTAGGCACAAGACAATGCTGATAAAATGTCTTGTTCAAGGACA

# joining with -v, suppressing joined lines
!join -v 1 analyses/Geoduck-transcriptome-v2.sorted analyses/Non-Eukaryota-Geoduck-v2.sorted | wc -l
153982

!join -v 1 analyses/Geoduck-transcriptome-v2.sorted analyses/Non-Eukaryota-Geoduck-v2.sorted > ../data-results/Geoduck-transcriptome-v3.tab

!head -2 ../data-results/Geoduck-transcriptome-v3.tab
comp100000_c0_seq1 TGAATGTATGTTTGTGAACGTATGTATATGAATGTATGTATGTGAATGCATACCATCTGTATAAGTATAATCCGACCGGGAGATGTTTATTCACACAGTTTGGCATTATGACGTTTAACCTCTGAATTGGCGTCTCTTGCTACTGACCGCTTCACAGTGATGACCCATGTTGTCACTTCTAATCTATTTATGTATTGCTCTTTTATATTATACTATTAACGCTGTTAAAATACACTACCGCTAAACAAATAAGAGTTGTGGGTTTGAATCATTGGTGAGTGCAGGAACCTCAGCAGGTCATTAAGATTTACGTGTACGTCTATCCTAAACCTACATGTTTCAACTTTGTTGTTTTTCGGTTTCGTTCTCTGTACACAGCTCTTGAAACATACATAAAATACCATTTTATTAAAAAATATGTCTCTATTTAATGTTAAAACCTTTTTAAGAAAA
comp100001_c1_seq1 GCTTTACCAGTTGTTACAAACATTTTAATAGTTATAGTTAATATACACAACATTTTAAATTAACTAGTGTCGAGAACTTGAGTCGGACATAGAGAATTAAATGTTTGTTGAACTTTAGCCAAGCACTTTTATTCTATTACTTTTTGAAGTATTTAATACCTTAAAATAATGGAATACTCCTGTAGAGTCCTTGAAGCCATCAACAATTTACCAACCTCCAAATAAAATATGAATATATTTTACATGATGAATTTACATAATGGATATATCATTGATATCTGCCAAGTTAACTTCACCTACCATTTTTAAGCTTACTTTGACCATGTTAGTTGGTATTGTGTATATAACGAGTGGGAGGACATTCATACCTGGCATTTGTTTGGTCAAACTGACACAAGATTTATGTTTATTTCAAACCTATATATAAAACAAGTCTCAATGAATATCTTCCTAGGCACAAGACAATGCTGATAAAATGTCTTGTTCAAGGACA

# Going from tab back to fasta!
!awk '{print ">"$1"\n"$2}' ../data-results/Geoduck-transcriptome-v3.tab > ../data-results/Geoduck-transcriptome-v3.fa

!fgrep -c ">" ../data-results/Geoduck-transcriptome-v3.fa
153982
```

tl;dr: `'$c != string'`, `join -v`, and `awk '{print ">"$1"\n"$2}'`.
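The join -v filtering pattern generalizes nicely; here is a toy run (made-up sequence IDs) showing removal of a contaminant list from a sorted tab file and conversion back to FASTA:

```shell
# Sorted master table (ID <tab> sequence) and a sorted exclusion list:
printf 'seq1\tACGT\nseq2\tGGCC\nseq3\tTTAA\n' | sort > all.tab
printf 'seq2\n' | sort > exclude.txt

# -v 1 keeps lines from file 1 whose first field has NO match in file 2:
join -v 1 all.tab exclude.txt > kept.tab

# Tab back to FASTA, then count the records:
awk '{print ">"$1"\n"$2}' kept.tab > kept.fa
fgrep -c ">" kept.fa
```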


One more tidbit: I wanted to see how many blast hits were in the opposite direction (“- frame”).

Thus:

```
!awk '($10-$9) < 0 {print $1, "\t", ($10-$9)}' ../data-results/Geoduck-tranv2-blastx_sprot.tab > analyses/Geoduck-tranv2-minus_direction.tab

!wc -l analyses/Geoduck-tranv2-minus_direction.tab
```

Sample lines from the resulting file, followed by the line count:

```
comp95_c0_seq1   -230
comp146_c0_seq1      -173
comp195_c0_seq1      -185
comp296_c0_seq1      -200
comp455_c1_seq1      -197
comp943_c0_seq1      -218
comp1059_c0_seq1     -227
comp1683_c0_seq1     -206
comp1868_c0_seq1     -308
comp1910_c1_seq1     -248
10413 analyses/Geoduck-tranv2-minus_direction.tab
```
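The coordinate test in that awk one-liner can be seen on a toy table (made-up IDs; columns 9 and 10 play the role of the alignment start/end coordinates compared above):

```shell
# Two fake 10-column hits; the second has end < start (minus direction).
printf 'compA x x x x x x x 100 330\ncompB x x x x x x x 400 170\n' > toy_hits.tab

# Keep only rows where (end - start) is negative:
awk '($10-$9) < 0 {print $1, "\t", ($10-$9)}' toy_hits.tab
```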