Tag Archives: bash

Data Management – Integrity Check of Final BGI Olympia Oyster & Geoduck Data

After completing the downloads of these files from BGI, I needed to verify that the downloaded copies matched the originals. Below is a Jupyter notebook detailing how I verified file integrity via MD5 checksums. It also highlights the importance of doing this check when working with large sequencing files (or just large files in general), as a few of them had mismatched MD5 checksums!
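The notebook has the full details, but the gist of the check is just comparing BGI's supplied MD5 values against freshly computed ones, along these lines (the checksum filename below is a placeholder, not the actual BGI file):

# Assumes BGI supplied an md5sum-formatted list ("<md5>  <filename>"); bgi_md5.txt is a placeholder name.
md5sum -c bgi_md5.txt > md5_check_results.txt
# Any line not ending in "OK" flags a file that needs to be re-downloaded.
grep -v ': OK$' md5_check_results.txt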

Although the notebook is embedded below, it might be easier viewing via the notebook link (hosted on GitHub).

At the end of the day, I had to re-download some files, but all the MD5 checksums now match and these data are ready for analysis:

Final Ostrea lurida genome files

Final Panopea generosa genome files

Jupyter Notebook: 20161214_docker_BGI_data_integrity_check.ipynb


Data Management – Script for Compiling qPCR Data

Over the last few weeks, I’ve wrestled with tracking down data (primarily qPCR data) from the litany of projects the Friedman Lab has had over the last decade or so. I’ve also noticed that it’s increasingly difficult for me to track down my own data from individual projects where data is not generated continuously over time, but in chunks. Tracking down 6 different qPCR runs that were conducted over the course of a year is tedious.

In order to save myself, and others who might need/want to review my data, a lot of time in the future, I decided to write a script that will allow me to compile all of my qPCR data into a single, massive CSV file. Accessing this CSV file via spreadsheet (Excel/Calc/Sheets) or database (SQL) means I’ll always be able to quickly search for all the related data I need, since it will reside in a single file!

The script is written in bash, called qpcr_aggregation.sh, and is currently hosted here: https://github.com/kubu4/Scripts/blob/master/bash/qpcr_aggregation.sh

Basically, the script does the following (a rough sketch appears after the list):

  • Replace the spaces in the BioRad filenames with underscores (this simplifies parsing of the filenames in downstream steps of the script)
  • Replace the header row to accommodate two new fields: qPCR_filename and qPCR_date
  • Add qPCR_filename and qPCR_date to each file.
  • Concatenate all the files into a single “master” CSV file.
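A rough sketch of those steps is below. This is not the actual qpcr_aggregation.sh (see the link above for that); the output filename, the assumption that a YYYYMMDD date is embedded in each BioRad filename, and the exact CSV layout are illustrative guesses.

#!/bin/bash
# Rough sketch only; see qpcr_aggregation.sh (linked above) for the real script.

# Replace spaces in the BioRad filenames with underscores to simplify parsing.
for file in *.csv; do
  [[ "$file" == *" "* ]] && mv "$file" "${file// /_}"
done

# Write a header row with the two new fields appended (based on the first file's header).
first_file=$(ls *.csv | head -1)
printf '%s,qPCR_filename,qPCR_date\n' "$(head -1 "$first_file")" > master_qpcr_data.csv

# Append each file's data rows, tagged with the source filename and the date parsed from it.
for file in *.csv; do
  [[ "$file" == master_qpcr_data.csv ]] && continue
  qpcr_date=$(echo "$file" | grep -oE '[0-9]{8}' | head -1)   # assumes a YYYYMMDD date in the filename
  tail -n +2 "$file" | sed "s/$/,${file},${qpcr_date}/" >> master_qpcr_data.csv
done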

Now, the easy part – exporting the data from hundreds of qPCR files…


Data Analysis – Identification of duplicate files on Eagle

Recently, we’ve been bumping into our storage limit on Eagle (our Synology DS413).

Being fairly certain that a significant amount of the large datasets are duplicated throughout Eagle, I ran a program on Linux called “fslint”. It searches for duplicate files based on a few parameters and is smart enough to compare files that have different filenames but share the same contents!

I decided to check for duplicate files in the Eagle/archive folder and the Eagle/web folder. Initially, I tried searching for duplicates across all of Eagle, but after a week of running I got tired of waiting for results and ran the analysis on those two directories independently. As such, there is a possibility that there are more duplicates (consuming even more space) across the remainder of Eagle that have not been identified. However, this is a good starting point.
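fslint is primarily a GUI tool, but it also installs command-line helpers; the per-directory scans can be run along these lines (the helper path varies by distribution, and the Eagle mount points shown are placeholders):

/usr/share/fslint/fslint/findup /mnt/eagle/archive > archive_duplicates.txt
/usr/share/fslint/fslint/findup /mnt/eagle/web > web_duplicates.txt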

Here are the two output files from the fslint analysis:

 

To get a summary of the fslint output, I tallied the total amount of space taken up by duplicate files that were >100MB in size. This was performed in a Jupyter notebook (see below):
Notebook Viewer: 20160114_wasted_space_synologies.ipynb
Jupyter (IPython) Notebook File: 20160114_wasted_space_synologies.ipynb
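The tally itself lives in the notebook above; for reference, an equivalent bash sketch, assuming the fslint output has first been reduced to one duplicate file path per line (duplicates.txt is a placeholder name):

# Sum the sizes of duplicate files larger than 100MB.
total=0
while read -r filepath; do
  [[ -f "$filepath" ]] || continue
  size=$(stat -c %s "$filepath")               # file size in bytes (GNU stat)
  (( size > 100*1024*1024 )) && total=$(( total + size ))
done < duplicates.txt
echo "Space consumed by duplicates >100MB: $(( total / 1024**3 )) GB"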

 

Here are the cleaned output files from the fslint analysis:

 

Summary

Duplicate files (>100MB in size) are consuming at least 730GB!

Since the majority of these files exist in the Eagle/web folder, careful consideration will be needed to determine which duplicates (if any) can be deleted, since it’s highly possible that notebooks link to some of these files. Regardless, this analysis shows just how much space is being consumed by large, duplicate files; something to consider for future data handling/storage/analysis with Eagle.
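One way to vet a candidate duplicate before deleting it is to grep the web share for references to its filename (the mount point and filename here are placeholders):

grep -rl --include='*.ipynb' --include='*.html' 'example_duplicate_file.fastq.gz' /mnt/eagle/web/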


Data Received – Oly 2bRAD Illumina Sequencing from Genewiz

The data was made available to us on 20151224 and took two days to download.

The full list of samples (and the individual samples/libraries/indexes) submitted to Genewiz for this project by Katherine Silliman & me can be seen here (Google Sheet): White_BS1511196_R2_barcodes

The data supplied were all of the Illumina output files (currently not entirely sure where/how we want to store all of this, but we’ll probably want to use them for attempting our own demultiplexing, since there was a significant number of reads that Genewiz was unable to demultiplex), in addition to demultiplexed FASTQ files. The FASTQ files were buried in inconvenient locations, and there are over 300 of them, so I used the power of the command line to find them and copy them to a single location: http://owl.fish.washington.edu/nightingales/O_lurida/2bRAD_Dec2015/

Find and copy all FASTQ files:

find /run/user/1000/gvfs/smb-share:server=owl.fish.washington.edu,share=home/ -name '*.fastq.*' -exec cp -n '{}' /run/user/1000/gvfs/smb-share:server=owl.fish.washington.edu,share=web/nightingales/O_lurida/2bRAD_Dec2015 \;

Code explanation:

find
  • Command line program used for searching for files
/run/user/1000/gvfs/smb-share:server=owl.fish.washington.edu,share=home/ 
  • Location of the files I wanted to search through. The path looks a little crazy because I was working remotely and had the server share mounted.
-name '*.fastq.*'
  • The -name argument tells the find command to look for filenames that contain “.fastq.” (e.g., gzipped FASTQ files ending in .fastq.gz).
-exec cp -n '{}'
  • The -exec option tells the find command to execute a subsequent action upon finding a match; the ‘{}’ is replaced by each matched file path. In this case, I’m using the copy command (cp) and telling it not to overwrite (clobber, -n) any duplicate files.
/run/user/1000/gvfs/smb-share:server=owl.fish.washington.edu,share=web/nightingales/O_lurida/2bRAD_Dec2015 \;
  • The location where I want the matched files copied. The escaped semicolon (\;) terminates the -exec action.

I created a readme file in the directory with these files: readme.md

I wanted to add some information about the project to the readme file, like total number of sequencing reads generated and the number of reads in each FASTQ file.

Here’s how to count the total number of reads generated in this project:

totalreads=0; for i in *.gz; do linecount=`gunzip -c "$i" | wc -l`; readcount=$((linecount/4)); totalreads=$((readcount+totalreads)); done; echo $totalreads

Total Reads: 588,396,334

Code explanation:

totalreads=0;
  • Creates variable called “totalreads” and initializes value to 0.
for i in *.gz;
  • Initiates a for loop to process any filenames that end with “.gz”. The FASTQ files have been compressed with gzip and end with the .gz extension.
do linecount=
  • Creates variable called “linecount” that stores the results of the following command:
`gunzip -c "$i" | wc -l`;
  • Unzips the files ($i) to stdout (-c) instead of actually uncompressing them. This is piped to the word count command, with the line flag (wc -l) to count the number of lines in the files.
readcount=$((linecount/4));
  • Divides the value stored in linecount by 4. This is because an entry for a single Illumina read comprises four lines. This value is stored in the “readcount” variable.
totalreads=$((readcount+totalreads));
  • Adds the readcount for the current file to the running totalreads value.
done;
  • End the for loop.
echo $totalreads
  • Prints the value of totalreads to the screen.

Next, I wanted to generate a list of the FASTQ files and corresponding read counts, and append this information to the readme file.

for i in *.gz; do linecount=`gunzip -c "$i" | wc -l`; readcount=$(($linecount/4)); printf "%s\t%s\n\n" "$i" "$readcount" >> readme.md; done

Code explanation:

for i in *.gz; do linecount=`gunzip -c "$i" | wc -l`; readcount=$(($linecount/4));
  • Same for loop as above that calculates the number of reads in each FASTQ file.
printf "%s\t%s\n\n" "$i" "$readcount" >> readme.md;
  • This formats the printed output. The "%s\t%s\n\n" portion prints the value in $i as a string (%s), followed by a tab (\t), followed by the value in $readcount as a string (%s), followed by two consecutive newlines (\n\n) to provide an empty line between entries. See the readme file linked above to see how the output looks.
>> readme.md; done
  • This appends the result from each loop to the readme.md file and ends the for loop (done).

Sample ID – Black Abalone DNA for RLOv qPCRs

Carolyn & Stan Langevin have agreed that the DNA helicase qPCR should be tested on 10 black abalone DNA extractions that fall into multiple levels of the Friedman Lab’s withering syndrome histology scoring.

Downloaded the Google Sheet (Black Abalone: Expt 1 – WS & Phage) as a CSV file. After downloading, I renamed the file (Black_Abalone.csv) to facilitate easier use in the following steps.

Created an SQLite database using Git Bash for Windows:
Change to directory where file is located:

$cd Downloads

Start sqlite:

$sqlite3

Tell sqlite that the field separator will be commas (i.e. CSV file):

sqlite>.separator ","

Import the CSV file and provide a name for the resulting table:

sqlite>.import Black_Abalone.csv BlackAbs

Set output display mode to column for easier reading:

sqlite>.mode column

Set output display to include column headers:

sqlite>.headers on

 

Three SELECT queries were then run: one for all samples with scores of 0 in both the PE and DG RLO fields, one for samples with scores of 1 in both fields, and one for samples with scores of 2 in both fields. (The screen caps of the query output are not reproduced here; a sketch of the queries follows.)
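The queries themselves were along these lines; the column names PE_RLO and DG_RLO are assumptions standing in for whatever headers sqlite imported from the sheet, and the values are quoted because .import stores fields as text:

sqlite>SELECT * FROM BlackAbs WHERE PE_RLO = '0' AND DG_RLO = '0';
sqlite>SELECT * FROM BlackAbs WHERE PE_RLO = '1' AND DG_RLO = '1';
sqlite>SELECT * FROM BlackAbs WHERE PE_RLO = '2' AND DG_RLO = '2';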

Here is the full set of results, grouped by RLO/RLOv score:

  • RLO/RLOv 0 (84 samples): 06:5-03, 06:5-04, 06:5-08, 06:5-09, 06:5-10, 06:5-11, 06:5-14, 06:5-16, 06:5-18, 06:5-20, 06:5-21, 06:5-22, 06:5-24, 06:5-30, 06:50-04, 06:50-05, 06:50-11, 06:50-12, 06:50-13, 06:50-15, 06:50-16, 06:6-01, 06:6-02, 06:6-03, 06:6-05, 06:6-08, 06:6-11, 06:6-12, 06:6-13, 06:6-15, 06:6-16, 06:6-17, 06:6-18, 06:6-20, 06:6-21, 06:6-22, 06:6-23, 06:6-24, 06:6-25, 06:6-26, 06:6-27, 06:6-28, 07:12-01, 07:12-02, 07:12-03, 07:12-04, 07:12-05, 07:12-06, 07:12-07, 07:12-09, 07:12-10, 07:12-13, 07:12-19, 08:3-01, 08:3-02, 08:3-03, 08:3-04, 08:3-13, 08:4-01, 08:4-02, 08:4-03, 08:4-04, 08:4-05, 08:4-06, 08:4-07, 08:4-08, 08:4-09, 08:4-10, 08:4-11, 08:4-12, 08:4-13, 08:4-14, 08:4-15, 08:4-16, 08:4-17, 08:4-18, 08:4-19, 08:4-20, 08:4-21, 08:4-22, 08:4-23, 08:4-24, 08:4-25, 08:5-06
  • RLO/RLOv 1 (12 samples): 06:5-35A, 06:50-08, 06:50-10, 06:6-32, 06:6-39, 06:6-42, 06:6-44, 06:6-52, 06:6-54, 07:12-18, 08:3-08, 08:3-10
  • RLO/RLOv 2 (8 samples): 06:5-31, 06:5-32B, 06:6-46, 06:6-49, 08:3-05, 08:3-07, 08:3-15, 08:3-16

I will select just 10 of those in the RLO/RLOv 0 group for use in qPCR.

I was able to track down the boxes where all of these DNAs are stored (see images below).

Boxes that were not labeled with accession numbers of the samples contained therein are now labeled.

Boxes that contained samples that belonged in other boxes were transferred to the appropriate box.

All boxes were located and returned to the big -20C in 240, on Lisa’s shelf.


From Ensenada and beyond

There have not been many posts recently, but that is not to say I have not been doing any science. Much of what I have been doing shows up as numerous bursts of commits on the Panopea transcriptome paper/project. This can be found @ https://github.com/sr320/paper-pano-go.

You can see all the Jupyter nbs in this sub-directory. I will highlight some of my proudest cell moments here:

Ok, the first one is so good I will just give you the whole thing, except that I will add a few more comments (##).

!wc -l analyses/Geoduck-transcriptome-v2.tab
  154407 analyses/Geoduck-transcriptome-v2.tab

!head -5 analyses/Geoduck_v2_blastn-NT.out
comp190_c0_seq1 gi  315593157   gb  CP002417.1      84.50   200 31  0   1   200 2271015 2271214 1e-47     198   Bacteria    Variovorax paradoxus EPS, complete genome   595537  Variovorax paradoxus EPS    Variovorax paradoxus EPS    b-proteobacteria
comp1900_c0_seq1    gi  481319564   gb  CP003293.1      100.00  271 0   0   1   271 1334370 1334640 1e-138    501   Bacteria    Propionibacterium acnes HL096PA1, complete genome   1134454 Propionibacterium acnes HL096PA1    Propionibacterium acnes HL096PA1    high GC Gram+
comp2164_c0_seq1    gi  221728669   gb  CP001392.1      98.47   261 4   0   1   261 721134  721394  2e-126    460   Bacteria    Acidovorax ebreus TPSY, complete genome 535289  Acidovorax ebreus TPSY  Acidovorax ebreus TPSY  b-proteobacteria
comp2742_c0_seq1    gi  527256352   ref XM_005146392.1      85.65   230 33  0   16  245 2293    2522    7e-61     243   Eukaryota   PREDICTED: Melopsittacus undulatus exostosin-like glycosyltransferase 3 (EXTL3), mRNA   13146   Melopsittacus undulatus budgerigar  birds
comp3315_c0_seq1    gi  156627645   gb  AC209228.1      79.13   206 36  6   3   202 76584   76380   1e-28     135   Eukaryota   Populus trichocarpa clone POP075-L19, complete sequence 3694    Populus trichocarpa black cottonwood    eudicots

#Let's subset the above table to non-Eukaryotes.
#Here I am letting awk know we are dealing with tabs,
#and I want all rows where column 17 is NOT Eukaryota.
!awk -F"\t" '$17 != "Eukaryota" {print $1, $17, $15}' analyses/Geoduck_v2_blastn-NT.out > analyses/Non-Eukaryota-Geoduck-v2

!sort analyses/Non-Eukaryota-Geoduck-v2 > analyses/Non-Eukaryota-Geoduck-v2.sorted

!sort analyses/Geoduck-transcriptome-v2.tab > analyses/Geoduck-transcriptome-v2.sorted

!head -2 analyses/Geoduck-transcriptome-v2.sorted
comp100000_c0_seq1      TGAATGTATGTTTGTGAACGTATGTATATGAATGTATGTATGTGAATGCATACCATCTGTATAAGTATAATCCGACCGGGAGATGTTTATTCACACAGTTTGGCATTATGACGTTTAACCTCTGAATTGGCGTCTCTTGCTACTGACCGCTTCACAGTGATGACCCATGTTGTCACTTCTAATCTATTTATGTATTGCTCTTTTATATTATACTATTAACGCTGTTAAAATACACTACCGCTAAACAAATAAGAGTTGTGGGTTTGAATCATTGGTGAGTGCAGGAACCTCAGCAGGTCATTAAGATTTACGTGTACGTCTATCCTAAACCTACATGTTTCAACTTTGTTGTTTTTCGGTTTCGTTCTCTGTACACAGCTCTTGAAACATACATAAAATACCATTTTATTAAAAAATATGTCTCTATTTAATGTTAAAACCTTTTTAAGAAAA
comp100001_c1_seq1      GCTTTACCAGTTGTTACAAACATTTTAATAGTTATAGTTAATATACACAACATTTTAAATTAACTAGTGTCGAGAACTTGAGTCGGACATAGAGAATTAAATGTTTGTTGAACTTTAGCCAAGCACTTTTATTCTATTACTTTTTGAAGTATTTAATACCTTAAAATAATGGAATACTCCTGTAGAGTCCTTGAAGCCATCAACAATTTACCAACCTCCAAATAAAATATGAATATATTTTACATGATGAATTTACATAATGGATATATCATTGATATCTGCCAAGTTAACTTCACCTACCATTTTTAAGCTTACTTTGACCATGTTAGTTGGTATTGTGTATATAACGAGTGGGAGGACATTCATACCTGGCATTTGTTTGGTCAAACTGACACAAGATTTATGTTTATTTCAAACCTATATATAAAACAAGTCTCAATGAATATCTTCCTAGGCACAAGACAATGCTGATAAAATGTCTTGTTCAAGGACA


# joining with -v 1 to print only lines from file 1 with no match in file 2 (i.e., suppressing the joined lines)
!join -v 1 analyses/Geoduck-transcriptome-v2.sorted analyses/Non-Eukaryota-Geoduck-v2.sorted | wc -l
  153982

!join -v 1 analyses/Geoduck-transcriptome-v2.sorted analyses/Non-Eukaryota-Geoduck-v2.sorted > ../data-results/Geoduck-transcriptome-v3.tab

!head -2 ../data-results/Geoduck-transcriptome-v3.tab
comp100000_c0_seq1 TGAATGTATGTTTGTGAACGTATGTATATGAATGTATGTATGTGAATGCATACCATCTGTATAAGTATAATCCGACCGGGAGATGTTTATTCACACAGTTTGGCATTATGACGTTTAACCTCTGAATTGGCGTCTCTTGCTACTGACCGCTTCACAGTGATGACCCATGTTGTCACTTCTAATCTATTTATGTATTGCTCTTTTATATTATACTATTAACGCTGTTAAAATACACTACCGCTAAACAAATAAGAGTTGTGGGTTTGAATCATTGGTGAGTGCAGGAACCTCAGCAGGTCATTAAGATTTACGTGTACGTCTATCCTAAACCTACATGTTTCAACTTTGTTGTTTTTCGGTTTCGTTCTCTGTACACAGCTCTTGAAACATACATAAAATACCATTTTATTAAAAAATATGTCTCTATTTAATGTTAAAACCTTTTTAAGAAAA
comp100001_c1_seq1 GCTTTACCAGTTGTTACAAACATTTTAATAGTTATAGTTAATATACACAACATTTTAAATTAACTAGTGTCGAGAACTTGAGTCGGACATAGAGAATTAAATGTTTGTTGAACTTTAGCCAAGCACTTTTATTCTATTACTTTTTGAAGTATTTAATACCTTAAAATAATGGAATACTCCTGTAGAGTCCTTGAAGCCATCAACAATTTACCAACCTCCAAATAAAATATGAATATATTTTACATGATGAATTTACATAATGGATATATCATTGATATCTGCCAAGTTAACTTCACCTACCATTTTTAAGCTTACTTTGACCATGTTAGTTGGTATTGTGTATATAACGAGTGGGAGGACATTCATACCTGGCATTTGTTTGGTCAAACTGACACAAGATTTATGTTTATTTCAAACCTATATATAAAACAAGTCTCAATGAATATCTTCCTAGGCACAAGACAATGCTGATAAAATGTCTTGTTCAAGGACA

#Going from tab back to fasta!
!awk '{print ">"$1"\n"$2}' ../data-results/Geoduck-transcriptome-v3.tab > ../data-results/Geoduck-transcriptome-v3.fa

!fgrep -c ">" ../data-results/Geoduck-transcriptome-v3.fa
153982

tldr:

awk '$c != string'    join -v    awk '{print ">"$1"\n"$2}'

One more tidbit - I wanted to see how many BLAST hits were in the opposite ("minus frame") direction.

thus:

!awk '($10-$9) < 0 {print $1, "\t", ($10-$9)}' ../data-results/Geoduck-tranv2-blastx_sprot.tab > analyses/Geoduck-tranv2-minus_direction.tab
!head analyses/Geoduck-tranv2-minus_direction.tab
!wc -l analyses/Geoduck-tranv2-minus_direction.tab

comp95_c0_seq1   -230
comp146_c0_seq1      -173
comp195_c0_seq1      -185
comp296_c0_seq1      -200
comp455_c1_seq1      -197
comp943_c0_seq1      -218
comp1059_c0_seq1     -227
comp1683_c0_seq1     -206
comp1868_c0_seq1     -308
comp1910_c1_seq1     -248
   10413 analyses/Geoduck-tranv2-minus_direction.tab