Tag Archives: awk

Data Analysis – Identification of duplicate files on Eagle

Recently, we’ve been bumping into our storage limit on Eagle (our Synology DS413):

 

Being fairly certain that there’s a significant amount of large datasets that is duplicated throughout Eagle, I ran a program on Linux called “fslint”. It searches for duplicates files based on a few parameters and is smart enough to be able to compare files with different filenames that share the same file contents!

I decided to check for duplicate files in the Eagle/archive folder and the Eagle/web folder. Initially, I tried searching for duplicates across all of Eagle, but after a week of running I got tired of waiting for results and ran the analysis on those two directories independently. As such, there is a possibility that there are more duplicates (consuming even more space) across the remainder of Eagle that have not been identified. However, this is a good starting point.

Here are the two output files from the fslint analysis:

 

To get a summary of the fslint output, I tallied the total amount of duplicates files that were >100MB in size. This was performed in a Jupyter notebook (see below):
Notebook Viewer: 20160114_wasted_space_synologies.ipynb
Jupyter (IPython) Notebook File: 20160114_wasted_space_synologies.ipynb

 

Here are the cleaned output files from the fslint analysis:

 

Summary

There are duplicates of files (>100MB in size) that are consuming at least 730GB!

Since the majority of these files exist in the Eagle/web folder, careful consideration will have to be taken in determining which duplicates (if any) can be deleted since it’s highly possible that there are notebooks that link to some of the files. Regardless, this analysis shows just how space is being consumed by the presence of large, duplicate files; something to consider for future data handling/storage/analysis with Eagle.

Share

From Ensenada and beyond

There have not been many posts recently, but that is not to say I have not been doing any science. Much of what I have been doing is numerous burst commits on the Panopea transcriptome paper / project. This can be found @ https://github.com/sr320/paper-pano-go.

You can see all the Jupyter nbs in this sub-directory. I will highlight some of my proudest cell moments here:

Ok the first one is so good I will just give you the whole thing except that I will provide just a little more comments ##

!wc -l analyses/Geoduck-transcriptome-v2.tab
  154407 analyses/Geoduck-transcriptome-v2.tab

!head -5 analyses/Geoduck_v2_blastn-NT.out
comp190_c0_seq1 gi  315593157   gb  CP002417.1      84.50   200 31  0   1   200 2271015 2271214 1e-47     198   Bacteria    Variovorax paradoxus EPS, complete genome   595537  Variovorax paradoxus EPS    Variovorax paradoxus EPS    b-proteobacteria
comp1900_c0_seq1    gi  481319564   gb  CP003293.1      100.00  271 0   0   1   271 1334370 1334640 1e-138    501   Bacteria    Propionibacterium acnes HL096PA1, complete genome   1134454 Propionibacterium acnes HL096PA1    Propionibacterium acnes HL096PA1    high GC Gram+
comp2164_c0_seq1    gi  221728669   gb  CP001392.1      98.47   261 4   0   1   261 721134  721394  2e-126    460   Bacteria    Acidovorax ebreus TPSY, complete genome 535289  Acidovorax ebreus TPSY  Acidovorax ebreus TPSY  b-proteobacteria
comp2742_c0_seq1    gi  527256352   ref XM_005146392.1      85.65   230 33  0   16  245 2293    2522    7e-61     243   Eukaryota   PREDICTED: Melopsittacus undulatus exostosin-like glycosyltransferase 3 (EXTL3), mRNA   13146   Melopsittacus undulatus budgerigar  birds
comp3315_c0_seq1    gi  156627645   gb  AC209228.1      79.13   206 36  6   3   202 76584   76380   1e-28     135   Eukaryota   Populus trichocarpa clone POP075-L19, complete sequence 3694    Populus trichocarpa black cottonwood    eudicots

#Lets subset above table to non Eukaryotes
#Here I am letting awk know we are dealing with tabs, 
#and I want to have all rows where column 17 is NOT Eukaryota.
!awk -F"t" '$17 != "Eukaryota" {print $1, $17 ,$15}' analyses/Geoduck_v2_blastn-NT.out 
> analyses/Non-Eukaryota-Geoduck-v2

!sort analyses/Non-Eukaryota-Geoduck-v2 > analyses/Non-Eukaryota-Geoduck-v2.sorted

!sort analyses/Geoduck-transcriptome-v2.tab > analyses/Geoduck-transcriptome-v2.sorted

!head -2 analyses/Geoduck-transcriptome-v2.sorted
comp100000_c0_seq1      TGAATGTATGTTTGTGAACGTATGTATATGAATGTATGTATGTGAATGCATACCATCTGTATAAGTATAATCCGACCGGGAGATGTTTATTCACACAGTTTGGCATTATGACGTTTAACCTCTGAATTGGCGTCTCTTGCTACTGACCGCTTCACAGTGATGACCCATGTTGTCACTTCTAATCTATTTATGTATTGCTCTTTTATATTATACTATTAACGCTGTTAAAATACACTACCGCTAAACAAATAAGAGTTGTGGGTTTGAATCATTGGTGAGTGCAGGAACCTCAGCAGGTCATTAAGATTTACGTGTACGTCTATCCTAAACCTACATGTTTCAACTTTGTTGTTTTTCGGTTTCGTTCTCTGTACACAGCTCTTGAAACATACATAAAATACCATTTTATTAAAAAATATGTCTCTATTTAATGTTAAAACCTTTTTAAGAAAA
comp100001_c1_seq1      GCTTTACCAGTTGTTACAAACATTTTAATAGTTATAGTTAATATACACAACATTTTAAATTAACTAGTGTCGAGAACTTGAGTCGGACATAGAGAATTAAATGTTTGTTGAACTTTAGCCAAGCACTTTTATTCTATTACTTTTTGAAGTATTTAATACCTTAAAATAATGGAATACTCCTGTAGAGTCCTTGAAGCCATCAACAATTTACCAACCTCCAAATAAAATATGAATATATTTTACATGATGAATTTACATAATGGATATATCATTGATATCTGCCAAGTTAACTTCACCTACCATTTTTAAGCTTACTTTGACCATGTTAGTTGGTATTGTGTATATAACGAGTGGGAGGACATTCATACCTGGCATTTGTTTGGTCAAACTGACACAAGATTTATGTTTATTTCAAACCTATATATAAAACAAGTCTCAATGAATATCTTCCTAGGCACAAGACAATGCTGATAAAATGTCTTGTTCAAGGACA


# joining with -v suppressing joined lines
!join -v 1 analyses/Geoduck-transcriptome-v2.sorted 
analyses/Non-Eukaryota-Geoduck-v2.sorted | wc -l
  153982

!join -v 1 analyses/Geoduck-transcriptome-v2.sorted 
analyses/Non-Eukaryota-Geoduck-v2.sorted 
> ../data-results/Geoduck-transcriptome-v3.tab

!head -2 ../data-results/Geoduck-transcriptome-v3.tab
comp100000_c0_seq1 TGAATGTATGTTTGTGAACGTATGTATATGAATGTATGTATGTGAATGCATACCATCTGTATAAGTATAATCCGACCGGGAGATGTTTATTCACACAGTTTGGCATTATGACGTTTAACCTCTGAATTGGCGTCTCTTGCTACTGACCGCTTCACAGTGATGACCCATGTTGTCACTTCTAATCTATTTATGTATTGCTCTTTTATATTATACTATTAACGCTGTTAAAATACACTACCGCTAAACAAATAAGAGTTGTGGGTTTGAATCATTGGTGAGTGCAGGAACCTCAGCAGGTCATTAAGATTTACGTGTACGTCTATCCTAAACCTACATGTTTCAACTTTGTTGTTTTTCGGTTTCGTTCTCTGTACACAGCTCTTGAAACATACATAAAATACCATTTTATTAAAAAATATGTCTCTATTTAATGTTAAAACCTTTTTAAGAAAA
comp100001_c1_seq1 GCTTTACCAGTTGTTACAAACATTTTAATAGTTATAGTTAATATACACAACATTTTAAATTAACTAGTGTCGAGAACTTGAGTCGGACATAGAGAATTAAATGTTTGTTGAACTTTAGCCAAGCACTTTTATTCTATTACTTTTTGAAGTATTTAATACCTTAAAATAATGGAATACTCCTGTAGAGTCCTTGAAGCCATCAACAATTTACCAACCTCCAAATAAAATATGAATATATTTTACATGATGAATTTACATAATGGATATATCATTGATATCTGCCAAGTTAACTTCACCTACCATTTTTAAGCTTACTTTGACCATGTTAGTTGGTATTGTGTATATAACGAGTGGGAGGACATTCATACCTGGCATTTGTTTGGTCAAACTGACACAAGATTTATGTTTATTTCAAACCTATATATAAAACAAGTCTCAATGAATATCTTCCTAGGCACAAGACAATGCTGATAAAATGTCTTGTTCAAGGACA

#Going from tab back to fasta!
!awk '{print ">"$1"n"$2}' ../data-results/Geoduck-transcriptome-v3.tab 
> ../data-results/Geoduck-transcriptome-v3.fa

!fgrep -c ">" ../data-results/Geoduck-transcriptome-v3.fa
153982

tldr:

'$c != string`  join -v  awk '{print ">"$1"n"$2}'

One more tidbit- I wanted to see how many blast hits were in the opposite direction "- frame".

thus:

!awk '($10-$9) < 0 {print $1, "t", ($10-$9)}' 
../data-results/Geoduck-tranv2-blastx_sprot.tab 
> analyses/Geoduck-tranv2-minus_direction.tab
!head analyses/Geoduck-tranv2-minus_direction.tab
!wc -l analyses/Geoduck-tranv2-minus_direction.tab

comp95_c0_seq1   -230
comp146_c0_seq1      -173
comp195_c0_seq1      -185
comp296_c0_seq1      -200
comp455_c1_seq1      -197
comp943_c0_seq1      -218
comp1059_c0_seq1     -227
comp1683_c0_seq1     -206
comp1868_c0_seq1     -308
comp1910_c1_seq1     -248
   10413 analyses/Geoduck-tranv2-minus_direction.tab
Share