Tag Archives: join

The Little Things

Getting back into gear, I am helping Andrew ID some targets from a salmonid transcriptome. I am taking the blast output for that transcriptome and pulling out protein names without SQLShare.

The tldr can be seen here, but if you have the time I will point out the key code aspects and leave you with a tabular file.


First we had the good ol' tr.

Annotation_1D4BD559.png
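The screenshot itself did not survive, so the exact command is not shown here. As a rough sketch only, assuming the tr step turned the pipe-delimited subject IDs in the blast output into tabs so that the accession lands in its own column (the sample line and variable names are illustrative, not taken from the real file):

```shell
# Hypothetical blastx line: query, pipe-delimited Swiss-Prot ID, percent identity
line=$(printf 'comp1_c0_seq1\tsp|P12345|AATM_RABIT\t85.0')

# tr swaps every pipe for a tab, so the accession becomes its own column
converted=$(printf '%s\n' "$line" | tr '|' '\t')

# With the query in column 1 and "sp" in column 2, the accession is column 3
accession=$(printf '%s\n' "$converted" | cut -f3)
echo "$accession"
```

If that assumption holds, it would explain why the join further down keys on column 3 of the blast table.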

Then I went ahead and downloaded the newest version of the Swiss-Prot details:


http://www.uniprot.org/uniprot/?query=reviewed%3ayes&force=yes&format=tab&columns=id,entry%20name,go-id,interactor,database(GO),go,reviewed,interpro,pathway,protein%20names,genes,tools,organism,length

Before joining I needed to sort.

Annotation_1D4BD5EC.png

And with the join I needed a few parameters

!join -t $'\t' -1 3 -2 1 \
blastx_sprot.sort \
/Users/sr320/git-repos/nb-2016/uniprot-reviewed.sort
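To make the sort-then-join pattern concrete, here is a minimal sketch on two toy files. The file names, accessions, and protein names are made up for illustration; only the flags (`-t`, `-1 3`, `-2 1`) mirror the call above.

```shell
tmp=$(mktemp -d)
tab=$(printf '\t')

# Hypothetical blast table: query, db tag, accession (join key in column 3)
printf 'compB\tsp\tQ00002\ncompA\tsp\tQ00001\n' > "$tmp/blast.tab"
# Hypothetical annotation table: accession (join key in column 1), protein name
printf 'Q00002\tprotein two\nQ00001\tprotein one\n' > "$tmp/uniprot.tab"

# join expects both inputs sorted on their join fields, in the same collation
LC_ALL=C sort -t "$tab" -k3,3 "$tmp/blast.tab" > "$tmp/blast.sort"
LC_ALL=C sort -t "$tab" -k1,1 "$tmp/uniprot.tab" > "$tmp/uniprot.sort"

# -t sets the delimiter to a tab; -1 3 and -2 1 pick the key column in each file
joined=$(LC_ALL=C join -t "$tab" -1 3 -2 1 "$tmp/blast.sort" "$tmp/uniprot.sort")
printf '%s\n' "$joined"
rm -r "$tmp"
```

The key comes out first on each joined line, followed by the remaining columns of the first file and then those of the second.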

And because we need to get to Excel
!open blastx-join-uniprot-info.tab -a "Microsoft Excel"

Voilà, a tab-delimited file is created that can be examined further.


From Ensenada and beyond

There have not been many posts recently, but that is not to say I have not been doing any science. Much of it has been bursts of commits to the Panopea transcriptome paper / project, which can be found @ https://github.com/sr320/paper-pano-go.

You can see all the Jupyter nbs in this sub-directory. I will highlight some of my proudest cell moments here:

OK, the first one is so good I will just give you the whole thing, with a few more comments added (##).

!wc -l analyses/Geoduck-transcriptome-v2.tab
  154407 analyses/Geoduck-transcriptome-v2.tab

!head -5 analyses/Geoduck_v2_blastn-NT.out
comp190_c0_seq1 gi  315593157   gb  CP002417.1      84.50   200 31  0   1   200 2271015 2271214 1e-47     198   Bacteria    Variovorax paradoxus EPS, complete genome   595537  Variovorax paradoxus EPS    Variovorax paradoxus EPS    b-proteobacteria
comp1900_c0_seq1    gi  481319564   gb  CP003293.1      100.00  271 0   0   1   271 1334370 1334640 1e-138    501   Bacteria    Propionibacterium acnes HL096PA1, complete genome   1134454 Propionibacterium acnes HL096PA1    Propionibacterium acnes HL096PA1    high GC Gram+
comp2164_c0_seq1    gi  221728669   gb  CP001392.1      98.47   261 4   0   1   261 721134  721394  2e-126    460   Bacteria    Acidovorax ebreus TPSY, complete genome 535289  Acidovorax ebreus TPSY  Acidovorax ebreus TPSY  b-proteobacteria
comp2742_c0_seq1    gi  527256352   ref XM_005146392.1      85.65   230 33  0   16  245 2293    2522    7e-61     243   Eukaryota   PREDICTED: Melopsittacus undulatus exostosin-like glycosyltransferase 3 (EXTL3), mRNA   13146   Melopsittacus undulatus budgerigar  birds
comp3315_c0_seq1    gi  156627645   gb  AC209228.1      79.13   206 36  6   3   202 76584   76380   1e-28     135   Eukaryota   Populus trichocarpa clone POP075-L19, complete sequence 3694    Populus trichocarpa black cottonwood    eudicots

#Let's subset the table above to non-Eukaryotes.
#Here I am letting awk know we are dealing with tabs,
#and I want all rows where column 17 is NOT Eukaryota.
!awk -F"\t" '$17 != "Eukaryota" {print $1, $17 ,$15}' analyses/Geoduck_v2_blastn-NT.out \
> analyses/Non-Eukaryota-Geoduck-v2

!sort analyses/Non-Eukaryota-Geoduck-v2 > analyses/Non-Eukaryota-Geoduck-v2.sorted

!sort analyses/Geoduck-transcriptome-v2.tab > analyses/Geoduck-transcriptome-v2.sorted
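A small sketch of the awk subset step on a toy table may help show why splitting on tabs matters: the description column contains spaces, so default whitespace splitting would shift the column numbers. The three-column layout and file name here are hypothetical, not the real 17+ column blast output.

```shell
tmp=$(mktemp -d)
# Hypothetical 3-column table: id, kingdom, description (which may contain spaces)
printf 'c1\tBacteria\tsome genome\nc2\tEukaryota\tsome mRNA\nc3\tArchaea\tother genome\n' > "$tmp/hits.tab"

# Split on tabs only; print id and kingdom for rows whose column 2 is not "Eukaryota"
non_euk=$(awk -F"\t" '$2 != "Eukaryota" {print $1, $2}' "$tmp/hits.tab")
printf '%s\n' "$non_euk"
rm -r "$tmp"
```

Because the fields in `print` are separated by commas, awk writes them with its default output separator, a single space.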

!head -2 analyses/Geoduck-transcriptome-v2.sorted
comp100000_c0_seq1      TGAATGTATGTTTGTGAACGTATGTATATGAATGTATGTATGTGAATGCATACCATCTGTATAAGTATAATCCGACCGGGAGATGTTTATTCACACAGTTTGGCATTATGACGTTTAACCTCTGAATTGGCGTCTCTTGCTACTGACCGCTTCACAGTGATGACCCATGTTGTCACTTCTAATCTATTTATGTATTGCTCTTTTATATTATACTATTAACGCTGTTAAAATACACTACCGCTAAACAAATAAGAGTTGTGGGTTTGAATCATTGGTGAGTGCAGGAACCTCAGCAGGTCATTAAGATTTACGTGTACGTCTATCCTAAACCTACATGTTTCAACTTTGTTGTTTTTCGGTTTCGTTCTCTGTACACAGCTCTTGAAACATACATAAAATACCATTTTATTAAAAAATATGTCTCTATTTAATGTTAAAACCTTTTTAAGAAAA
comp100001_c1_seq1      GCTTTACCAGTTGTTACAAACATTTTAATAGTTATAGTTAATATACACAACATTTTAAATTAACTAGTGTCGAGAACTTGAGTCGGACATAGAGAATTAAATGTTTGTTGAACTTTAGCCAAGCACTTTTATTCTATTACTTTTTGAAGTATTTAATACCTTAAAATAATGGAATACTCCTGTAGAGTCCTTGAAGCCATCAACAATTTACCAACCTCCAAATAAAATATGAATATATTTTACATGATGAATTTACATAATGGATATATCATTGATATCTGCCAAGTTAACTTCACCTACCATTTTTAAGCTTACTTTGACCATGTTAGTTGGTATTGTGTATATAACGAGTGGGAGGACATTCATACCTGGCATTTGTTTGGTCAAACTGACACAAGATTTATGTTTATTTCAAACCTATATATAAAACAAGTCTCAATGAATATCTTCCTAGGCACAAGACAATGCTGATAAAATGTCTTGTTCAAGGACA


# join with -v 1 prints only the lines from file 1 that do NOT pair with file 2
!join -v 1 analyses/Geoduck-transcriptome-v2.sorted \
analyses/Non-Eukaryota-Geoduck-v2.sorted | wc -l
  153982

!join -v 1 analyses/Geoduck-transcriptome-v2.sorted \
analyses/Non-Eukaryota-Geoduck-v2.sorted \
> ../data-results/Geoduck-transcriptome-v3.tab

!head -2 ../data-results/Geoduck-transcriptome-v3.tab
comp100000_c0_seq1 TGAATGTATGTTTGTGAACGTATGTATATGAATGTATGTATGTGAATGCATACCATCTGTATAAGTATAATCCGACCGGGAGATGTTTATTCACACAGTTTGGCATTATGACGTTTAACCTCTGAATTGGCGTCTCTTGCTACTGACCGCTTCACAGTGATGACCCATGTTGTCACTTCTAATCTATTTATGTATTGCTCTTTTATATTATACTATTAACGCTGTTAAAATACACTACCGCTAAACAAATAAGAGTTGTGGGTTTGAATCATTGGTGAGTGCAGGAACCTCAGCAGGTCATTAAGATTTACGTGTACGTCTATCCTAAACCTACATGTTTCAACTTTGTTGTTTTTCGGTTTCGTTCTCTGTACACAGCTCTTGAAACATACATAAAATACCATTTTATTAAAAAATATGTCTCTATTTAATGTTAAAACCTTTTTAAGAAAA
comp100001_c1_seq1 GCTTTACCAGTTGTTACAAACATTTTAATAGTTATAGTTAATATACACAACATTTTAAATTAACTAGTGTCGAGAACTTGAGTCGGACATAGAGAATTAAATGTTTGTTGAACTTTAGCCAAGCACTTTTATTCTATTACTTTTTGAAGTATTTAATACCTTAAAATAATGGAATACTCCTGTAGAGTCCTTGAAGCCATCAACAATTTACCAACCTCCAAATAAAATATGAATATATTTTACATGATGAATTTACATAATGGATATATCATTGATATCTGCCAAGTTAACTTCACCTACCATTTTTAAGCTTACTTTGACCATGTTAGTTGGTATTGTGTATATAACGAGTGGGAGGACATTCATACCTGGCATTTGTTTGGTCAAACTGACACAAGATTTATGTTTATTTCAAACCTATATATAAAACAAGTCTCAATGAATATCTTCCTAGGCACAAGACAATGCTGATAAAATGTCTTGTTCAAGGACA
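For anyone who has not used `join -v` as a filter, here is a minimal sketch on toy inputs (file names and contents are made up). Going by the line counts above, 154407 - 153982 = 425 lines of the transcriptome table had a match in the non-Eukaryota list and were dropped.

```shell
tmp=$(mktemp -d)
# Hypothetical sorted inputs: all contigs, and the contigs flagged as non-eukaryotic
printf 'c1 AAAA\nc2 CCCC\nc3 GGGG\n' > "$tmp/all.sorted"
printf 'c2 Bacteria\n' > "$tmp/flagged.sorted"

# -v 1 prints only the lines of file 1 that have no match in file 2
kept=$(LC_ALL=C join -v 1 "$tmp/all.sorted" "$tmp/flagged.sorted")
printf '%s\n' "$kept"
rm -r "$tmp"
```

The flagged contig `c2` is dropped, and the unmatched lines pass through with file 1's columns only.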

#Going from tab back to fasta!
!awk '{print ">"$1"\n"$2}' ../data-results/Geoduck-transcriptome-v3.tab \
> ../data-results/Geoduck-transcriptome-v3.fa

!fgrep -c ">" ../data-results/Geoduck-transcriptome-v3.fa
153982
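The table-to-fasta conversion and the header count can be tried on a toy table like this (a sketch with made-up IDs and sequences):

```shell
tmp=$(mktemp -d)
# Hypothetical two-column table: sequence id, sequence
printf 'c1 AAAA\nc3 GGGG\n' > "$tmp/seqs.tab"

# "\n" inside the awk string puts the sequence on the line after the ">" header
awk '{print ">"$1"\n"$2}' "$tmp/seqs.tab" > "$tmp/seqs.fa"

# One ">" header per input line is expected
headers=$(grep -c ">" "$tmp/seqs.fa")
first=$(head -1 "$tmp/seqs.fa")
echo "$headers"
rm -r "$tmp"
```

Counting the `>` lines is a quick sanity check that the fasta holds as many records as the table had rows.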

tldr:

'$c != "string"'   join -v   awk '{print ">"$1"\n"$2}'

One more tidbit: I wanted to see how many blast hits were in the opposite direction ("- frame").

Thus:

!awk '($10-$9) < 0 {print $1, "\t", ($10-$9)}' \
../data-results/Geoduck-tranv2-blastx_sprot.tab \
> analyses/Geoduck-tranv2-minus_direction.tab
!head analyses/Geoduck-tranv2-minus_direction.tab
!wc -l analyses/Geoduck-tranv2-minus_direction.tab

comp95_c0_seq1   -230
comp146_c0_seq1      -173
comp195_c0_seq1      -185
comp296_c0_seq1      -200
comp455_c1_seq1      -197
comp943_c0_seq1      -218
comp1059_c0_seq1     -227
comp1683_c0_seq1     -206
comp1868_c0_seq1     -308
comp1910_c1_seq1     -248
   10413 analyses/Geoduck-tranv2-minus_direction.tab
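The same end-minus-start test can be sketched on a toy table. The three-column layout is hypothetical, and setting `OFS` to a tab is a small variation on the command above (which prints a literal tab as its own field), so the output has exactly one tab between columns.

```shell
tmp=$(mktemp -d)
# Hypothetical 3-column table: id, start, end
printf 'c1\t10\t200\nc2\t300\t50\n' > "$tmp/coords.tab"

# Keep rows where end - start is negative, i.e. the hit runs in the minus direction
minus=$(awk -F"\t" -v OFS="\t" '($3 - $2) < 0 {print $1, $3 - $2}' "$tmp/coords.tab")
printf '%s\n' "$minus"
rm -r "$tmp"
```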