Tag Archives: blast

The Little Things

Getting back into gear, I am assisting Andrew ID some targets from a salmonid transcriptome. With said transcriptome I am taking the blast output and getting some protein names sans SQLshare.

The tldr can be seen here, but if you have the time I will point out the key code aspects and leave you with a tabular file.


First we had the good ol tr.

Annotation_1D4BD559.png

Then I went ahead and downloaded the newest version of Swiss-prot details


http://www.uniprot.org/uniprot/?query=reviewed%3ayes&force=yes&format=tab&columns=id,entry%20name,go-id,interactor,database(GO),go,reviewed,interpro,pathway,protein%20names,genes,tools,organism,length"

Before joining I needed to sort.

Annotation_1D4BD5EC.png

And with the join I needed a few parameters

!join -t $'t' -1 3 -2 1
blastx_sprot.sort
/Users/sr320/git-repos/nb-2016/uniprot-reviewed.sort

And because we need to get to Excel
!open blastx-join-uniprot-info.tab -a "Microsoft Excel"

Volia a tab file is created that can be examined further.

Share

From Ensenada and beyond

There have not been many posts recently, but that is not to say I have not been doing any science. Much of what I have been doing is numerous burst commits on the Panopea transcriptome paper / project. This can be found @ https://github.com/sr320/paper-pano-go.

You can see all the Jupyter nbs in this sub-directory. I will highlight some of my proudest cell moments here:

Ok the first one is so good I will just give you the whole thing except that I will provide just a little more comments ##

!wc -l analyses/Geoduck-transcriptome-v2.tab
  154407 analyses/Geoduck-transcriptome-v2.tab

!head -5 analyses/Geoduck_v2_blastn-NT.out
comp190_c0_seq1 gi  315593157   gb  CP002417.1      84.50   200 31  0   1   200 2271015 2271214 1e-47     198   Bacteria    Variovorax paradoxus EPS, complete genome   595537  Variovorax paradoxus EPS    Variovorax paradoxus EPS    b-proteobacteria
comp1900_c0_seq1    gi  481319564   gb  CP003293.1      100.00  271 0   0   1   271 1334370 1334640 1e-138    501   Bacteria    Propionibacterium acnes HL096PA1, complete genome   1134454 Propionibacterium acnes HL096PA1    Propionibacterium acnes HL096PA1    high GC Gram+
comp2164_c0_seq1    gi  221728669   gb  CP001392.1      98.47   261 4   0   1   261 721134  721394  2e-126    460   Bacteria    Acidovorax ebreus TPSY, complete genome 535289  Acidovorax ebreus TPSY  Acidovorax ebreus TPSY  b-proteobacteria
comp2742_c0_seq1    gi  527256352   ref XM_005146392.1      85.65   230 33  0   16  245 2293    2522    7e-61     243   Eukaryota   PREDICTED: Melopsittacus undulatus exostosin-like glycosyltransferase 3 (EXTL3), mRNA   13146   Melopsittacus undulatus budgerigar  birds
comp3315_c0_seq1    gi  156627645   gb  AC209228.1      79.13   206 36  6   3   202 76584   76380   1e-28     135   Eukaryota   Populus trichocarpa clone POP075-L19, complete sequence 3694    Populus trichocarpa black cottonwood    eudicots

#Lets subset above table to non Eukaryotes
#Here I am letting awk know we are dealing with tabs, 
#and I want to have all rows where column 17 is NOT Eukaryota.
!awk -F"t" '$17 != "Eukaryota" {print $1, $17 ,$15}' analyses/Geoduck_v2_blastn-NT.out 
> analyses/Non-Eukaryota-Geoduck-v2

!sort analyses/Non-Eukaryota-Geoduck-v2 > analyses/Non-Eukaryota-Geoduck-v2.sorted

!sort analyses/Geoduck-transcriptome-v2.tab > analyses/Geoduck-transcriptome-v2.sorted

!head -2 analyses/Geoduck-transcriptome-v2.sorted
comp100000_c0_seq1      TGAATGTATGTTTGTGAACGTATGTATATGAATGTATGTATGTGAATGCATACCATCTGTATAAGTATAATCCGACCGGGAGATGTTTATTCACACAGTTTGGCATTATGACGTTTAACCTCTGAATTGGCGTCTCTTGCTACTGACCGCTTCACAGTGATGACCCATGTTGTCACTTCTAATCTATTTATGTATTGCTCTTTTATATTATACTATTAACGCTGTTAAAATACACTACCGCTAAACAAATAAGAGTTGTGGGTTTGAATCATTGGTGAGTGCAGGAACCTCAGCAGGTCATTAAGATTTACGTGTACGTCTATCCTAAACCTACATGTTTCAACTTTGTTGTTTTTCGGTTTCGTTCTCTGTACACAGCTCTTGAAACATACATAAAATACCATTTTATTAAAAAATATGTCTCTATTTAATGTTAAAACCTTTTTAAGAAAA
comp100001_c1_seq1      GCTTTACCAGTTGTTACAAACATTTTAATAGTTATAGTTAATATACACAACATTTTAAATTAACTAGTGTCGAGAACTTGAGTCGGACATAGAGAATTAAATGTTTGTTGAACTTTAGCCAAGCACTTTTATTCTATTACTTTTTGAAGTATTTAATACCTTAAAATAATGGAATACTCCTGTAGAGTCCTTGAAGCCATCAACAATTTACCAACCTCCAAATAAAATATGAATATATTTTACATGATGAATTTACATAATGGATATATCATTGATATCTGCCAAGTTAACTTCACCTACCATTTTTAAGCTTACTTTGACCATGTTAGTTGGTATTGTGTATATAACGAGTGGGAGGACATTCATACCTGGCATTTGTTTGGTCAAACTGACACAAGATTTATGTTTATTTCAAACCTATATATAAAACAAGTCTCAATGAATATCTTCCTAGGCACAAGACAATGCTGATAAAATGTCTTGTTCAAGGACA


# joining with -v suppressing joined lines
!join -v 1 analyses/Geoduck-transcriptome-v2.sorted 
analyses/Non-Eukaryota-Geoduck-v2.sorted | wc -l
  153982

!join -v 1 analyses/Geoduck-transcriptome-v2.sorted 
analyses/Non-Eukaryota-Geoduck-v2.sorted 
> ../data-results/Geoduck-transcriptome-v3.tab

!head -2 ../data-results/Geoduck-transcriptome-v3.tab
comp100000_c0_seq1 TGAATGTATGTTTGTGAACGTATGTATATGAATGTATGTATGTGAATGCATACCATCTGTATAAGTATAATCCGACCGGGAGATGTTTATTCACACAGTTTGGCATTATGACGTTTAACCTCTGAATTGGCGTCTCTTGCTACTGACCGCTTCACAGTGATGACCCATGTTGTCACTTCTAATCTATTTATGTATTGCTCTTTTATATTATACTATTAACGCTGTTAAAATACACTACCGCTAAACAAATAAGAGTTGTGGGTTTGAATCATTGGTGAGTGCAGGAACCTCAGCAGGTCATTAAGATTTACGTGTACGTCTATCCTAAACCTACATGTTTCAACTTTGTTGTTTTTCGGTTTCGTTCTCTGTACACAGCTCTTGAAACATACATAAAATACCATTTTATTAAAAAATATGTCTCTATTTAATGTTAAAACCTTTTTAAGAAAA
comp100001_c1_seq1 GCTTTACCAGTTGTTACAAACATTTTAATAGTTATAGTTAATATACACAACATTTTAAATTAACTAGTGTCGAGAACTTGAGTCGGACATAGAGAATTAAATGTTTGTTGAACTTTAGCCAAGCACTTTTATTCTATTACTTTTTGAAGTATTTAATACCTTAAAATAATGGAATACTCCTGTAGAGTCCTTGAAGCCATCAACAATTTACCAACCTCCAAATAAAATATGAATATATTTTACATGATGAATTTACATAATGGATATATCATTGATATCTGCCAAGTTAACTTCACCTACCATTTTTAAGCTTACTTTGACCATGTTAGTTGGTATTGTGTATATAACGAGTGGGAGGACATTCATACCTGGCATTTGTTTGGTCAAACTGACACAAGATTTATGTTTATTTCAAACCTATATATAAAACAAGTCTCAATGAATATCTTCCTAGGCACAAGACAATGCTGATAAAATGTCTTGTTCAAGGACA

#Going from tab back to fasta!
!awk '{print ">"$1"n"$2}' ../data-results/Geoduck-transcriptome-v3.tab 
> ../data-results/Geoduck-transcriptome-v3.fa

!fgrep -c ">" ../data-results/Geoduck-transcriptome-v3.fa
153982

tldr:

'$c != string`  join -v  awk '{print ">"$1"n"$2}'

One more tidbit- I wanted to see how many blast hits were in the opposite direction "- frame".

thus:

!awk '($10-$9) < 0 {print $1, "t", ($10-$9)}' 
../data-results/Geoduck-tranv2-blastx_sprot.tab 
> analyses/Geoduck-tranv2-minus_direction.tab
!head analyses/Geoduck-tranv2-minus_direction.tab
!wc -l analyses/Geoduck-tranv2-minus_direction.tab

comp95_c0_seq1   -230
comp146_c0_seq1      -173
comp195_c0_seq1      -185
comp296_c0_seq1      -200
comp455_c1_seq1      -197
comp943_c0_seq1      -218
comp1059_c0_seq1     -227
comp1683_c0_seq1     -206
comp1868_c0_seq1     -308
comp1910_c1_seq1     -248
   10413 analyses/Geoduck-tranv2-minus_direction.tab
Share

Sifting Sands

Below is a quick workflow I am using to help Drinan annotate ~1.5 million sequences from an amplicon targeting NGS effort of sand.

 head /Users/sr320/Dropbox/hummingbird-ipython-nbs/data/DanD/meiofauna_forward_sequences.fa
 >M02215:33:000000000-AFA9E:1:1101:14961:2005 1:N:0:15
 TGACTGTGCTAAGGTAGCATAATTAATTGTCTTTTAATTAGAGACTTGTTTGAAAGATTT
 TTTGAATTTAATATAGTTTTAAAATTATAAAAATGAATTTTTATATATTGGTAAAAATAC
 CATGATTTTTTAAAAAGACGATAAGACCCTATCAAGTTTTACTTAAATTTAAAGAAAATT
 TAGGTTTTAATGGGGCATTATTATTTATTTTAAATAAATTTTGATCTTAAATTAAATTTT
 AGGAAATTTAATAAAATTACTGTAGGGATAACAGTGTAATATTTTTTAAAGTTCATATTT
 A
 >M02215:33:000000000-AFA9E:1:1101:11050:2011 1:N:0:15
 TAACTGTGCTAAGGTAGCATAATCACTTGTCTCCTAATTAGAGACTGGCATGAAAGGGTA
 AACTCTTTATAACTTTATAAAGCATACACACTGAAATTTTTATTTAGACGAAGAAATCTA

 

 

 

Within a given working directory I proceeded to (in Jupyter NB)

cp meiofauna_forward_sequences.fa query.fasta – to simply rename.

!/Users/steven/Dropbox/hummingbird-ipython-nbs/script-box/fasta-splitter.pl
--n-parts 20
query.fasta
– to split. This was if failure occurs, simple restart.

a little magic

 %%bash
 for f in query.part*
 do
 blastn 
 -query $f 
 -db /Volumes/Data/blast_db/nt 
 -evalue 1e-5 
 -max_target_seqs 1 
 -max_hsps 1 
 -outfmt "6 std sskingdoms stitle staxids sscinames scomnames sblastnames" 
 -num_threads 14 
 -out blastout_"$f"_nt
 done

 

Spiced it up with output format-

from manual:

 

 Options 6, 7, and 10 can be additionally configured to produce
 a custom format specified by space delimited format specifiers.
 The supported format specifiers are:
 qseqid means Query Seq-id
 qgi means Query GI
 qacc means Query accesion
 qaccver means Query accesion.version
 qlen means Query sequence length
 sseqid means Subject Seq-id
 sallseqid means All subject Seq-id(s), separated by a ';'
 sgi means Subject GI
 sallgi means All subject GIs
 sacc means Subject accession
 saccver means Subject accession.version
 sallacc means All subject accessions
 slen means Subject sequence length
 qstart means Start of alignment in query
 qend means End of alignment in query
 sstart means Start of alignment in subject
 send means End of alignment in subject
 qseq means Aligned part of query sequence
 sseq means Aligned part of subject sequence
 evalue means Expect value
 bitscore means Bit score
 score means Raw score
 length means Alignment length
 pident means Percentage of identical matches
 nident means Number of identical matches
 mismatch means Number of mismatches
 positive means Number of positive-scoring matches
 gapopen means Number of gap openings
 gaps means Total number of gaps
 ppos means Percentage of positive-scoring matches
 frames means Query and subject frames separated by a '/'
 qframe means Query frame
 sframe means Subject frame
 btop means Blast traceback operations (BTOP)
 staxids means unique Subject Taxonomy ID(s), separated by a ';'
 (in numerical order)
 sscinames means unique Subject Scientific Name(s), separated by a ';'
 scomnames means unique Subject Common Name(s), separated by a ';'
 sblastnames means unique Subject Blast Name(s), separated by a ';'
 (in alphabetical order)
 sskingdoms means unique Subject Super Kingdom(s), separated by a ';'
 (in alphabetical order)
 stitle means Subject Title
 salltitles means All Subject Title(s), separated by a '<>'
 sstrand means Subject Strand
 qcovs means Query Coverage Per Subject
 qcovhsp means Query Coverage Per HSP
 When not provided, the default value is:
 'qseqid sseqid pident length mismatch gapopen qstart qend sstart send
 evalue bitscore', which is equivalent to the keyword 'std'

 

 

and sample output:

 

M02215:33:000000000-AFA9E:1:1103:16706:22078 gi|56550013|gb|AY803660.1| 85.62 160 17 6 145 299 30 188 7e-37 163 Eukaryota Hydrothelphusa madagascariensis 18S ribosomal RNA gene, partial sequence 168669 Hydrothelphusa madagascariensis Hydrothelphusa madagascariensis crustaceans
M02215:33:000000000-AFA9E:1:1103:16318:23659 gi|393186538|gb|JX083886.1| 82.50 280 37 11 25 297 1 275 2e-58 235 Eukaryota Microeuraphia sp. 1 MPL-2012 isolate KACb37 16S ribosomal RNA gene, partial sequence; mitochondrial 1204330 Microeuraphia sp. 1 MPL-2012 Microeuraphia sp. 1 MPL-2012 crustaceans
M02215:33:000000000-AFA9E:1:1103:23857:24453 gi|386786433|gb|JQ435298.1| 96.23 53 2 0 248 300 1545 1597 5e-14 87.9 Eukaryota Cf. Traiania sp. DNA106167 voucher MCZ DNA106167 28S ribosomal RNA gene, partial sequence 1183147 cf. Traiania sp. DNA106167 cf. Traiania sp. DNA106167 daddy longlegs
M02215:33:000000000-AFA9E:1:1103:12121:24652 gi|302140702|gb|GQ343306.1| 98.31 295 5 0 6 300 22 316 1e-143 518 Eukaryota Evadne nordmanni isolate E37/5 16S ribosomal RNA gene, partial sequence; mitochondrial 141403 Evadne nordmanni Evadne nordmanni crustaceans
M02215:33:000000000-AFA9E:1:1104:23888:3062 gi|94960351|gb|DQ467789.1| 98.25 57 1 0 1 57 60 116 6e-18 100 Eukaryota Lynceus macleyanus isolate 53 16S ribosomal RNA gene, partial sequence; mitochondrial 381959 Lynceus macleyanus Lynceus macleyanus crustaceans
M02215:33:000000000-AFA9E:1:1104:11382:3077 gi|343168999|gb|JN018352.1| 84.52 310 36 9 1 300 2729 3036 7e-77 296 Eukaryota Ischyropsalis pyrenaea voucher MNHN-JAA33 28S ribosomal RNA gene, partial sequence 1046795 Ischyropsalis pyrenaea Ischyropsalis pyrenaea daddy longlegs
M02215:33:000000000-AFA9E:1:1104:16849:3145 gi|472441035|gb|KC529449.1| 86.77 310 29 12 1 301 558 864 1e-88 335 Eukaryota Microdalyellia nanella isolate W43ss 18S ribosomal RNA gene, partial sequence 1311903 Microdalyellia nanella Microdalyellia nanella flatworms
M02215:33:000000000-AFA9E:1:1104:15899:3944 gi|349592295|gb|JN205453.1| 92.12 292 21 1 12 301 1 292 2e-111 411 N/A Uncultured organism clone KCON28S38 28S ribosomal RNA gene, partial sequence 155900 uncultured organism uncultured organism N/A
M02215:33:000000000-AFA9E:1:1104:14928:4021 gi|46812207|gb|AY569664.1| 87.10 310 23 15 1 301 574 875 1e-88 335 Eukaryota Arenicolides ecaudata 18S ribosomal RNA gene, partial sequence 273060 Arenicolides ecaudata Arenicolides ecaudata segmented worms
M02215:33:000000000-AFA9E:1:1104:19218:4051 gi|301178327|gb|HM799910.1| 96.47 85 2 1 1 85 550 633 1e-29 139 Eukaryota Uncultured marine metazoan clone PRTBE7499 small subunit ribosomal RNA gene, partial sequence 329654 uncultured marine metazoan uncultured marine metazoan animals
M02215:33:000000000-AFA9E:1:1104:19414:4162 gi|295805440|emb|FN389538.1| 95.24 84 3 1 1 84 21 103 2e-27 132 Eukaryota Uncultured Glomus 18S ribosomal RNA gene, isolate, partial sequence. 07ED8BM 231055 uncultured Glomus uncultured Glomus glomeromycetes
M02215:33:000000000-AFA9E:1:1104:18162:4373 gi|342210101|gb|JF277589.1| 90.52 306 21 7 1 299 67 371 2e-107 398 Eukaryota Nemertean sp. 2 SA-2011 voucher MCZ DNA106139 16S ribosomal RNA gene, partial sequence; mitochondrial 947588 Nemertean sp. 2 SA-2011 Nemertean sp. 2 SA-2011 ribbon worms
M02215:33:000000000-AFA9E:1:1104:25946:4764 gi|110535863|gb|DQ665998.1| 84.92 305 37 9 1 301 552 851 5e-78 300 Eukaryota Phagocata vitta 18S ribosomal RNA gene, complete sequence 391283 Phagocata vitta Phagocata vitta flatworms
M02215:33:000000000-AFA9E:1:1104:26451:5022 gi|307647639|gb|HM564573.1| 93.71 302 18 1 1 301 470 771 1e-123 451 Eukaryota Phanodermatidae sp. JCC52 18S ribosomal RNA gene, partial sequence 883396 Phanodermatidae sp. JCC52 Phanodermatidae sp. JCC52 nematodes
M02215:33:000000000-AFA9E:1:1104:14581:5493 gi|472441081|gb|KC529495.1| 86.36 308 33 9 1 301 559 864 2e-86 327 Eukaryota Dochmiotrema limicola isolate UH80.2 18S ribosomal RNA gene, partial sequence 1311982 Dochmiotrema limicola Dochmiotrema limicola flatworms
M02215:33:000000000-AFA9E:1:1104:9527:5655 gi|170672378|gb|EU376009.1| 87.06 309 32 6 1 301 2834 3142 9e-91 342 Eukaryota Craterostigmus tasmanianus 28S ribosomal RNA gene, partial sequence 60162 Craterostigmus tasmanianus Craterostigmus tasmanianus centipedes
M02215:33:000000000-AFA9E:1:1104:15515:5664 gi|373860124|gb|HQ865054.1| 86.75 302 34 6 2 301 7 304 2e-87 331 Eukaryota Uncultured eukaryote clone SGPX651 18S ribosomal RNA gene, partial sequence 100272 uncultured eukaryote uncultured eukaryote eukaryotes
M02215:33:000000000-AFA9E:1:1104:25827:5676 gi|154101573|gb|EF552055.1| 99.34 301 2 0 1 301 68 368 6e-152 545 Eukaryota Balanus glandula isolate 2 16S ribosomal RNA gene, partial sequence; mitochondrial 110520 Balanus glandula Balanus glandula crustaceans
M02215:33:000000000-AFA9E:1:1104:22156:5763 gi|157787506|gb|EF990727.1| 89.11 303 27 4 1 300 2446 2745 1e-99 372 Eukaryota Rhabditoides inermiformis strain SB328 28S large subunit ribosomal RNA gene, partial sequence 96653 Rhabditoides inermiformis Rhabditoides inermiformis nematodes
M02215:33:000000000-AFA9E:1:1104:14633:5939 gi|213959396|gb|FJ426630.1| 78.43 306 53 12 1 301 47 344 4e-44 187 Eukaryota Brachionus angularis voucher S. H. Cheng 001 16S ribosomal RNA gene, partial sequence; mitochondrial 396692 Brachionus angularis Brachionus angularis rotifers
M02215:33:000000000-AFA9E:1:1104:10324:6174 gi|386696111|gb|JQ000284.1| 79.22 308 52 11 2 301 537 840 4e-49 204 Eukaryota Xolalgidae gen. sp. AD1204 18S ribosomal RNA gene, partial sequence 1111400 Xolalgidae gen. sp. AD1204 Xolalgidae gen. sp. AD1204 mites &amp; ticks
M02215:33:000000000-AFA9E:1:1104:13062:6188 gi|74100231|gb|DQ186202.1| 87.97 291 25 7 1 284 13348 13635 1e-88 335 Eukaryota Thalassiosira pseudonana mitochondrion, complete genome 35128 Thalassiosira pseudonana Thalassiosira pseudonana diatoms
M02215:33:000000000-AFA9E:1:1104:13172:6214 gi|109390598|emb|AM039747.1| 82.95 305 39 13 1 298 2492 2790 7e-67 263 Eukaryota Heligmosomoides polygyrus 28S rRNA gene 6339 Heligmosomoides polygyrus Heligmosomoides polygyrus nematodes

Will end up cating 20 output files once done.

Share

BLAST – C.gigas Larvae OA Illumina Data Against GenBank nt DB

In an attempt to figure out what’s going on with the Illumina data we recently received for these samples, I BLASTed the 400ppm data set that had previously been de-novo assembled by Steven: EmmaBS400.fa.

Jupyter (IPython) Notebook : 20150501_Cgigas_larvae_OA_BLASTn_nt.ipynb

Notebook Viewer : 20150501_Cgigas_larvae_OA_BLASTn_nt

Results:

BLASTn Output File: 20150501_nt_blastn.tab

BLAST e-vals <= 0.001: 20150501_Cgigas_larvae_OA_blastn_evals_0.001.txt

Unique BLAST Species: 20150501_Cgigas_larvae_OA_unique_blastn_evals.txt

 

Firstly, since this library was bisulfite converted, we know that matching won’t be as robust as we’d normally see.

However, the BLAST matches for this are terrible.

Only 0.65% of the BLAST matches (e-value <0.001) are to Crassostrea gigas. Yep, you read that correctly: 0.65%.

It’s nearly 40-fold less than the top species: Dictyostelium discoideum (a slime mold)

It’s 30-fold less than the next species: Danio rerio (zebra fish)

Then it’s followed up by human and mouse.

I think I will need to contact the Univ. of Oregon sequencing facility to see what their thoughts on this data is, because it’s not even remotely close to what we should be seeing, even with the bisulfite conversion…

Share

BLASTN – C.gigas OA Larvae to C.gigas Ensembl 1.24 BLAST DB

In an attempt to figure out what’s going on with the Illumina data we recently received for these samples, I BLASTed the 400ppm data set that had previously been de-novo assembled by Steven: EmmaBS400.fa.

I also created a nucleotide BLAST database (DB) from the Crassostrea_gigas.GCA_000297895.1.24.fa

Jupyter (IPython) Notebook: 20150429_Gigas_larvae_OA_BLASTn.ipynb

Notebook Viewer: 20150429_Gigas_larvae_OA_BLASTn

 

Results:

The results are not great.

All query contigs successfully BLAST to sequences in the C.gigas Ensembl BLAST DB. However, only 33 of the sequences (out of ~37,000) have an e-value of 0.0. The next best e-value for any matches is 0.001. For the uninitiated, that value is not very good, especially when you’re BLASTing against the same exact species DB.

Will BLASTn the C.gigas contigs against the entire GenBank nt (all nucleotides) to see what the taxonomy breakdown looks like of these sequences.

Share

Annotation 101

DIRECT LINK to nbviewer

http://nbviewer.ipython.org/github/sr320/eimd/blob/master/ipynbs/c_Annotation.ipynb


 

Annotation (Blast)

In [1]:
#Contigs from Assembly
!head /Volumes/web/cnidarian/SeaStar_transc_v2.fa
>3291_5903_10007_H94MGADXX_V_CF71_ATCACG_R1_(paired)_trimmed_(paired)_contig_1

CAAATATATGAACGGTTGATTGTCAACGATTAGTACATGTTTTCATTGTTCCCCACGCCC

GCCCCCCCCCACTCAAACATTTAAAGTGTGAAATATTATTTATCCACAAATTTCCTTAAA

CCTGCAAACTTGTCTGCTGTCTCTTATTGGAAGTTATGAAAAAGAACAACGGGTTTTCTT

TAAAGGGTCTGCGTGCGATTTTCAACCTTTTGAGTAATAGCAGTTATTTTGATAACCGAT

TTTTTTCAAAGCTCAACAGCTTTTTAAAATAAGGAATCCTATAATGGCCAAACGAATACT

ATAAAAATAAGGGTTCTCTTAATTGTATAAAACGTATAATTTTATCAATTTTGGGACCGT

GTAATTTTTTAAAGACCACAAGAATGTTACATACAACAAATAGACGAAACTCGTAGCTTT

GGAAACTACGTCATGGGCGTTTGGTCAAAAGCTGGAGAGAAAGAGAGGTGGGGTGCCAGA

CTTAAGTAGTCACGTGATCTGACCAACGCACATCGGAAGCTCGATCGGATGAAATCTTCT


In [2]:
#Making the query look better
!sed 's/3291_5903_10007_H94MGADXX_V_CF71_ATCACG_R1_(paired)_trimmed_(paired)/Phel_clc/g' </Volumes/web/cnidarian/SeaStar_transc_v2.fa> /Users/sr320/Dropbox/Steven/eimd/data/Phel_transcriptome_clc.fa
In [3]:
!head /Users/sr320/Dropbox/Steven/eimd/data/Phel_transcriptome_clc.fa
>Phel_clc_contig_1

CAAATATATGAACGGTTGATTGTCAACGATTAGTACATGTTTTCATTGTTCCCCACGCCC

GCCCCCCCCCACTCAAACATTTAAAGTGTGAAATATTATTTATCCACAAATTTCCTTAAA

CCTGCAAACTTGTCTGCTGTCTCTTATTGGAAGTTATGAAAAAGAACAACGGGTTTTCTT

TAAAGGGTCTGCGTGCGATTTTCAACCTTTTGAGTAATAGCAGTTATTTTGATAACCGAT

TTTTTTCAAAGCTCAACAGCTTTTTAAAATAAGGAATCCTATAATGGCCAAACGAATACT

ATAAAAATAAGGGTTCTCTTAATTGTATAAAACGTATAATTTTATCAATTTTGGGACCGT

GTAATTTTTTAAAGACCACAAGAATGTTACATACAACAAATAGACGAAACTCGTAGCTTT

GGAAACTACGTCATGGGCGTTTGGTCAAAAGCTGGAGAGAAAGAGAGGTGGGGTGCCAGA

CTTAAGTAGTCACGTGATCTGACCAACGCACATCGGAAGCTCGATCGGATGAAATCTTCT


In [4]:
!fgrep -c ">" /Users/sr320/Dropbox/Steven/eimd/data/Phel_transcriptome_clc.fa
30578


In []:
#IMPORTANT the location of files will change depending on your computer
blastx 
-query /Users/sr320/Dropbox/Steven/eimd/data/Phel_transcriptome_clc.fa 
-db /Users/Shared/data/blast/db/uniprot_sprot 
-out /Users/sr320/Dropbox/Steven/eimd/data/Phel_clc_blastx_uniprot_sprot_1.tab 
-outfmt 6 
-num_threads 8 
-max_hsps 1 
-max_target_seqs 1 
-evalue 1E-20
In [11]:
!head /Users/sr320/Dropbox/Steven/eimd/data/Phel_clc_blastx_uniprot_sprot_1.tab
Phel_clc_contig_4	sp|P25001|COX1_PISOC	88.39	517	48	1	7061	5547	1	517	0.0	  749

Phel_clc_contig_7	sp|Q33818|CYB_ASTPE	79.94	329	66	0	993	7	51	379	9e-168	  479

Phel_clc_contig_8	sp|P68037|UB2L3_MOUSE	76.97	152	35	0	4862	4407	1	152	7e-62	  214

Phel_clc_contig_9	sp|Q0MVN8|QOR_PIG	45.61	239	129	1	796	80	90	327	5e-63	  210

Phel_clc_contig_17	sp|Q6DGL8|RT15_DANRE	35.00	180	107	3	1177	638	61	230	5e-22	99.8

Phel_clc_contig_18	sp|P96202|PPSC_MYCTU	30.81	714	438	15	5407	3386	1414	2111	6e-76	  286

Phel_clc_contig_20	sp|P46058|EDSP_CYNPY	31.03	348	218	8	1731	703	4	334	5e-38	  148

Phel_clc_contig_22	sp|Q96LI5|CNO6L_HUMAN	60.78	357	128	5	1887	832	186	535	2e-149	  450

Phel_clc_contig_24	sp|P63245|GBLP_RAT	80.77	312	60	0	1032	97	1	312	0.0	  540

Phel_clc_contig_25	sp|Q9QYP1|LRP4_RAT	32.74	623	393	7	1816	5	393	1008	1e-99	  339


In [12]:
!tr '|' "t" </Users/sr320/Dropbox/Steven/eimd/data/Phel_clc_blastx_uniprot_sprot_1.tab> /Users/sr320/Dropbox/Steven/eimd/data/Phel_clc_blastx_uniprot_sprot_sqlready.tab
In [13]:
!head /Users/sr320/Dropbox/Steven/eimd/data/Phel_clc_blastx_uniprot_sprot_sqlready.tab
Phel_clc_contig_4	sp	P25001	COX1_PISOC	88.39	517	48	1	7061	5547	1	517	0.0	  749

Phel_clc_contig_7	sp	Q33818	CYB_ASTPE	79.94	329	66	0	993	7	51	379	9e-168	  479

Phel_clc_contig_8	sp	P68037	UB2L3_MOUSE	76.97	152	35	0	4862	4407	1	152	7e-62	  214

Phel_clc_contig_9	sp	Q0MVN8	QOR_PIG	45.61	239	129	1	796	80	90	327	5e-63	  210

Phel_clc_contig_17	sp	Q6DGL8	RT15_DANRE	35.00	180	107	3	1177	638	61	230	5e-22	99.8

Phel_clc_contig_18	sp	P96202	PPSC_MYCTU	30.81	714	438	15	5407	3386	1414	2111	6e-76	  286

Phel_clc_contig_20	sp	P46058	EDSP_CYNPY	31.03	348	218	8	1731	703	4	334	5e-38	  148

Phel_clc_contig_22	sp	Q96LI5	CNO6L_HUMAN	60.78	357	128	5	1887	832	186	535	2e-149	  450

Phel_clc_contig_24	sp	P63245	GBLP_RAT	80.77	312	60	0	1032	97	1	312	0.0	  540

Phel_clc_contig_25	sp	Q9QYP1	LRP4_RAT	32.74	623	393	7	1816	5	393	1008	1e-99	  339


In []:

 

Share

Sequencing – PGS Hi 4 (PGS2/COX2)

Sent plasmid prep to ASU (5uL of plasmid + 1uL of 10uM M13F/R). SJW01 = M13F, SJW02 = M13R.

Results:

Sequencing looks great! Definitely have a portion of the second isoform of COX/PGS!! Here’s the result of the consensus BLASTed in GenBank>Nucleotide (others)>blastn:

Top hit in the db is COX1/PGS1, and, clearly, there are differences between the two sequences confirming that we have the second isoform (COX2/PGS2). Will design more RACE primers in hopes of obtaining the full-length cDNA sequence.

Share

qPCR – Emma’s New 3KDSqPCR Primers

Due to previous contamination issues with Emma’s primers, Emma asked me to order new primers, reconstitute them and run a qPCR for her to see if we could eliminate her contamination issues with this primer set. cDNA template was supplied by Emma (from 2/2/11) and was from a C.gigas 3hr Vibrio vulnificus challenge. Samples were run in duplicate, as requested. Master mix calcs are here. Plate layout, cycling params, etc. can be found in the qPCR Report (see Results). Primer set used was:

Cg_3KDSqPCR_F/R (SR IDs: 1186, 1187)

Results:

qPCR Data File (BioRad CFX96)

qPCR Report (PDF)

The negative controls (NTC) are negative, meaning they do not cross the threshold set by the BioRad software. However, there is clearly amplification in the NTCs, but they come up late enough that they do not cross the threshold and, thus, generate a Cq value. Additionally, the melt curve reveals peaks in the NTCs that are at the same melting temperature as the product produced in the cDNA qPCR reactions. This would potentially imply some sort of contamination, as Emma has experienced.

Honestly, I do no think contamination is the problem. I believe that the “contamination” being seen in the NTCs is actually primer dimer. Increasing the annealing temperature (I’m not sure if Emma tried this during her troubleshooting) could potentially alleviate this issue. However, I’m not sure she’s amplifying the target that she wants to. Based on my analysis, I think she needs to re-design primers for her 3KDS target. Read my analysis and why I came to this conclusion below.

It seems unlikely that two independent people (and multiple primer stock replacements!) would have contamination, so I looked in to things a bit further.

I BLASTed the primer sets (NCBI, blastn, est_others db, C.gigas only) and the BLAST results reveal the primers matching with a C.gigas EST sequence that would produce a band of only 63bp. Here’s a screen capture of the BLAST results:

This result does NOT agree with what is entered in our Primer Database. As entered in our sheet, the expected PCR product would be ~102bp. However, taking in to account the BLAST results, it would be difficult to distinguish the difference between primer dimers and PCR product in a melt curve analysis.

Emma has previously run a conventional PCR with these primers and ran a gel (see below). At the time, it was thought to be contamination, but in retrospect (knowing the results of the qPCRs and the BLAST results) it seems likely that what she’s seeing in the negative controls was actually primer dimer, which was the same size of her PCR product (which she thought should be larger). Additionally, the gel was difficult to interpret because no ladder was run. A ladder might have revealed that her PCR product was half the size that she was expecting:

Share