Two recent tasks, including some R code!

In addition to hashing out the writing plan and starting the writing itself, which has been a terrific group effort, the past couple days have consisted of two main tasks.

First, I went through the transcriptome that had been blasted against the nucleotide database to remove any bacterial sequences with an e-value of less than 1e-200. We had agreed that these sequences would most likely represent contamination as the sequencing method targeted mRNA with polyA tails, which would not include bacteria. It turns out that there was an e-value threshold around 1e-179 before all the remaining, lowest e-values were zeros. I removed 1102 sequences with the following R code:

phel5<-read.csv(“Phel_clc_blastn_nt_edit.csv”) # this is the nucleotide blasted transcriptome that I edited to contain fewer columsn – the contig names are the key result from this code anyway so it’s fine
head(phel5)
length(phel5$e) # 10768 total
phel5$tax
summary(phel5$tax)
levels(phel5$tax)
hist(phel5$e)

# How many to take out? 1102 – same as ipython. This code makes a data frame of the REMOVED contigs.
my.data.frame2 <- subset(phel5, tax == “Bacteria” & e < 1e-200)
length(my.data.frame2$tax)
my.data.frame2$e

write.csv(my.data.frame2,file=”Phel_clc_blastn_nt_removed.csv”)

# This code makes a data frame of the final file (final data frame) without the REMOVED values: all those that aren’t bacteria + those that are bacteria but have e > 1e-200
my.data.frame <- subset(phel5, tax !=(“Bacteria”))
summary(my.data.frame$tax)
my.data.frame3 <- subset(phel5, tax ==”Bacteria” & e > 1e-200)
my.data.frame4<-rbind(my.data.frame,my.data.frame3)
length(my.data.frame4$tax) # 9666

#length of #4 should = 10768-1102 = 9666
10768-1102

# Write final file to directory
write.csv(my.data.frame4,file=”Phel_clc_blastn_nt_nobact.csv”)

Steven then checked the fasta file to see whether this process had removed the unusually large contains, which was mostly the case. Ultimately, this process led to removing approximately 1000 contains from the transcriptome file of ~ 30,000 contains that we have been using as the basis of our analyses.

  • This will affect our characterization of the transcriptome (e.g. species distributions, gene annotations), which is good timing as we’re working on that now
  • It will not affect the enrichment analyses based on DEGs with SPIDs as none of the bacterial sequences removed belonged to the SPID-annotated DEGs

The second major thing I’ve been working on is sorting through the differentially expressed genes in the enriched processes that map to the Toll pathway. This has been a good way to learn a lot more about the details of the Toll pathway, and above all the variability between species as well as the gaps in the knowledge!

Leave a Reply