Tag Archives: trinity

Transcriptome Assembly – Olympia oyster RNAseq Data with Trinity

Used all of our current oly RNAseq data to assemble a transcriptome using Trinity.

Trinity was run on our Mox HPC node.

Reads were trimmed using the built-in version of Trimmomatic with the default settings.

SBATCH script:

Despite the naming conventions, this job was submitted to the Mox scheduler on 20180912 and finished on 20180913.

After job completion, the entire folder was gzipped, using an interactive node (the following method of gzipping is SUPER fast, btw):

tar -c 20180827_trinity_oly_RNAseq | pigz > 20180827_trinity_oly_RNAseq.tar.gz
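The tar-plus-pigz approach above can be wrapped into a small helper that also checks the resulting archive. This is only a sketch: the function names `compress_dir` and `verify_archive` are hypothetical, the fallback to plain gzip when pigz is missing is an assumption added for portability, and `-f -` is spelled out so tar writes to stdout regardless of its compiled-in default.

```shell
#!/bin/sh
# Sketch only: compress_dir and verify_archive are hypothetical helper names.

compress_dir() {
  # $1 = directory to archive; writes <dir>.tar.gz next to it
  dir="$1"
  compressor="gzip"
  # Prefer pigz (parallel gzip) when it is installed; otherwise fall back to gzip
  if command -v pigz >/dev/null 2>&1; then
    compressor="pigz"
  fi
  # tar streams the directory to stdout; the compressor writes the .tar.gz
  tar -cf - "$dir" | "$compressor" > "${dir}.tar.gz"
}

verify_archive() {
  # $1 = .tar.gz file; checks gzip integrity, then that the tar stream can be listed
  gzip -t "$1" && gzip -dc "$1" | tar -tf - > /dev/null
}
```

For example, `compress_dir 20180827_trinity_oly_RNAseq && verify_archive 20180827_trinity_oly_RNAseq.tar.gz` would archive the run folder and only report success if the archive is readable.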

RESULTS:

Output folder:

Trinity assembly (FastA):

Next up, I’ll follow up on this GitHub issue and get some bedgraphs generated.


Transcriptome Assembly – Geoduck RNAseq data

Used all of our current geoduck RNAseq data to assemble a transcriptome using Trinity.

Trinity was run on our Mox HPC node. Specifically, I had to use just a single node with 500GB of RAM; Trinity could not run with much less than that. Initially, I attempted to run with two nodes, but our smaller node (120GB) ended up limiting the available RAM: the system only uses the RAM available on the smallest node, and it cannot combine RAM or dynamically allocate computing to a node with more RAM when needed. As a result, Trinity consistently crashed due to memory limitations.

Reads were trimmed using the built-in version of Trimmomatic with the default settings.

SBATCH script:

Due to the huge number of input files, I won’t post the entire script contents here. Instead, here’s a snippet of the script showing the commands used to start the Trinity run:


#!/bin/bash
## Job Name
#SBATCH --job-name=20180829_trinity
## Allocation Definition 
#SBATCH --account=srlab
#SBATCH --partition=srlab
## Resources
## Nodes
#SBATCH --nodes=1
## Walltime (days-hours:minutes:seconds format)
#SBATCH --time=30-00:00:00
## Memory per node
#SBATCH --mem=500G
##turn on e-mail notification
#SBATCH --mail-type=ALL
#SBATCH --mail-user=samwhite@uw.edu
## Specify the working directory for this job
#SBATCH --workdir=/gscratch/scrubbed/samwhite/20180827_trinity_geoduck_RNAseq

# Load Python Mox module for Python module availability

module load intel-python3_2017

# Document programs in PATH (primarily for program version ID)

date >> system_path.log
echo "" >> system_path.log
echo "System PATH for $SLURM_JOB_ID" >> system_path.log
echo "" >> system_path.log
printf "%0.s-" {1..10} >> system_path.log
echo "${PATH}" | tr : '\n' >> system_path.log


# Run Trinity (the input read file arguments are omitted from this snippet)
/gscratch/srlab/programs/trinityrnaseq-Trinity-v2.8.3/Trinity \
--trimmomatic \
--seqType fq \
--max_memory 500G \
--CPU 28

Despite the naming conventions, this job was submitted to the Mox scheduler on 20180829 and finished on 20180901.

After job completion, the entire folder was gzipped (the following method of gzipping is SUPER fast, btw):

tar -c 20180827_trinity_geoduck_RNAseq | pigz > 20180827_trinity_geoduck_RNAseq.tar.gz

RESULTS:

Output folder:

Trinity assembly (FastA):

Next up, I’ll get some annotations going by running through TransDecoder and blastx.


Second look at Geoduck transcriptome

Last week I popped out a quick assembly and annotation on our geoduck gonadal transcriptome. A second assembly was also done using Trinity.


Updates
August 3 – Confirmed // in file location had no impact on assembly.
July 14 – TransDecoder protein annotations
10:40am – added TransDecoder results
10:29am – added Stats via Trinity


Trinity.pl \
--seqType fq \
-JM 24G \
--left /Volumes/web/cnidarian/Geo_Pool_F_GGCTAC_L006_R1_001_val_1.fq /Volumes/web/cnidarian/Geo_Pool_M_CTTGTA_L006_R1_001_val_1.fq \
--right /Volumes/web/cnidarian//Geo_Pool_F_GGCTAC_L006_R2_001_val_2.fq /Volumes/web/cnidarian//Geo_Pool_M_CTTGTA_L006_R2_001_val_2.fq \
--CPU 16

[Image: trinity_out_dir_1B54203C.png]

Output

0:999   127840
1000:1999   18164
2000:2999   5321
3000:3999   1817
4000:4999   762
5000:5999   291
6000:6999   135
7000:7999   73
8000:8999   22
9000:9999   29
10000:10999     4
11000:11999     5
12000:12999     3
13000:13999     4
14000:14999     4
15000:15999     3
16000:16999     0
17000:17999     2
18000:18999     1

Total length of sequence:   101862868 bp
Total number of sequences:  154480
N25 stats:          25% of total sequence length is contained in the 8095 sequences >= 2045 bp
N50 stats:          50% of total sequence length is contained in the 26158 sequences >= 1014 bp
N75 stats:          75% of total sequence length is contained in the 64574 sequences >= 446 bp
Total GC count:         37657770 bp
GC %:               36.97 %
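The length histogram above bins transcripts into 1000 bp intervals. A rough way to reproduce that kind of table directly from a FASTA file is sketched below; the function name `length_histogram` and its optional bin-size argument are hypothetical, and this is not the tool that produced the output shown here.

```shell
#!/bin/sh
# Sketch only: length_histogram is a hypothetical helper, not the original tool.
# Usage: length_histogram <fasta> [bin_size]   (bin_size defaults to 1000)

length_histogram() {
  awk -v bin="${2:-1000}" '
    # Record the length of the sequence just finished, then reset the counter
    function record_len() { if (len > 0) counts[int(len / bin)]++; len = 0 }
    /^>/ { record_len(); next }      # header line: close out the previous sequence
    { len += length($0) }            # sequence line: accumulate its length
    END {
      record_len()
      max = 0
      for (b in counts) if (b + 0 > max) max = b + 0
      # Print every bin up to the largest one, including empty bins
      for (b = 0; b <= max; b++)
        printf "%d:%d\t%d\n", b * bin, (b + 1) * bin - 1, counts[b] + 0
    }' "$1"
}
```

With the default bin size, a 999 bp transcript lands in the `0:999` bin and a 1000 bp transcript in `1000:1999`, matching the layout of the table above.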
hummingbird:Geo-trinity steven$ /Users/gilesg/compile/trinityrnaseq_r20131110/util/TrinityStats.pl /Volumes/web/cnidarian/Geo-trinity/trinity_out_dir/Trinity.fasta 


################################
## Counts of transcripts, etc.
################################
Total trinity transcripts:  154480
Total trinity components:   100155
Percent GC: 36.97

########################################
Stats based on ALL transcript contigs:
########################################

    Contig N10: 3444
    Contig N20: 2385
    Contig N30: 1766
    Contig N40: 1343
    Contig N50: 1014

    Median contig length: 371
    Average contig: 659.39
    Total assembled bases: 101862868


#####################################################
## Stats based on ONLY LONGEST ISOFORM per COMPONENT:
#####################################################

    Contig N10: 2999
    Contig N20: 2026
    Contig N30: 1462
    Contig N40: 1067
    Contig N50: 768

    Median contig length: 321
    Average contig: 553.88
    Total assembled bases: 55473621
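For reference, the contig N50 reported above is the length at which the longest contigs, taken in descending order, first cover at least half of the total assembled bases. The sketch below shows that arithmetic with awk and sort; the function name `n50` is hypothetical and this is not how TrinityStats.pl is implemented, just the same definition applied to a FASTA file.

```shell
#!/bin/sh
# Sketch only: n50 is a hypothetical helper illustrating the N50 definition.
# Usage: n50 <fasta>

n50() {
  # Step 1: print one length per sequence (multi-line sequences are summed)
  awk '/^>/ { if (len) print len; len = 0; next }
       { len += length($0) }
       END { if (len) print len }' "$1" |
    # Step 2: sort lengths from longest to shortest
    sort -rn |
    # Step 3: walk down the list until the running sum reaches half the total
    awk '{ lens[NR] = $1; total += $1 }
         END {
           half = total / 2
           for (i = 1; i <= NR; i++) {
             run += lens[i]
             if (run >= half) { print lens[i]; exit }
           }
         }'
}
```

Running `n50 Trinity.fasta` on an assembly prints a single number that should be comparable to the "Contig N50" line in the ALL transcript contigs section.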

Rerunning the assembly to see whether the double slash (//) in the --right file paths was a problem; no errors were reported in the first run. Also running TransDecoder.


TransDecoder Results

Ran the following

/Users/gilesg/compile/trinityrnaseq_r20131110/trinity-plugins/TransDecoder_r20131110/TransDecoder -t  /Volumes/web/cnidarian/Geo-trinity/trinity_out_dir/Trinity.fasta

This provided a peptide file with 36003 sequences.

!head /Volumes/web-1/cnidarian/Geo-trinity/Trinity.fasta.transdecoder.pep

>cds.comp100047_c0_seq2|m.5982 comp100047_c0_seq2|g.5982 ORF comp100047_c0_seq2|g.5982 comp100047_c0_seq2|m.5982 type:internal len:142 (-) comp100047_c0_seq2:3-425(-)
NAECRDLYKIFTQILSVRSQEGKIVIPDEFATKIRNWLGNKEELFKEAHNQKIITFYNEY
TREENTFNPIRGKRPMSVPDMPERKYIDQLSRKTQSQCDFCKYKTFTAEDTFGRIDSNFS
CSASNAFKLDHWHALFLLKTH


Running blastp on Trinity.fasta.transdecoder.pep

!blastp \
-query /Volumes/web/cnidarian/Geo-trinity/Trinity.fasta.transdecoder.pep \
-db /usr/local/bioinformatics/dbs/uniprot_sprot.fasta \
-evalue 1e-5 \
-max_target_seqs 1 \
-max_hsps 1 \
-outfmt 6 \
-num_threads 4 \
-out /Volumes/web/cnidarian/Geo-trinity/Trinity.fasta.transdecoder.pep-blastp-uniprot-2.out

results: http://eagle.fish.washington.edu/cnidarian/Geo-trinity/Trinity.fasta.transdecoder.pep-blastp-uniprot-2.out
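Because the blastp output uses `-outfmt 6` (tab-delimited, with the query ID in the first column), a quick way to see how many peptides received a hit is to count distinct query IDs. This is a minimal sketch; the function name `count_queries_with_hits` is hypothetical.

```shell
#!/bin/sh
# Sketch only: count_queries_with_hits is a hypothetical helper.
# Usage: count_queries_with_hits <blast_outfmt6_file>

count_queries_with_hits() {
  # Column 1 of -outfmt 6 is the query sequence ID.
  # sort -u collapses repeated IDs, so a query is counted once
  # even if the file holds more than one line for it.
  cut -f1 "$1" | sort -u | wc -l | tr -d ' '
}
```

Dividing that count by the 36003 sequences in the peptide file would give the fraction of TransDecoder peptides with a UniProt/Swiss-Prot hit at the chosen e-value cutoff.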


Transcriptome Assembly

 

Trinity

################################
## Counts of transcripts, etc.
################################
Total trinity 'genes':  117729
Total trinity transcripts:  145222
Percent GC: 40.87

########################################
Stats based on ALL transcript contigs:
########################################

    Contig N10: 3392
    Contig N20: 2262
    Contig N30: 1685
    Contig N40: 1268
    Contig N50: 946

    Median contig length: 335
    Average contig: 617.51
    Total assembled bases: 89676218


#####################################################
## Stats based on ONLY LONGEST ISOFORM per 'GENE':
#####################################################

    Contig N10: 3252
    Contig N20: 2047
    Contig N30: 1465
    Contig N40: 1065
    Contig N50: 767

    Median contig length: 305
    Average contig: 547.96
    Total assembled bases: 64511364


#!/bin/bash

TRIN="/home/ggoetz/compile/trinityrnaseq_r20140413p1"

export PATH=~/compile/rsem-1.2.3/sam:${PATH}

${TRIN}/Trinity \
    --seqType fq \
    --JM 44G \
    --left left.fq \
    --right right.fq \
    --CPU 6 \
    --normalize_reads \
    --min_kmer_cov 2 \
    --quality_trimming_params \
    "LEADING:5 TRAILING:5 SLIDINGWINDOW:4:15 MINLENGTH:36"
In [3]:
!head //Volumes/web/cnidarian/SeaStar/trinity_assemblies/run1/Trinity.fasta
>c2_g1_i1 len=233 path=[20:0-32 20:33-65 20:66-98 52:99-232]
TCTTGGTCTTGGACGTGGACTTGCTGGTCTTGGTCTTGGTCTTGGACGTGGACTTGCTGG
TCTTGGTCTTGGTCTTGGACGTGGACTTGCTGGTCTTGGTCTTGTTCTTGGTCTTGTTCT
TTGTCTTGTTCTTGTTCTTGTTTATGTCCTTGTTTAGGGTTGTTGTTGGGTTTGTTGCTG
TGTTTTGGCGGGTTGTTGTTGTTTTGGGGGTTTTGGTTGTTTGTTTGTTTGTG
>c108_g1_i1 len=239 path=[1:0-131 133:132-238]
CACTTCGTATATGCTTTATAGACTTCTTGTACGATGTAAAACTCAGACTTTTAAAATCTT
TTCTCATTTTTTGTAAAACTTTATAGAATAATTTTTTCTCTCTTGGGATATATCTACACT
TTCAACTTGCTTAAAAAAAATATAGATAGTGTATGGTGTATGGAGGATTGTGTATTTCAC
ATGTGAGGTACTGTGTTACTAAATTTAGTTGTCGTGACAGAGAGAGGAACAGAGCAGGG


In [5]:
!fgrep -c ">" /Volumes/web/cnidarian/SeaStar/trinity_assemblies/run1/Trinity.fasta
145222


In [6]:
!fgrep -c ">" /Volumes/web/cnidarian/SeaStar/trinity_assemblies/run2/Trinity.Cufffly.fasta
160038


CLC

Trimming

[Image: greenbird_1977290F.png]

de novo assembly

[Image: greenbird_19772A25.png]

summary stats

[Image: greenbird_19772955.png]

In [8]:
!head /Volumes/web/cnidarian/SeaStar_transc_v2.fa
>3291_5903_10007_H94MGADXX_V_CF71_ATCACG_R1_(paired)_trimmed_(paired)_contig_1
CAAATATATGAACGGTTGATTGTCAACGATTAGTACATGTTTTCATTGTTCCCCACGCCC
GCCCCCCCCCACTCAAACATTTAAAGTGTGAAATATTATTTATCCACAAATTTCCTTAAA
CCTGCAAACTTGTCTGCTGTCTCTTATTGGAAGTTATGAAAAAGAACAACGGGTTTTCTT
TAAAGGGTCTGCGTGCGATTTTCAACCTTTTGAGTAATAGCAGTTATTTTGATAACCGAT
TTTTTTCAAAGCTCAACAGCTTTTTAAAATAAGGAATCCTATAATGGCCAAACGAATACT
ATAAAAATAAGGGTTCTCTTAATTGTATAAAACGTATAATTTTATCAATTTTGGGACCGT
GTAATTTTTTAAAGACCACAAGAATGTTACATACAACAAATAGACGAAACTCGTAGCTTT
GGAAACTACGTCATGGGCGTTTGGTCAAAAGCTGGAGAGAAAGAGAGGTGGGGTGCCAGA
CTTAAGTAGTCACGTGATCTGACCAACGCACATCGGAAGCTCGATCGGATGAAATCTTCT


In [9]:
!fgrep -c ">" /Volumes/web/cnidarian/SeaStar_transc_v2.fa
30578

