Tag Archives: transcriptome

Bedgraph – Olympia oyster transcriptome with Olurida_v081 genome assembly

I took the sorted BAM file from yesterday’s corrected RNAseq genome alignment and converted it to a bedgraph using BEDTools genomeCoverageBed tool.

Analysis took place on our HPC Mox node.

SBATCH script file:


#!/bin/bash
## Job Name
#SBATCH --job-name=20180926_oly_bedgraphs
## Allocation Definition 
#SBATCH --account=srlab
#SBATCH --partition=srlab
## Resources
## Nodes
#SBATCH --nodes=1
## Walltime (days-hours:minutes:seconds format)
#SBATCH --time=5-00:00:00
## Memory per node
#SBATCH --mem=500G
##turn on e-mail notification
#SBATCH --mail-type=ALL
#SBATCH --mail-user=samwhite@uw.edu
## Specify the working directory for this job
#SBATCH --workdir=/gscratch/scrubbed/samwhite/20180926_oly_RNAseq_bedgraphs

# Load Python Mox module for Python module availability

module load intel-python3_2017

# Document programs in PATH (primarily for program version ID)

date >> system_path.log
echo "" >> system_path.log
echo "System PATH for $SLURM_JOB_ID" >> system_path.log
echo "" >> system_path.log
printf "%0.s-" {1..10} >> system_path.log
echo ${PATH} | tr : \n >> system_path.log

# Set sorted transcriptome assembly bam file
oly_transcriptome_bam=/gscratch/scrubbed/samwhite/20180925_oly_RNAseq_genome_hisat2/20180925_Olurida_v081.sorted.bam


# Set program paths
bedtools=/gscratch/srlab/programs/bedtools-2.27.1/bin
samtools=/gscratch/srlab/programs/samtools-1.9/samtools


# Create bedgraph
## Reports depth at each position (-bg in bedgraph format) and report regions with zero coverage (-a).
## Screens for portions of reads coming from exons (-split).
## Add genome browser track line to header of bedgraph file.
${bedtools}/genomeCoverageBed 
-ibam ${oly_transcriptome_bam} 
-bga 
-split 
-trackline 
> 20180926_oly_RNAseq.bedgraph

RESULTS

Output folder:

Bedgraph file (1.2GB):

Loaded in to IGV to verify things looked OK:

Share

Bedgraph – Olympia oyster transcriptome (FAIL)

Progress on generating bedgraphs from our Olympia oyster transcriptome continues.

Transcriptome assembly with Trinity completed 20180919.

Then, aligned the assembled transcriptome to our genome using Bowtie2.

Finally, I used BEDTools to convert the BAM to BED to bedgraph.

This required an initial indexing of our Olympia oyster genome FastA using samtools faidx tool.

SBATCH script file:


#!/bin/bash
## Job Name
#SBATCH --job-name=20180924_oly_bedgraphs
## Allocation Definition 
#SBATCH --account=srlab
#SBATCH --partition=srlab
## Resources
## Nodes
#SBATCH --nodes=1
## Walltime (days-hours:minutes:seconds format)
#SBATCH --time=5-00:00:00
## Memory per node
#SBATCH --mem=500G
##turn on e-mail notification
#SBATCH --mail-type=ALL
#SBATCH --mail-user=samwhite@uw.edu
## Specify the working directory for this job
#SBATCH --workdir=/gscratch/scrubbed/samwhite/20180924_oly_RNAseq_bedgraphs

# Load Python Mox module for Python module availability

module load intel-python3_2017

# Document programs in PATH (primarily for program version ID)

date >> system_path.log
echo "" >> system_path.log
echo "System PATH for $SLURM_JOB_ID" >> system_path.log
echo "" >> system_path.log
printf "%0.s-" {1..10} >> system_path.log
echo ${PATH} | tr : \n >> system_path.log


# Set genome assembly FastA
oly_genome_fasta=/gscratch/srlab/sam/data/O_lurida/oly_genome_assemblies/Olurida_v081.fa

# Set indexed genome assembly file
oly_genome_indexed=/gscratch/srlab/sam/data/O_lurida/oly_genome_assemblies/Olurida_v081.fa.fai

# Set sorted transcriptome assembly bam file
oly_transcriptome=/gscratch/scrubbed/samwhite/20180919_oly_transcriptome_bowtie2/20180919_Olurida_v081.sorted.bam


# Set program paths
bedtools=/gscratch/srlab/programs/bedtools-2.27.1/bin
samtools=/gscratch/srlab/programs/samtools-1.9/samtools

# Index genome FastA
${samtools} faidx ${oly_genome_fasta}

# Format indexed genome for bedtools
## Requires only two columns: namelength
awk -v OFS='t' {'print $1,$2'} ${oly_genome_indexed} > Olurida_v081.fa.fai.genome

# Create bed file
${bedtools}/bamToBed 
-i ${oly_transcriptome} 
> 20180924_oly_RNAseq.bam.bed


# Create bedgraph
## Reports depth at each position (-bg in bedgraph format) and report regions with zero coverage (-a).
## Screens for portions of reads coming from exons (-split).
## Add genome browser track line to header of bedgraph file.
${bedtools}/genomeCoverageBed 
-i ${PWD}/20180924_oly_RNAseq.bed 
-g Olurida_v081.fa.fai.genome 
-bga 
-split 
-trackline 
> 20180924_oly_RNAseq.bed

Alignment was done using the following version of the Olympia oyster genome assembly:


RESULTS:

Output folder:

Indexed and formatted genome file:

Bedgraph file (for IGV):


This doesn’t appear to have worked properly. Here’s a view of the bedgraph file:


track type=bedGraph
Contig0 0   116746  0
Contig1 0   87411   0
Contig2 0   139250  0
Contig3 0   141657  0
Contig4 0   95692   0
Contig5 0   130522  0
Contig6 0   94893   0
Contig7 0   109667  0
Contig8 0   95943   0

I’d expect multiple entries for each contig (ideally), indicating start/stop positions for where transcripts align within a given contig. However, this appears to simply be a list of all the genome contigs and their lengths (Start=0, Stop=n).

I would expect to see something li

I’ll look into this further and see where this pipeline goes wrong.

Share

Transcriptome Alignment – Olympia oyster Trinity transcriptome aligned to genome with Bowtie2

Progress on generating bedgraphs from our Olympia oyster transcriptome continues.

Transcriptome assembly with Trinity completed 20180919.

Next up, align transcriptome to Olympia oyster genome.

Alignment and creation of BAM files was done using Bowtie2 on our HPC Mox node.

SBATCH script file:

Alignment was done using the following version of the Olympia oyster genome assembly:


RESULTS:

Output folder:

Sorted BAM file:

Sorted & indexed BAM file (for IGV):

Will get the sorted BAM file converted to a bedgraph for use in IGV.

Share

Second look at Geoduck transcriptome

Last week I popped out a quick assembly and annotation on our geoduck gonadal transcriptome. A second assembly was also done using Trinity.


Updates
August 3 – Confirmed // in file location had no impact on assembly.
July 14 – TransDecoder protein annotations
10:40am – added TransDecoder results
10:29am – added Stats via Trinity


Trinity.pl 
--seqType fq 
-JM 24G 
--left /Volumes/web/cnidarian/Geo_Pool_F_GGCTAC_L006_R1_001_val_1.fq /Volumes/web/cnidarian/Geo_Pool_M_CTTGTA_L006_R1_001_val_1.fq 
--right /Volumes/web/cnidarian//Geo_Pool_F_GGCTAC_L006_R2_001_val_2.fq /Volumes/web/cnidarian//Geo_Pool_M_CTTGTA_L006_R2_001_val_2.fq 
--CPU 16 

trinity_out_dir_1B54203C.png

Output

0:999   127840
1000:1999   18164
2000:2999   5321
3000:3999   1817
4000:4999   762
5000:5999   291
6000:6999   135
7000:7999   73
8000:8999   22
9000:9999   29
10000:10999     4
11000:11999     5
12000:12999     3
13000:13999     4
14000:14999     4
15000:15999     3
16000:16999     0
17000:17999     2
18000:18999     1

Total length of sequence:   101862868 bp
Total number of sequences:  154480
N25 stats:          25% of total sequence length is contained in the 8095 sequences >= 2045 bp
N50 stats:          50% of total sequence length is contained in the 26158 sequences >= 1014 bp
N75 stats:          75% of total sequence length is contained in the 64574 sequences >= 446 bp
Total GC count:         37657770 bp
GC %:               36.97 %
hummingbird:Geo-trinity steven$ /Users/gilesg/compile/trinityrnaseq_r20131110/util/TrinityStats.pl /Volumes/web/cnidarian/Geo-trinity/trinity_out_dir/Trinity.fasta 


################################
## Counts of transcripts, etc.
################################
Total trinity transcripts:  154480
Total trinity components:   100155
Percent GC: 36.97

########################################
Stats based on ALL transcript contigs:
########################################

    Contig N10: 3444
    Contig N20: 2385
    Contig N30: 1766
    Contig N40: 1343
    Contig N50: 1014

    Median contig length: 371
    Average contig: 659.39
    Total assembled bases: 101862868


#####################################################
## Stats based on ONLY LONGEST ISOFORM per COMPONENT:
#####################################################

    Contig N10: 2999
    Contig N20: 2026
    Contig N30: 1462
    Contig N40: 1067
    Contig N50: 768

    Median contig length: 321
    Average contig: 553.88
    Total assembled bases: 55473621

Rerunning to see if double slash was a problem- did not see anything in error. Also running TransDecoder


TransDecoder Results

Ran the following

/Users/gilesg/compile/trinityrnaseq_r20131110/trinity-plugins/TransDecoder_r20131110/TransDecoder -t  /Volumes/web/cnidarian/Geo-trinity/trinity_out_dir/Trinity.fasta

This provided a peptide file with 36003 sequences.

!head /Volumes/web-1/cnidarian/Geo-trinity/Trinity.fasta.transdecoder.pep

>cds.comp100047_c0_seq2|m.5982 comp100047_c0_seq2|g.5982 ORF comp100047_c0_seq2|g.5982 comp100047_c0_seq2|m.5982 type:internal len:142 (-) comp100047_c0_seq2:3-425(-)
NAECRDLYKIFTQILSVRSQEGKIVIPDEFATKIRNWLGNKEELFKEAHNQKIITFYNEY
TREENTFNPIRGKRPMSVPDMPERKYIDQLSRKTQSQCDFCKYKTFTAEDTFGRIDSNFS
CSASNAFKLDHWHALFLLKTH


Running blastp on Trinity.fasta.transdecoder.pep

!blastp 
-query /Volumes/web/cnidarian/Geo-trinity/Trinity.fasta.transdecoder.pep 
-db /usr/local/bioinformatics/dbs/uniprot_sprot.fasta 
-evalue 1e-5 
-max_target_seqs 1 
-max_hsps 1 
-outfmt 6 
-num_threads 4 
-out /Volumes/web/cnidarian/Geo-trinity/Trinity.fasta.transdecoder.pep-blastp-uniprot-2.out

results: http://eagle.fish.washington.edu/cnidarian/Geo-trinity/Trinity.fasta.transdecoder.pep-blastp-uniprot-2.out

Share