I decided to run Canu myself, since documentation for Sean’s Canu run is a bit lacking. Additionally, it looks like he specified a genome size of 500Mbp, which is probably too small. For this assembly, I set the genome size to 1.9Gbp (based on the info in the BGI assembly report, using 17-mers for calculating genome size), which is probably on the large size.
Additionally, I remembered we had an old PacBio run that we had been forgetting about and thought it would be nice to have incorporated into an assembly.
See all the messy details of this in the Jupyter Notebook below, but here’s the core info about this Canu assembly.
PacBio Input files (available on Owl/nightingales/O_lurida:
m170308_163922_42134_c101174252550000001823269408211742_s1_p0_filtered_subreads.fastq.gz m170308_230815_42134_c101174252550000001823269408211743_s1_p0_filtered_subreads.fastq.gz m130619_081336_42134_c100525122550000001823081109281326_s1_p0.fastq m170315_001112_42134_c101169372550000001823273008151717_s1_p0_filtered_subreads.fastq.gz m170211_224036_42134_c101073082550000001823236402101737_s1_X0_filtered_subreads.fastq.gz m170315_063041_42134_c101169382550000001823273008151700_s1_p0_filtered_subreads.fastq.gz m170301_100013_42134_c101174162550000001823269408211761_s1_p0_filtered_subreads.fastq.gz m170315_124938_42134_c101169382550000001823273008151701_s1_p0_filtered_subreads.fastq.gz m170301_162825_42134_c101174162550000001823269408211762_s1_p0_filtered_subreads.fastq.gz m170315_190851_42134_c101169382550000001823273008151702_s1_p0_filtered_subreads.fastq.gz m170301_225711_42134_c101174162550000001823269408211763_s1_p0_filtered_subreads.fastq.gz
Canu execution command (see the Jupyter Notebook below for more info):
$time canu \ useGrid=false \ -p 20171009_oly_pacbio \ -d /home/data/20171018_oly_pacbio_canu/ \ genomeSize=1.9g \ correctedErrorRate=0.075 \ -pacbio-raw m*
Well, this took a LONG time to run; a bit over two days!
The report file contains some interesting tidbits. For instance:
- Unitgigging calculates only 1.84x coverage
- Trimming removed >5 billion (!!) bases:
867850 reads 5755379456 bases (reads with no overlaps, deleted)
- Unitigging unassembled:
unassembled: 479693 sequences, total length 2277137864 bp
I’ll compare this Canu assembly against Sean’s Canu assembly and see how things look.
Report file (text file): http://owl.fish.washington.edu/Athaliana/20171018_oly_pacbio_canu/20171018_oly_pacbio.report
Contigs Assembly (FASTA): http://owl.fish.washington.edu/Athaliana/20171018_oly_pacbio_canu/20171018_oly_pacbio.contigs.fasta
Complete Canu output directory: http://owl.fish.washington.edu/Athaliana/20171018_oly_pacbio_canu/
Jupyter Notebook (GitHub): 20171018_docker_oly_canu.ipynb