We received the final geoduck genome assembly data from Phase Genomics, in which they updated the assembly by performing some manual curation:
Additional files with supplementary assembly data are in the following directory:
The sequencing data itself and two earlier assemblies were received on 20180421.
All assembly data (both old and new) from Phase Genomics was downloaded in full from the Google Drive link they provided and stored here on Owl:
Ran Quast to compare all three assemblies provided (command run on Swoose):
"/mnt/owl/Athaliana/20180421_geoduck_hi-c/Results/geoduck_roberts results 2018-04-03 11:05:41.596285/PGA_assembly.fasta" "/mnt/owl/Athaliana/20180421_geoduck_hi-c/Results/geoduck_roberts results 2018-04-21 18:09:04.514704/PGA_assembly.fasta" /mnt/owl/Athaliana/20180822_phase_genomics_geoduck_Results/geoduck_manual/geoduck_manual_scaffolds.fasta
Quast output folder: results_2018_08_23_07_38_28/
Quast report (HTML): results_2018_08_23_07_38_28/report.html
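Two of the three input paths contain spaces, so each path has to reach Quast as a single argument. Here is a minimal sketch of building such a command safely in Python. It assumes Quast is invoked as `quast.py` with the FastA files as positional arguments. The actual executable path and any extra options used for this run are not recorded here, and `build_quast_command` is a hypothetical helper.

```python
import shlex

# Input FastA paths exactly as listed in this entry (two contain spaces).
assemblies = [
    "/mnt/owl/Athaliana/20180421_geoduck_hi-c/Results/"
    "geoduck_roberts results 2018-04-03 11:05:41.596285/PGA_assembly.fasta",
    "/mnt/owl/Athaliana/20180421_geoduck_hi-c/Results/"
    "geoduck_roberts results 2018-04-21 18:09:04.514704/PGA_assembly.fasta",
    "/mnt/owl/Athaliana/20180822_phase_genomics_geoduck_Results/"
    "geoduck_manual/geoduck_manual_scaffolds.fasta",
]


def build_quast_command(fasta_paths, executable="quast.py"):
    """Return the argument list and a shell-safe string for a Quast run.

    The executable name is an assumption; adjust it to the local install.
    """
    argv = [executable, *fasta_paths]
    # shlex.join quotes any argument containing spaces or other
    # shell-special characters, so the string can be pasted into a shell.
    return argv, shlex.join(argv)


argv, printable = build_quast_command(assemblies)
print(printable)
```

Passing `argv` to `subprocess.run` as a list avoids shell quoting entirely. The printable form is mainly useful for recording the command in a notebook entry.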
Steven asked me to create a couple of subsets of our Phase Genomics Hi-C geoduck genome assembly (pga_02):
- Contigs >10kbp
- Contigs >30kbp
I used pyfaidx on Roadrunner with the following commands:
faidx --size-range 10000,100000000 PGA_assembly.fasta > PGA_assembly_10k_plus.fasta
faidx --size-range 30000,100000000 PGA_assembly.fasta > PGA_assembly_30k_plus.fasta
Ran Quast afterwards to get stats on the new FastA files, just to confirm that the upper cutoff value (100,000,000bp) was high enough to retain the largest contig(s).
faidx output folder: 20180512_geoduck_fasta_subsets/
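The same sanity check can be sketched without Quast. The example below is a minimal, standard-library-only illustration rather than the method used here. It assumes the size range is inclusive at both ends, and the helper names (`fasta_lengths`, `in_size_range`, `longest_is_kept`) are hypothetical.

```python
def fasta_lengths(lines):
    """Map each sequence name to its length, given an iterable of FASTA lines."""
    lengths = {}
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first whitespace-delimited token of the header as the name.
            name = (line[1:].split() or [""])[0]
            lengths[name] = 0
        elif name is not None:
            lengths[name] += len(line)
    return lengths


def in_size_range(lengths, low, high):
    """Keep sequences with low <= length <= high (inclusive bounds assumed)."""
    return {name: size for name, size in lengths.items() if low <= size <= high}


def longest_is_kept(lengths, low, high):
    """True if the longest sequence overall survives the size filter."""
    kept = in_size_range(lengths, low, high)
    return bool(kept) and max(kept.values()) == max(lengths.values())


# Toy data standing in for a real assembly.
toy_fasta = [">c1", "ACGT" * 3, ">c2", "AC", ">c3", "ACGTACGTAC", "ACGTA"]
lengths = fasta_lengths(toy_fasta)
print(longest_is_kept(lengths, 10, 100))  # upper cutoff above the longest contig
print(longest_is_kept(lengths, 10, 12))   # upper cutoff too low
```

For a real file, pass an open file handle to `fasta_lengths`, for example `with open("PGA_assembly.fasta") as handle: lengths = fasta_lengths(handle)`.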
10kbp contigs (FastA): 20180512_geoduck_fasta_subsets/PGA_assembly_10k_plus.fasta
30kbp contigs (FastA): 20180512_geoduck_fasta_subsets/PGA_assembly_30k_plus.fasta
Quast output folder: results_2018_05_14_06_26_26/
Quast report (HTML): results_2018_05_14_06_26_26/report.html
Everything looks good. The main thing I wanted to confirm by running Quast was that the largest contig in each subset was the same as in the original PGA assembly (95,480,635bp).
We received sequencing/assembly data from Phase Genomics.
The data contains two assemblies, produced on two different dates.
All data is here: 20180421_geoduck_hi-c
All FASTQ files (four files; Geoduck_HiC*.gz) were copied to Nightingales:
MD5 checksums were verified and appended to the Nightingales checksum file:
Nightingales sequencing inventory was updated (Google Sheet):
The two assemblies (and assembly stats) they provided are here:
I’ve updated the project-geoduck-genome GitHub wiki with this info.
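For reference, the MD5 verification step mentioned above can be sketched in Python. This is an illustration, not the exact procedure used. It assumes the provided checksums are in `md5sum`-style lines (`<hex digest>  <filename>`), and the helper names and demo file are hypothetical.

```python
import hashlib
import os
import tempfile


def md5_of_file(path, chunk_size=1024 * 1024):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def parse_checksum_lines(lines):
    """Parse md5sum-style lines into {filename: lowercase hex digest}."""
    expected = {}
    for line in lines:
        if not line.strip():
            continue
        checksum, filename = line.rstrip("\n").split(maxsplit=1)
        # md5sum marks binary mode with a leading '*' on the filename.
        expected[filename.lstrip("*")] = checksum.lower()
    return expected


def verify(path, expected_md5):
    """True if the file's MD5 digest matches the expected value."""
    return md5_of_file(path) == expected_md5.lower()


# Demo with a throwaway file; real use would point at the Geoduck_HiC*.gz files.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo.fastq.gz")
    with open(demo_path, "wb") as handle:
        handle.write(b"hello")
    expected = parse_checksum_lines(
        ["5d41402abc4b2a76b9719d911017c592  demo.fastq.gz"]
    )
    demo_ok = verify(demo_path, expected["demo.fastq.gz"])
print(demo_ok)
```

Reading in chunks keeps memory use flat, which matters for multi-gigabyte FASTQ archives.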