Assembly Stats – Geoduck Hi-C Assembly Comparison

Ran the following Quast command to compare the two geoduck assemblies provided to us by Phase Genomics:

/home/sam/software/quast-4.5/quast.py \
-t 24 \
--labels 20180403_pga,20180421_pga \
"/mnt/owl/Athaliana/20180421_geoduck_hi-c/Results/geoduck_roberts results 2018-04-03 11:05:41.596285/PGA_assembly.fasta" \
"/mnt/owl/Athaliana/20180421_geoduck_hi-c/Results/geoduck_roberts results 2018-04-21 18:09:04.514704/PGA_assembly.fasta"
Results:

Quast Output folder: results_2018_04_30_11_16_04/

Quast report (HTML): results_2018_04_30_11_16_04/report.html

The two assemblies are nearly identical. Interesting…

DNA Isolation & Quantification – Metagenomics Water Filters

After discussing the preliminary DNA isolation attempt with Steven & Emma, we decided to proceed with DNA isolations on the remaining 0.22μm filters.

Isolated DNA from the following five filters:

  • Filter #10 pH 7.1 5/15/17
  • Filter #7 pH 8.2 5/15/17
  • Filter #7 pH 8.2 5/19/17
  • Filter #10 pH 7.1 5/22/17
  • Filter #10 pH 7.1 5/26/17

DNA was isolated with the DNeasy Blood & Tissue Kit (Qiagen), following a modified version of the Gram-Positive Bacteria protocol:

  • filters were unfolded and unceremoniously stuffed into 1.7mL snap cap tubes
  • did not perform enzymatic lysis step
  • filters were incubated with 400μL of Buffer AL and 50μL of Proteinase K (both are double the volumes listed in the kit and are necessary to fully coat the filter in a 1.7mL snap cap tube)
  • 56°C incubations were performed overnight
  • 400μL of 100% ethanol was added to each sample after the 56°C incubation
  • samples were eluted in 50μL of Buffer AE
  • all spins were performed at 20,000 × g

Samples were quantified with the Roberts Lab Qubit 3.0 and the Qubit 1x dsDNA HS Assay Kit.

Used 5μL of each sample for measurement (see Results for update).

Results:

Raw data (Google Sheet): 20180426_qubit_metagenomics_filters

Sample                      Concentration (ng/μL)   Initial volume (μL)   Yield (ng)
Filter #10 pH 7.1 5/15/17   0.296                   50                    14.65
Filter #7 pH 8.2 5/15/17    8.44                    50                    422
Filter #7 pH 8.2 5/19/17    2.52                    50                    126
Filter #10 pH 7.1 5/22/17   2.0                     50                    100
Filter #10 pH 7.1 5/26/17   11.9                    50                    595
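
Yield here is just concentration × elution volume; a quick sketch of that arithmetic (values taken from the table above):

# Yield (ng) = concentration (ng/uL) x elution volume (uL); all samples were
# eluted in 50 uL of Buffer AE.
elution_volume_ul = 50

samples = {
    "Filter #7 pH 8.2 5/15/17": 8.44,   # ng/uL
    "Filter #10 pH 7.1 5/26/17": 11.9,  # ng/uL
}

for name, conc in samples.items():
    print(f"{name}: {conc * elution_volume_ul:.0f} ng")
# Filter #7 pH 8.2 5/15/17: 422 ng
# Filter #10 pH 7.1 5/26/17: 595 ng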

Samples were stored in Sam gDNA Box #2, positions G8 – H3 (FTR 213, #27, small -20°C freezer).

QSAR modeling for Series 4

I tried a ligand-centric QSAR approach to model activity for Series 4. I took the S4 compounds on the April 7, 2018 Master List and separated them into two .csv files: one containing compounds with measured activities, and one with unmeasured compounds.

For both data sets, I used the SMILES strings to calculate 1D and 2D molecular descriptors using PaDEL (http://www.yapcwsoft.com/dd/padeldescriptor/), which is freely available and calculates 1,444 1D and 2D descriptors.

It was not obvious how to assemble a single activity value for each compound, as the compounds were measured in different assays, and many had censored values such as >10 or >50. For a value of >X, I entered 2X.

I then tried various regression methods in Weka (https://www.cs.waikato.ac.nz/~ml/weka/), also freely available, focusing on the more interesting machine learning methods. 20% of the data was set aside as a test set. Random forest performed by far the best, using 136 trees, with a max depth of 16, and a max of 6 features per tree.
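
A rough sketch of the activity cleanup and train/test split described above (not the actual script; file and column names are assumptions):

# Sketch of the preprocessing described above -- not the actual script.
# File names and the "activity" column name are assumptions.
import pandas as pd
from sklearn.model_selection import train_test_split

measured = pd.read_csv("s4_measured.csv")

def clean_activity(value):
    """Censored values like '>10' or '>50' were entered as 2X."""
    value = str(value).strip()
    if value.startswith(">"):
        return 2 * float(value[1:])
    return float(value)

measured["activity"] = measured["activity"].apply(clean_activity)

# PaDEL descriptor table (1,444 1D/2D descriptors); PaDEL writes compound IDs
# to a "Name" column. Rows assumed to be in the same order as the activity file.
descriptors = pd.read_csv("s4_measured_padel_descriptors.csv")
X = descriptors.drop(columns=["Name"])
y = measured["activity"]

# 20% held out as a test set, as described above.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)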

Statistics calculated for the predictions on the held-out test set:

Correlation coefficient 0.7178
Kendall's tau 0.4993
Spearman's rho 0.647
Mean absolute error 7.1934
Root mean squared error 12.3386
Relative absolute error 78.7046 %
Root relative squared error 70.3855 %
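
For reference, roughly the same random forest configuration in scikit-learn, continuing from the sketch above (the actual modeling was done in Weka; only the hyperparameters are carried over):

from sklearn.ensemble import RandomForestRegressor

# Mirror the best Weka configuration: 136 trees, max depth 16, 6 features per split.
rf = RandomForestRegressor(
    n_estimators=136, max_depth=16, max_features=6, random_state=42
)
rf.fit(X_train, y_train)
print("R^2 on held-out test set:", rf.score(X_test, y_test))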


Using xgboost (extreme gradient boosted trees, https://github.com/dmlc/xgboost) through Python, I got better results (120 estimators, max depth = 3, learning rate = 0.1, subsample = 0.9).
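
A sketch of that setup, again continuing from the preprocessing sketch above (the hyperparameters are those just listed):

from xgboost import XGBRegressor
from scipy.stats import kendalltau, spearmanr
from sklearn.metrics import mean_squared_error, r2_score

model = XGBRegressor(
    n_estimators=120, max_depth=3, learning_rate=0.1, subsample=0.9
)
model.fit(X_train, y_train)

preds = model.predict(X_test)
print("R2:", r2_score(y_test, preds))
print("Kendall's tau:", kendalltau(y_test, preds).correlation)
print("Spearman's rho:", spearmanr(y_test, preds).correlation)
print("MSE:", mean_squared_error(y_test, preds))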

R²: 0.91
Kendall's tau: 0.49
Spearman's rho: 0.62
Mean squared error: 49.7


I then trained xgboost on the total set of measured compounds (training + test). The model was saved, as were the predictions made for the training and test sets. I applied the final model to the set of untested compounds. The most promising compounds are, in order: OSM-S-486, OSM-S-433, OSM-S-536, OSM-S-538, OSM-S-204.
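
A sketch of that final step, under the same assumptions as above (file names are hypothetical, and the ascending sort assumes lower activity values, e.g. IC50-style, are better):

# Refit on all measured compounds (training + test), save the model, and
# score the untested set.
model.fit(X, y)
model.save_model("s4_xgb_final.model")

untested = pd.read_csv("s4_unmeasured_padel_descriptors.csv")
untested["predicted_activity"] = model.predict(untested.drop(columns=["Name"]))

# Most promising compounds first (ascending = lower predicted activity is better).
print(untested.sort_values("predicted_activity")[["Name", "predicted_activity"]].head())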


Total Alkalinity Calculations – Yaamini’s Ocean Chemistry Samples

I ran a subset of Yaamini’s ocean chemistry samples on our T5 Excellence titrator (Mettler Toledo) at the beginning of April. The subset comprised samples from the beginning, middle, and end of the experiment, the rationale being to assess whether total alkalinity (TA) varied across the experiment. If there was little variation, then there’d likely be no need to run all of the samples; however, should there be temporal differences, then all samples should be processed.

Data analysis was performed in the following R Project:

The R Project above was initially copied from the R Project for our titrator on GitHub:

Three separate, data-file-specific versions of the TA_calculations.R script were created and run:

Salinity values (PSU) were collected from the following spreadsheet (Google Sheet) and manually entered in each of the R scripts:

Specifically, the TA calculations were performed using the seacarb library, with the at() function.

Results:
sample_names TA_values (μmol/kg)
H1 A 2/20/17 2390.88423
H2 A 2/20/17 2393.39207
T1 A 2/20/17 2367.78791
T2 A 2/20/17 2319.39360
T3 A 2/20/17 2309.88602
T4 A 2/20/17 2287.72108
T5 A 2/20/17 2336.14773
T6 A 2/20/17 2298.36327
H1 A 3/20/17 2870.73309
H2 A 3/20/17 2760.49972
T1 A 3/20/17 2930.29308
T2 A 3/20/17 2925.95472
T3 A 3/20/17 2896.55123
T4 A 3/20/17 2769.72514
T5 A 3/20/17 2743.33934
T6 A 3/20/17 2727.94064
H1 A 4/4/17 2770.20971
H2 A 4/4/17 2656.27437
T1 A 4/4/17 2801.77913
T2 A 4/4/17 2822.51611
T3 A 4/4/17 2800.87387
T4 A 4/4/17 2584.60933
T5 A 4/4/17 2645.37017
T6 A 4/4/17 2604.22677

Well, it certainly looks like there’s some variation across the experiment. It’s likely that all remaining samples will need to be processed. Will pass along data to Yaamini for her to evaluate.

Assembly – SparseAssembler (k 111) on Geoduck Sequence Data

Continuing to try to find the best kmer setting to work with SparseAssembler after the last attempt failed due to a kmer size that was too large (k 131, which happens to be outside the max kmer size [127] for SparseAssembler), I re-ran SparseAssembler with an arbitrarily selected kmer size < 131 (k 111).

The job was run on our Mox HPC node.

Results:

Output folder:

Slurm output file:

This failed with the following error message:

Error! K-mer size too large!

Well, this is disappointing. Not entirely sure why this is the case, as it’s below the max kmer setting for SparseAssembler. However, I’m not terribly surprised, as this happened previously (using only NovaSeq data) with a kmer setting of 117.

I’ve posted an issue on the SparseAssembler GitHub page; we’ll see what happens.

Assembly – SparseAssembler (k 131) on Geoduck Sequence Data

After some runs with kmergenie, I’ve decided to try re-running SparseAssembler using a kmer setting of 131.

The job was run on our Mox HPC node.

Results:

Output folder:

Slurm output file:

This failed with the following error message:

Error! K-mer size too large!

Looking into this, it’s because the maximum kmer size for SparseAssembler is 127! Doh!

It’d be nice if the program checked that setting first before processing all the data files…
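
In the meantime, a trivial pre-flight guard ahead of job submission would catch this before any data is touched (a sketch; the 127 limit is SparseAssembler’s maximum, per above):

# Pre-flight check: bail out before submitting the job if k exceeds
# SparseAssembler's maximum kmer size (127, per above).
MAX_K = 127
K = 131  # kmer size for this run

if K > MAX_K:
    raise SystemExit(f"k={K} exceeds SparseAssembler's max kmer size ({MAX_K})")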

A bit disappointing, but I’ll give this a go with a lower kmer setting and see how it goes.

Data Management – Geoduck Phase Genomics Hi-C Data

We received sequencing/assembly data from Phase Genomics.

The data contains two assemblies, produced on two different dates.

All data is here: 20180421_geoduck_hi-c

All FASTQ files (four files; Geoduck_HiC*.gz) were copied to Nightingales:

MD5 checksums were verified and appended to the Nightingales checksum file:

Nightingales sequencing inventory was updated (Google Sheet):

The two assemblies (and assembly stats) they provided are here:

I’ve updated the project-geoduck-genome GitHub wiki with this info.

Kmer Estimation – Kmergenie (k 301) on Geoduck Sequence Data

Continuing the quest for the ideal kmer size to use for our geoduck assembly.

The previous two runs with kmergenie using the diploid setting were no good.

So, this time, I simply increased the maximum kmer size to 301 and left all other settings as default. I’m hoping this is large enough to produce a smooth curve, with a maximal value that can be determined from the output graph.

The job was run on our Mox HPC node.

Results:

Output folder:

Slurm output file:

Kmer histogram (HTML) reports:

Well, the graph is closer to what we’d expect, in that it appears to reach a peak, but after that peak we see a sharp dropoff, as opposed to a gradual dropoff that mirrors the left half. Not entirely sure what the implications of this are, but I’ll go ahead and run SparseAssembler using a kmer size of 131 and see how it goes.

Kmer Estimation – Kmergenie Tweaks on Geoduck Sequence Data

Earlier today, I ran kmergenie on all of our geoduck DNA sequencing data to see what it would spit out for an ideal kmer setting, which I would then use in another assembly attempt with SparseAssembler, just to see how the assembly might change.

The output from that kmergenie run suggested that the ideal kmer size exceeded the default maximum (k = 121), so I decided to run kmergenie a few more times, with some slight changes.

All jobs were run on our Mox HPC node.

Run 1
Run 2
Results:

Output folders:

Slurm output files:

Kmer histogram (HTML) reports:

Diploid

Diploid, k 301

Okay, well, these graphs clearly show that the diploid setting is no good.

We should be getting a nice, smooth, concave curve.

Will try running again, without diploid setting and just increasing the max kmer size.

Kmer Estimation – Kmergenie on Geoduck Sequence Data (default settings)

After the last SparseAssembler assembly completed, I wanted to do another run with a different kmer size (last time was arbitrarily set at 101). However, I didn’t really know how to decide, particularly since this assembly consisted of mixed read lengths (50bp and 100bp). So, I ran kmergenie on all of our geoduck (Panopea generosa) sequencing data in hopes of getting a kmer determination to apply to my next assembly.

The job was run on our Mox HPC node.

Slurm script: 20180419_kmergenie_geoduck_slurm.sh

Input files list (needed for kmergenie command – see Slurm script linked above): geoduck_fastq_list.txt

Results:

Output folder: 20180419_kmergenie_geoduck/

Slurm output file: 20180419_kmergenie_geoduck/slurm-161551.out

Kmer histograms (HTML): 20180419_kmergenie_geoduck/histograms_report.html

Screen cap from Kmer report:

Kmergenie estimates the best kmer size for this data set to be 121.

However, based on the kmergenie documentation, this is likely to be inaccurate: the kmer graph should be concave, but ours is only partial; we haven’t yet reached a kmer size where the number of genomic kmers starts decreasing.
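
For what it’s worth, kmergenie’s “best k” is simply the k that maximizes the predicted number of genomic kmers; a minimal sketch of that decision, assuming kmergenie’s two-column summary table of k vs. genomic kmers (the file name below is a guess, based on the default "histograms" output prefix):

# Minimal sketch of kmergenie's "best k" decision: pick the k that maximizes
# the predicted number of genomic kmers.
rows = []
with open("histograms.dat") as f:
    for line in f:
        parts = line.split()
        if len(parts) >= 2:
            try:
                rows.append((int(parts[0]), float(parts[1])))
            except ValueError:
                continue  # skip any header line

best_k, n_kmers = max(rows, key=lambda row: row[1])
print(f"best k = {best_k} ({n_kmers:.3e} genomic kmers)")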

As such, I’ll try re-running with a different maximum kmer setting (default max is 121).