Reproducible manuscripts are the future?

This week, a paper that was almost three years in the making finally got published. I feel confident about the paper and the results in it, not because it took three years to write, but because I used a dynamic document to produce it (e.g., Rmarkdown).

Dynamic document? Yes! I no longer had to manually enter all results from the data into tables or the text — the computer did it for me. All I had to do was point it in the right direction. Figures? The same! It saved me tons of time after I made the initial investment to learn how to use it (something else that saved me time was git version control, but that’s for another time).

Why is this important? We are all human, and we make mistakes. And that’s okay! What matters is how we remedy those mistakes when they occur and, even more importantly, whether we can change the way we work to prevent some of them in the first place. I think dynamic documents like Rmarkdown help us do so.

Markdown is a simple document language, which you can create in any text editor (Notepad works fine). All it does is standardize how headers are defined (with hash signs: # Header level 1, ## Header level 2, etc.) and how text style is defined (e.g., *text* is italic text). Subsequently, the text file can be converted to pretty much anything (e.g., html, pdf, and even a Word file for those relentless co-authors who love track changes so much).
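To make the "standardize, then convert" idea concrete, here is a toy Python sketch that handles only the two conventions just mentioned (hash-sign headers and asterisk italics); the function name is made up, and real converters such as pandoc handle far more:

```python
import re

# Toy illustration only: handles just the two conventions described above
# (hash-sign headers and *italics*); real tools like pandoc do the rest.
def mini_markdown(text):
    html = []
    for line in text.splitlines():
        heading = re.match(r"(#+) (.*)", line)
        if heading:
            level = len(heading.group(1))
            html.append(f"<h{level}>{heading.group(2)}</h{level}>")
        else:
            line = re.sub(r"\*(.+?)\*", r"<em>\1</em>", line)
            html.append(f"<p>{line}</p>")
    return "\n".join(html)

page = mini_markdown("## Header level 2\nThis is *italic* text.")
print(page)
```

Real markdown has many more rules (links, lists, code blocks), which is exactly why standardized converters exist.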

Rmarkdown takes markdown, and allows you to put R code in between text chunks (which it actually runs!) or even WITHIN the text. Yes, you read that correctly. As such, you can do analyses, make figures, format results (no more manual p-values! statcheck won’t find any errors if you use Rmarkdown) AUTOMATICALLY.

I will just show one exciting and simple aspect, but more step-by-step guides are available (if you want to follow along, install R and Rstudio).

Usually, we tend to type results into the running text ourselves, like so.

Using Rmarkdown to just write a document


As we see, RMarkdown creates a document from a very simple plain text document (this is just markdown doing what it’s supposed to). However, we have a p-value that is calculated based on that t-value and degrees of freedom. So let’s make it dynamic to ensure we have the rounding correct.

Using RMarkdown to generate a document with dynamic results, to ease result presentation


As we see, the original contained a mistake (the reported p = .027 should have been .028), and Rmarkdown allowed us to catch it by simply putting in the R code that generates and rounds that p-value (i.e., round(pt(q = 1.95, df = 69, lower.tail = FALSE), 3)). No more mistake, and we can be confident. Disclaimer: of course you can still input wrong code (garbage in, garbage out!).
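For readers outside R, the same one-sided p-value can be cross-checked with a small stdlib-only Python sketch that integrates the Student's t density numerically (t = 1.95 and df = 69 are taken from the example above; the helper name is made up):

```python
import math

# One-sided p-value P(T > 1.95) for Student's t with 69 degrees of freedom,
# mirroring round(pt(q = 1.95, df = 69, lower.tail = FALSE), 3) in R.
# t_sf is a made-up helper: it integrates the t density with the trapezoid rule.
def t_sf(t, df, upper=60.0, steps=100_000):
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    pdf = lambda x: coef * (1 + x * x / df) ** (-(df + 1) / 2)
    h = (upper - t) / steps
    area = 0.5 * (pdf(t) + pdf(upper))
    for i in range(1, steps):
        area += pdf(t + i * h)
    return area * h

p = t_sf(1.95, 69)
print(round(p, 3))  # 0.028, the corrected value from the text
```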

But this is just a simple example. You can write entire manuscripts this way. That’s what I did for our Collabra manuscript (see here [1]). You can even use citations and alter the citation style without any problem; in my experience it’s even easier with RMarkdown than with EndNote or Mendeley. All it takes is some initial time investment to learn how to work with it (Markdown can be learned in five minutes) and to change your workflow to accommodate this modern approach to writing manuscripts.

The only downside to working this way is that journals don’t accept a raw RMarkdown file as a submission, which is too bad: they could link the results directly to the code that produces them. Instead, we still end up with a document (e.g., a Word file) that hard-codes all results, as has traditionally been the case. I hope dynamic documents will become more and more widespread in the future, both in how often authors use them and in how well publishers support this type of document, to truly innovate how scholarly information is communicated and consumed. Imagine getting a highlight when you hover over a result and seeing the underlying code; it would allow you to evaluate the methods in a paper more directly and empower you as a reader to be critical of what you are presented with.

[1] I preferred LaTeX for that project and used Sweave, the LaTeX counterpart of RMarkdown.

UPDATE: This blog post has been cross-posted on both the eLife innovation blog and R-bloggers. For more R news and tutorials, please visit https://www.r-bloggers.com/.

Model exploitation: proposal for World’s First Crowd Sourced Drug Design Campaign

If, after the publication of the OSM hidden test set, our predictive model for PfATP4 Ion Regulation Activity proves to be useful, it can be effectively and thoroughly exploited by anybody once Molomics provides it in Lead Designer, an Android app for quickly and easily accessing molecular properties important in drug discovery.
Lead Designer lets you quickly sketch new molecules with a simple, fully automated touchpad drawing mechanism. For each molecule, the PfATP4 Ion Regulation Activity class and its associated prediction confidence can be calculated instantaneously, on the fly. In this way, everyone willing to participate in the OSM project, especially medicinal and synthetic chemists, can form design hypotheses for new active compounds and easily check in real time whether these compounds have a high chance of being active (according to the provided prediction model). Each user can save his or her interesting molecules to the cloud and later access them from different devices through their own account.
If the current proposal is of interest, especially to medicinal and synthetic chemists involved in the project, Lead Designer could be used for the design of new active compounds for OSM Series-4. All molecules designed for the project through Lead Designer are automatically collected in the cloud and then provided to the OSM consortium for possible synthesis and testing. As Lead Designer can involve an arbitrarily large number of participants spread around the globe, this project could result in the World's First Crowd Sourced Drug Design Campaign, which could also be interesting for publication purposes.
Please, let us know whether you would be interested in this proposal.

Final Results and Classifier Description.

A neural network meta classifier has a predictive score of AUC = 0.89 on the test molecules.

 

The Meta Classifier

  • Each predictive model based on fingerprints or another SMILES-based descriptor vector, such as DRAGON, brings a certain amount of predictive power to the task of assessing likely molecular activity against PfATP4.

  • What the meta classifier does is combine the predictive power of each model in an optimal way to produce a more predictive composite model.

  • It does this by taking as its input the probability maps (the outputs) of other classifiers.

  • The two models chosen as inputs to the meta model are:

    1. A Neural Network model that uses the [DRAGON](http://www.vcclab.org/lab/edragon/) molecular descriptor to estimate molecular PfATP4 ion regulation activity directly. This model had modest predictive power of AUC=0.77.
    2. A logistic classifier that uses the Morgan fingerprints (mol radius = 5) to predict the EC50 <= 500 nMol class. This model has a predictive power of AUC=0.93 for the test molecules. 
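As a rough illustration of the stacking step, here is a Python sketch that combines two base-model probability outputs with a logistic-regression meta classifier; all data below are synthetic stand-ins, so nothing here reproduces the actual DRAGON or Morgan base models or their AUCs:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins for the two base models' outputs: each base classifier
# emits P(ACTIVE) for every molecule (its "probability map").
n = 200
y = rng.integers(0, 2, n)                      # true ACTIVE (1) / INACTIVE (0)
p_dragon = np.clip(0.6 * y + rng.normal(0.2, 0.2, n), 0.01, 0.99)
p_morgan = np.clip(0.7 * y + rng.normal(0.15, 0.15, n), 0.01, 0.99)
X = np.column_stack([p_dragon, p_morgan])      # meta-classifier inputs

# Logistic-regression meta classifier fitted by plain gradient descent:
# it learns how much to trust each base model.
w, b = np.zeros(2), 0.0
for _ in range(2000):
    prob = 1.0 / (1.0 + np.exp(-(X @ w + b)))
    w -= 0.5 * (X.T @ (prob - y)) / n
    b -= 0.5 * np.mean(prob - y)

p_meta = 1.0 / (1.0 + np.exp(-(X @ w + b)))    # composite P(ACTIVE)
print(round(float(np.mean((p_meta > 0.5) == y)), 2))
```

The composite model weights each input by how informative it is, which is why stacking can beat either base model alone.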


Detailed Results.

 

AUC 0.89

Confusion Matrix

 

true/predict   ACTIVE   INACTIVE
ACTIVE             17          1
INACTIVE            7         10


Molecule Classification

 

ID Actual_Class Pred_Class Prob_ACTIVE
OSM-S-272 ACTIVE ACTIVE 0.5870
OSM-S-366 INACTIVE ACTIVE 0.5868
OSM-S-378 ACTIVE ACTIVE 0.5854
OSM-S-389 ACTIVE ACTIVE 0.5846
OSM-S-390 ACTIVE ACTIVE 0.5835
OSM-S-353 ACTIVE ACTIVE 0.5831
OSM-S-175 ACTIVE ACTIVE 0.5830
OSM-S-376 ACTIVE ACTIVE 0.5828
OSM-S-383 ACTIVE ACTIVE 0.5820
OSM-S-369 ACTIVE ACTIVE 0.5819
OSM-S-218 ACTIVE ACTIVE 0.5819
OSM-S-370 ACTIVE ACTIVE 0.5810
OSM-S-380 ACTIVE ACTIVE 0.5808
OSM-S-293 ACTIVE ACTIVE 0.5804
OSM-S-385 ACTIVE ACTIVE 0.5797
OSM-S-384 ACTIVE ACTIVE 0.5795
OSM-S-279 ACTIVE ACTIVE 0.5794
OSM-S-368 INACTIVE ACTIVE 0.5790
OSM-S-386 INACTIVE ACTIVE 0.5784
OSM-S-363 INACTIVE ACTIVE 0.5772
OSM-S-367 INACTIVE ACTIVE 0.5713
OSM-S-373 INACTIVE ACTIVE 0.5703
OSM-S-204 INACTIVE ACTIVE 0.5698
OSM-S-379 ACTIVE ACTIVE 0.5689
OSM-S-201 INACTIVE INACTIVE 0.4121
OSM-S-374 INACTIVE INACTIVE 0.3391
OSM-S-254 INACTIVE INACTIVE 0.2554
OSM-S-372 INACTIVE INACTIVE 0.2453
OSM-S-371 ACTIVE INACTIVE 0.1007
OSM-S-278 INACTIVE INACTIVE 0.0772
OSM-S-375 INACTIVE INACTIVE 0.0584
OSM-S-364 INACTIVE INACTIVE 0.0189
OSM-S-382 INACTIVE INACTIVE 0.0172
OSM-S-387 INACTIVE INACTIVE 0.0004
OSM-S-388 INACTIVE INACTIVE 0.0004

 


The Classification Software.

The Meta Classifier runs on Linux and Windows under Python 2.7 and 3.5 (Mac untested):

  1. Download the entire directory tree from Google Drive [here](https://drive.google.com/drive/folders/0B0Rfx1fjhlsaZU1MenhlYVc5TVU). You can also download the software from GitHub [here](https://github.com/kellerberrin/OSM-QSAR). However, the Google Drive version already has the required directory tree.

  2. Make sure you have activated the python anaconda environment as described in "readme.md".

Then go to the directory where you copied the software and simply execute the prepared batch files:

On Windows:

text code:
c>osm_comp

On Linux:

text code:
$ chmod 777 ./osm_comp
$ ./osm_comp

You can also execute the meta model directly from the command line (--help for flag descriptions; the --clean flag is optional and removes previous results from the model directory):

text code:
python OSM_QSAR.py --classify osm --load ION_META --epoch 40 --train 0 [--clean]

You can also classify the molecules proposed by @spadavec in issue #486 (looks like some strong leads here Vito)  by changing the input data file (--data OSMData4MMP.csv):

text code:
python OSM_QSAR.py --classify osm --load ION_META --epoch 40 --train 0 --data OSMData4MMP.csv

The classification results are found in "./Work/osm/test" and "./Work/osm/train". The statistics files contain 3 classifications. The first two are the classifier results that feed into the meta classifier. 

If you want to explore further, then you could train a neural network to classify molecules for EC50 <= 500nMol potency with the Morgan (mol=5) fingerprint using the following command:

text code:
python OSM_QSAR.py --classify bin_m --train 500 --check 25 --depend EC50_500 --indep MORGAN2048_5
 
This trains the neural network for 500 epochs and checkpoints (saves) the neural network every 25 epochs. The results for each checkpoint are concatenated and will be in the directories "./Work/bin_d/test" and "./Work/bin_d/train".
 

PfATP4 Ion Regulation Activity classification model

We developed several PfATP4 Ion Regulation Activity classification models using different strategies for modeling set sampling, different machine learning methods and different descriptors. Here we report the best performing one.

Data and approach 

The total set of 455 compounds with experimental PfATP4 Ion Regulation Activity was submitted to Molomics' standard chemical structure curation protocol, similar to the one described by Fourches et al.1 A curated set of 445 unique molecules was obtained.

For model development, validation and exploitation we followed an internal protocol based on QSAR best practices as defined in the literature2,3. The final curated set was split into:

  • a modeling set containing 150 compounds that was subsequently split for internal validation into multiple randomly chosen, response-stratified training and test sets. The internal validation used a 10-fold cross-validation procedure.

  • an external validation set containing 295 compounds.

The OSM competition set consists of 35 compounds obtained from the original data file provided by OSM consortium for this competition. The 35 compounds are those where the Ion Regulation Test Set column is equal to “A,B”, “B” and “C”. Predictions for these compounds were extracted from the test and external validation sets.

The molecules were described with 23 non-highly-correlated (property-based) molecular descriptors and ECFC4 structural fingerprints hashed into vectors of length 1024. The machine learning technique used to build the model was an ensemble (Random Forest-like) decision-tree model. The best resulting model uses 15 trees (average tree depth = 15.3; average number of nodes = 47.9).

Results

Results were analyzed using standard assessment metrics common in virtual screening, reported for 3 compound sets (the OSM competition, internal validation, and external validation sets):

  • confusion matrix (counting correct and wrong classified molecules)

  • accuracy = (TP+TN)/N

  • sensitivity of active molecules. Sensitivity = TP/(TP+FN)

  • specificity of active molecules. Specificity = TN/(TN+FP)

  • balanced accuracy of active molecules. This is very important when compound activity is distributed across heavily unbalanced classes, as in the case of OSM. Balanced accuracy = (sensitivity+specificity)/2

  • precision of active molecules. Precision = TP/(TP+FP)

  • Area Under the Curve (AUC) of active molecules

Where TP, TN, FP and FN are True Positives, True Negatives, False Positives and False Negatives, respectively. Active molecules are those with Ion Regulation Activity class = 1.
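These definitions can be verified in a few lines of Python against the OSM competition confusion matrix reported below; note that the reported accuracy counts exact three-class matches (correct/N), while the remaining metrics pool classes 0 and 0.5 as non-active:

```python
# Rows: experimental class (0, 1, 0.5); columns: predicted class (0, 1, 0.5).
cm = [[10, 2, 1],
      [3, 15, 0],
      [1, 3, 0]]

TP = cm[1][1]                                       # active predicted active
FN = cm[1][0] + cm[1][2]                            # active predicted non-active
FP = cm[0][1] + cm[2][1]                            # non-active predicted active
TN = sum(cm[r][c] for r in (0, 2) for c in (0, 2))  # non-active predicted non-active

N = sum(sum(row) for row in cm)
correct = cm[0][0] + cm[1][1] + cm[2][2]            # exact three-class matches
accuracy = correct / N
sensitivity = TP / (TP + FN)
specificity = TN / (TN + FP)
balanced_accuracy = (sensitivity + specificity) / 2
precision = TP / (TP + FP)
print(round(accuracy, 3), round(sensitivity, 3), round(specificity, 3),
      round(balanced_accuracy, 3), round(precision, 3))
```

Running this reproduces the values in the assessment-metrics table for the OSM competition set.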

 

 

OSM competition compounds general results

Confusion matrix:

 

Experimental \ Predicted   Inactive (0)   Active (1)   Partial (0.5)
Inactive (0)                         10            2               1
Active (1)                            3           15               0
Partial (0.5)                         1            3               0


Assessment metrics:

Assessment metric              Value
Correctly classified           25
Wrongly classified             10
Accuracy                       0.714
Sensitivity of actives         0.833
Specificity of actives         0.706
Balanced accuracy of actives   0.770
Precision of actives           0.75
AUC                            0.810


 

OSM competition compounds individual results

Here we report the individual prediction class for each OSM competition test compound and the class prediction probability for the 3 model classes (i.e. 0, 0.5 and 1).  

Molecule_ID  Actual class  Predicted class  P(class=0.0)  P(class=1.0)  P(class=0.5)
OSM-S-218    1             1                0             1             0
OSM-S-378    1             1                0             1             0
OSM-S-373    0             0                0.933         0.067         0
OSM-S-372    0             0                0.867         0.133         0
OSM-S-390    1             1                0.067         0.867         0.067
OSM-S-370    1             1                0.2           0.8           0
OSM-S-254    0.5           0                0.733         0.2           0.067
OSM-S-385    1             1                0.267         0.733         0
OSM-S-375    0             0                0.667         0.267         0.067
OSM-S-388    0             0                0.667         0.333         0
OSM-S-382    0             0                0.667         0.333         0
OSM-S-387    0             0                0.667         0.333         0
OSM-S-278    0.5           1                0.333         0.667         0
OSM-S-389    1             1                0.267         0.6           0.133
OSM-S-374    0             1                0.4           0.6           0
OSM-S-204    0.5           1                0.333         0.6           0.067
OSM-S-279    1             1                0.267         0.533         0.2
OSM-S-383    1             1                0.4           0.533         0.067
OSM-S-371    1             0                0.533         0.4           0.067
OSM-S-201    0             0.5              0.267         0.2           0.533
OSM-S-379    1             1                0.4           0.533         0.067
OSM-S-369    1             1                0.333         0.533         0.133
OSM-S-175    1             1                0.4           0.533         0.067
OSM-S-272    1             1                0.467         0.533         0
OSM-S-380    1             0                0.533         0.467         0
OSM-S-363    0             0                0.533         0.4           0.067
OSM-S-353    1             1                0.467         0.533         0
OSM-S-376    1             1                0.133         0.533         0.333
OSM-S-364    0             0                0.533         0.4           0.067
OSM-S-384    1             1                0.467         0.533         0
OSM-S-368    0.5           1                0.4           0.533         0.067
OSM-S-386    0             0                0.533         0.467         0
OSM-S-366    0             1                0.333         0.533         0.133
OSM-S-367    0             0                0.467         0.467         0.067
OSM-S-293    1             0                0.467         0.467         0.067



Internal validation compounds general results

Confusion matrix:

 

Experimental \ Predicted   Inactive (0)   Active (1)   Partial (0.5)
Inactive (0)                        107            3               1
Active (1)                           13           22               0
Partial (0.5)                         1            3               0

 

Assessment metrics:

Assessment metric              Value
Correctly classified           129
Wrongly classified             21
Accuracy                       0.860
Sensitivity of actives         0.629
Specificity of actives         0.948
Balanced accuracy of actives   0.788
Precision of actives           0.786
AUC                            0.860

 



External validation compounds general results

Confusion matrix:

Experimental \ Predicted   Inactive (0)   Active (1)   Partial (0.5)
Inactive (0)                        272            3               0
Active (1)                           12            8               0
Partial (0.5)                         0            0               0

 

Assessment metric              Value
Correctly classified           280
Wrongly classified             15
Accuracy                       0.949
Sensitivity of actives         0.400
Specificity of actives         0.989
Balanced accuracy of actives   0.695
Precision of actives           0.727
AUC                            0.835

 

Model statistical significance

In order to assess the statistical significance of the model performance, we developed 100 similar models using bootstrapped sampling of the modeling set and 100 response-permuted models in which the compound response (i.e. the Ion Regulation Activity class) was randomly permuted across all compounds. The balanced accuracy distribution of the 100 bootstrapped models is shown in figure 3, and that of the Y-randomized models in figure 4. In both cases the balanced accuracy is calculated for active molecules (i.e. Ion Regulation Activity = 1).

Figure 3. Distribution of balanced accuracy for active molecules in bootstrapped samples.



Figure 4. Distribution of balanced accuracy for active molecules in response-randomized samples.

It can be seen from figures 3 and 4 that the balanced accuracy distributions of the two experiment sets do not overlap at all, which indicates that the model's performance is statistically significant.
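The bootstrap versus Y-randomization comparison can be sketched as follows; the data and the simple threshold classifier are synthetic stand-ins (not the actual decision-tree ensemble), chosen only to show why the two balanced accuracy distributions separate when real signal is present:

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy stand-in for the modeling set: one informative descriptor, binary response.
n = 150
y = rng.integers(0, 2, n)
x = y + rng.normal(0.0, 0.8, n)          # descriptor that carries real signal

def balanced_accuracy(y_true, y_pred):
    sens = np.mean(y_pred[y_true == 1] == 1)
    spec = np.mean(y_pred[y_true == 0] == 0)
    return (sens + spec) / 2

def fit_predict(x_tr, y_tr, x_te):
    # Threshold classifier: midpoint between the two class means.
    thr = (x_tr[y_tr == 0].mean() + x_tr[y_tr == 1].mean()) / 2
    return (x_te > thr).astype(int)

boot, perm = [], []
for _ in range(100):
    idx = rng.integers(0, n, n)                           # bootstrap resample
    boot.append(balanced_accuracy(y, fit_predict(x[idx], y[idx], x)))
    y_perm = rng.permutation(y)                           # Y-randomization
    perm.append(balanced_accuracy(y_perm, fit_predict(x, y_perm, x)))

print(round(float(np.mean(boot)), 2), round(float(np.mean(perm)), 2))
```

With the permuted responses the classifier has nothing to learn, so its balanced accuracy collapses toward 0.5, while the bootstrapped models stay well above it.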

 

References

1 Denis Fourches, Eugene Muratov, Alexander Tropsha “Trust, But Verify: On the Importance of Chemical Structure Curation in Cheminformatics and QSAR Modeling Research” J. Chem. Inf. Model. 2010, 50, 1189-1204.

 

2 Alexander Tropsha “Best Practices for QSAR Model Development, Validation, and Exploitation” Mol. Inf. 2010 Volume 29, Issue 6-7, Pages 476–488.

 

3 Lennart Eriksson, Joanna Jaworska, Andrew P Worth, Mark T D Cronin, Robert M McDowell, Paola Gramatica “Methods for Reliability and Uncertainty Assessment and for Applicability Evaluations of Classification- and Regression-Based QSARs”, Environ. Health Perspect. 2003, 111(10): 1361–1375.


Latest submission

Small tweak in the weighting of the scoring function. Increase weighting of molecules that are similar to the reference (highest affinity) ligand.

 

OSM-S363, 5.8

OSM-S364, 6.3

OSM-S365, 6.4 

OSM-S368, 5.6

OSM-S369, 5.6

OSM-S370, 6.0

OSM-S371, 6.0

OSM-S372, 5.6

OSM-S373, 5.8

OSM-S374, 5.8

OSM-S375, 6.3

Implementing Multitask Models to Improve Testset Performance

INTRODUCTION

Multitask machine learning algorithms train on and predict more than one output. These models have been found to deliver higher prediction performance than single-task models, especially in domains where data is limited. This competition features a small dataset, so utilising all available relevant data is crucial to produce a useful model for the unseen validation chemicals. Previous data analysis found the included ChemBL EC50 data to be non-linearly correlated with the OSM EC50 data, so it is hypothesised that non-linear multitask modelling methodologies will feature higher performance than singletask models.

GOAL 

This experiment aims to implement multitask models using the OSM and ChemBL EC50 data in the provided competition dataset and compare their testset prediction performance to single task models.

DATASET PREPARATION

The ChemBL EC50 data was extracted from a previous analysis as the Mean_AltEC50 and appended to the training dataset. The numerical Mean_AltEC50 values were stored in an adjacent column to the OSM EC50 values. 

MODELLING METHODOLOGY

Multitask variants of the Progressive Neural Network (DT-PGN), Deep Neural Network (DT-DNN), and Graph Convolution (DT-GraphConv) machine learning algorithms modelled both tasks in the training dataset, while a Progressive Neural Network modelling only OSM EC50 was chosen as the representative single-task model (ST-PGN). The PGN and DNN algorithms modelled 1024-bit ECFP fingerprints against their respective endpoints, while DT-GraphConv modelled graph featurizations of each molecule. An 80/10/10 training/test/validation split of the dataset was used to train and evaluate each model. All model hyperparameters were optimised for the best held-out validation prediction performance. The external testset consists of the 37 molecules in the combined OSM Testset.

Hyperparameter       ST-PGN      DT-PGN      DT-DNN
Layers               2           2           2
Layer dimensions     1000, 500   1500, 1500  1500, 1500
Dropouts per layer   0.15, 0.1   0.1, 0.1    0.1, 0.1
Number of epochs     100         100         100
Optimizer            Adam        Adam        Adam
Batch size           100         32          32
Penalty              0.0001      0.001       0.0001
Learning rate        0.001       0.001       0.001

DT-GraphConv architecture/hyperparameters:

  • Total Layers: 10
  • Layer Configuration: 2x(Convolutional, Normalization, Pooling)
  • Number of epochs: 100
  • Optimizer: Adam
  • Batch size: 128
  • Learning rate: 0.001 
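A minimal numpy sketch of the hard parameter sharing behind the multitask DNN: one hidden layer shared by both tasks with a linear output head per task, trained jointly. Dimensions, data, and the plain-gradient-descent training loop are all simplified stand-ins, not the actual models benchmarked above:

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic stand-ins: a descriptor matrix plus two correlated regression
# targets playing the roles of the OSM and mean ChemBL EC50 tasks
# (dimensions shrunk well below 1024-bit fingerprints for speed).
n, d, h = 200, 64, 32
X = rng.normal(0.0, 1.0, (n, d))
signal = X @ rng.normal(0.0, d ** -0.5, d)
Y = np.column_stack([signal + rng.normal(0.0, 0.1, n),
                     0.8 * signal + rng.normal(0.0, 0.1, n)])

# Hard parameter sharing: one hidden layer shared by both tasks and a
# linear output head per task, trained jointly on the summed MSE.
W1 = rng.normal(0.0, (2 / d) ** 0.5, (d, h)); b1 = np.zeros(h)
W2 = rng.normal(0.0, (2 / h) ** 0.5, (h, 2)); b2 = np.zeros(2)
lr = 0.02

def forward(X):
    H = np.maximum(X @ W1 + b1, 0.0)   # shared ReLU representation
    return H, H @ W2 + b2              # one output column per task

_, P = forward(X)
loss0 = float(np.mean((P - Y) ** 2))   # joint loss before training
for _ in range(300):
    H, P = forward(X)
    G = 2.0 * (P - Y) / n              # gradient of the joint loss
    GH = (G @ W2.T) * (H > 0)
    W2 -= lr * (H.T @ G); b2 -= lr * G.sum(0)
    W1 -= lr * (X.T @ GH); b1 -= lr * GH.sum(0)

_, P = forward(X)
loss = float(np.mean((P - Y) ** 2))
print(loss < loss0)
```

Because the hidden layer receives gradients from both heads, whatever it learns for one task is available to the other, which is the mechanism the multitask hypothesis relies on.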

RESULTS  

The multitask DT-PGN and DT-DNN models featured higher external testset performance than the singletask ST-PGN model, while the multitask DT-GraphConv model featured lower external testset performance than the singletask model. Raw predictions for each Testset molecule are in the attached spreadsheet.

There is a substantial prediction performance difference between the Internal Validation and External Testset for all models.

Model         Training (MAE)  Internal Validation (MAE)  External Testset BC (MAE)
ST-PGN        0.680365333     8.040726457                2.957931574
DT-PGN        0.77677925      6.45179363                 2.527557844
DT-GraphConv  1.57414035      5.825520611                3.824818007
DT-DNN        1.026544839     5.363081693                2.791228748

 COMMENTS AND CONCLUSIONS 

  • The multitask DT-PGN and DT-DNN models perform better than their singletask counterpart for OSM EC50 prediction.
  • Multitask Graph Convolutional models continue to underperform compared to previous findings.
  • The substantial performance difference between the Internal and External Validation datasets may indicate the molecules in the external testset are not well represented in the training dataset. Future experiments should substitute the training/test/validation splitting of the training dataset with a K-fold cross validation methodology in order to maximise the usage of chemicals in the training set.
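The suggested K-fold alternative can be sketched as follows; ridge-regularized least squares stands in for the PGN model, and all data are synthetic:

```python
import numpy as np

rng = np.random.default_rng(4)

# Synthetic regression data standing in for the fingerprint/EC50 dataset.
X = rng.normal(0.0, 1.0, (120, 8))
y = X @ rng.normal(0.0, 1.0, 8) + rng.normal(0.0, 0.3, 120)

def kfold_mae(X, y, k=5):
    # Every sample lands in the held-out fold exactly once.
    folds = np.array_split(rng.permutation(len(y)), k)
    maes = []
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        # Ridge-regularized least squares as a stand-in for the PGN model.
        A = X[train].T @ X[train] + 1e-3 * np.eye(X.shape[1])
        w_hat = np.linalg.solve(A, X[train].T @ y[train])
        maes.append(np.mean(np.abs(X[test] @ w_hat - y[test])))
    return float(np.mean(maes))

mae = kfold_mae(X, y)
print(round(mae, 3))
```

Every molecule is held out exactly once, so the validation MAE uses the whole training set rather than a single 10% slice.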

Data Received – Olympia oyster PacBio Data

Back in December 2016, we sent off Ostrea lurida DNA to the UW PacBio sequencing facility. This is an attempt to fill in the gaps left from the BGI genome sequencing project.

See the GitHub Wiki dedicated to this for an overview of this UW PacBio sequencing.

I downloaded the data to http://owl.fish.washington.edu/nightingales/O_lurida/20170323_pacbio/ using the required browser plugin, Aspera Connect. Technically, saving the data to a subfolder within a given species’ data folder goes against our data management plan (DMP) for high-throughput sequencing data, but the sequencing data output is far different from what we normally receive from an Illumina sequencing run. Instead of just FASTQ files, we received the following from each PacBio SMRT cell (we had 10 SMRT cells run):

├── Analysis_Results
│   ├── m170211_224036_42134_c101073082550000001823236402101737_s1_X0.1.bax.h5
│   ├── m170211_224036_42134_c101073082550000001823236402101737_s1_X0.2.bax.h5
│   ├── m170211_224036_42134_c101073082550000001823236402101737_s1_X0.3.bax.h5
│   └── m170211_224036_42134_c101073082550000001823236402101737_s1_X0.bas.h5
├── filter
│   ├── data
│   │   ├── control_reads.cmp.h5
│   │   ├── control_results_by_movie.csv
│   │   ├── data.items.json
│   │   ├── data.items.pickle
│   │   ├── filtered_regions
│   │   │   ├── m170211_224036_42134_c101073082550000001823236402101737_s1_X0.1.rgn.h5
│   │   │   ├── m170211_224036_42134_c101073082550000001823236402101737_s1_X0.2.rgn.h5
│   │   │   └── m170211_224036_42134_c101073082550000001823236402101737_s1_X0.3.rgn.h5
│   │   ├── filtered_regions.fofn
│   │   ├── filtered_subread_summary.csv
│   │   ├── filtered_subreads.fasta
│   │   ├── filtered_subreads.fastq
│   │   ├── filtered_summary.csv
│   │   ├── nocontrol_filtered_subreads.fasta
│   │   ├── post_control_regions.chunk001of003
│   │   │   └── m170211_224036_42134_c101073082550000001823236402101737_s1_X0.1.rgn.h5
│   │   ├── post_control_regions.chunk002of003
│   │   │   └── m170211_224036_42134_c101073082550000001823236402101737_s1_X0.3.rgn.h5
│   │   ├── post_control_regions.chunk003of003
│   │   │   └── m170211_224036_42134_c101073082550000001823236402101737_s1_X0.2.rgn.h5
│   │   ├── post_control_regions.fofn
│   │   └── slots.pickle
│   ├── index.html
│   ├── input.fofn
│   ├── input.xml
│   ├── log
│   │   ├── P_Control
│   │   │   ├── align.cmpH5.Gather.log
│   │   │   ├── align.plsFofn.Scatter.log
│   │   │   ├── align_001of003.log
│   │   │   ├── align_002of003.log
│   │   │   ├── align_003of003.log
│   │   │   ├── noControlSubreads.log
│   │   │   ├── summaryCSV.log
│   │   │   ├── updateRgn.noCtrlFofn.Gather.log
│   │   │   ├── updateRgn_001of003.log
│   │   │   ├── updateRgn_002of003.log
│   │   │   └── updateRgn_003of003.log
│   │   ├── P_ControlReports
│   │   │   └── statsJsonReport.log
│   │   ├── P_Fetch
│   │   │   ├── adapterRpt.log
│   │   │   ├── overviewRpt.log
│   │   │   └── toFofn.log
│   │   ├── P_Filter
│   │   │   ├── filter.rgnFofn.Gather.log
│   │   │   ├── filter.summary.Gather.log
│   │   │   ├── filter_001of003.log
│   │   │   ├── filter_002of003.log
│   │   │   ├── filter_003of003.log
│   │   │   ├── subreadSummary.log
│   │   │   ├── subreads.subreadFastq.Gather.log
│   │   │   ├── subreads.subreads.Gather.log
│   │   │   ├── subreads_001of003.log
│   │   │   ├── subreads_002of003.log
│   │   │   └── subreads_003of003.log
│   │   ├── P_FilterReports
│   │   │   ├── loadingRpt.log
│   │   │   ├── statsRpt.log
│   │   │   └── subreadRpt.log
│   │   ├── master.log
│   │   └── smrtpipe.log
│   ├── metadata.rdf
│   ├── results
│   │   ├── adapter_observed_insert_length_distribution.png
│   │   ├── adapter_observed_insert_length_distribution_thumb.png
│   │   ├── control_non-control_readlength.png
│   │   ├── control_non-control_readlength_thumb.png
│   │   ├── control_non-control_readquality.png
│   │   ├── control_non-control_readquality_thumb.png
│   │   ├── control_report.html
│   │   ├── control_report.json
│   │   ├── filter_reports_adapters.html
│   │   ├── filter_reports_adapters.json
│   │   ├── filter_reports_filter_stats.html
│   │   ├── filter_reports_filter_stats.json
│   │   ├── filter_reports_filter_subread_stats.html
│   │   ├── filter_reports_filter_subread_stats.json
│   │   ├── filter_reports_loading.html
│   │   ├── filter_reports_loading.json
│   │   ├── filtered_subread_report.png
│   │   ├── filtered_subread_report_thmb.png
│   │   ├── overview.html
│   │   ├── overview.json
│   │   ├── post_filter_readlength_histogram.png
│   │   ├── post_filter_readlength_histogram_thumb.png
│   │   ├── post_filterread_score_histogram.png
│   │   ├── post_filterread_score_histogram_thumb.png
│   │   ├── pre_filter_readlength_histogram.png
│   │   ├── pre_filter_readlength_histogram_thumb.png
│   │   ├── pre_filterread_score_histogram.png
│   │   └── pre_filterread_score_histogram_thumb.png
│   ├── toc.xml
│   └── workflow
│       ├── P_Control
│       │   ├── align.cmpH5.Gather.sh
│       │   ├── align.plsFofn.Scatter.sh
│       │   ├── align_001of003.sh
│       │   ├── align_002of003.sh
│       │   ├── align_003of003.sh
│       │   ├── noControlSubreads.sh
│       │   ├── summaryCSV.sh
│       │   ├── updateRgn.noCtrlFofn.Gather.sh
│       │   ├── updateRgn_001of003.sh
│       │   ├── updateRgn_002of003.sh
│       │   └── updateRgn_003of003.sh
│       ├── P_ControlReports
│       │   └── statsJsonReport.sh
│       ├── P_Fetch
│       │   ├── adapterRpt.sh
│       │   ├── overviewRpt.sh
│       │   └── toFofn.sh
│       ├── P_Filter
│       │   ├── filter.rgnFofn.Gather.sh
│       │   ├── filter.summary.Gather.sh
│       │   ├── filter_001of003.sh
│       │   ├── filter_002of003.sh
│       │   ├── filter_003of003.sh
│       │   ├── subreadSummary.sh
│       │   ├── subreads.subreadFastq.Gather.sh
│       │   ├── subreads.subreads.Gather.sh
│       │   ├── subreads_001of003.sh
│       │   ├── subreads_002of003.sh
│       │   └── subreads_003of003.sh
│       ├── P_FilterReports
│       │   ├── loadingRpt.sh
│       │   ├── statsRpt.sh
│       │   └── subreadRpt.sh
│       ├── Workflow.details.dot
│       ├── Workflow.details.html
│       ├── Workflow.details.svg
│       ├── Workflow.profile.html
│       ├── Workflow.rdf
│       ├── Workflow.summary.dot
│       ├── Workflow.summary.html
│       └── Workflow.summary.svg
├── filtered_subreads.fasta.gz
├── filtered_subreads.fastq.gz
├── m170211_224036_42134_c101073082550000001823236402101737_s1_X0.metadata.xml
└── nocontrol_filtered_subreads.fasta.gz

That’s 20 directories and 127 files – for a single SMRT cell!

Granted, there is the familiar FASTQ file (filtered_subreads.fastq), which is likely what will be used for downstream analysis, but it’s hard to decide how to manage this data under the guidelines of our current DMP. It’s possible we might separate the data files from the numerous other files (which are, essentially, metadata), but we need to decide which file type(s) (e.g. .h5 files, .fastq files) will serve as the data files people will rely on for analysis. So, for the time being, this is how the data will be stored.

I’ll update the readme file to reflect the addition of the top level folders (e.g. ../20170323_pacbio/170210_PCB-CC_MS_EEE_20kb_P6v2_D01_1/).

I’ll also update the GitHub Wiki.

qPCR – Ava’s RLO Transmission Samples

Ava provided me with a list of samples that needed to be qPCR’d (Google Sheet): qPCR redos 30117.xlsx

Here’s a list of samples that had no liquid left in them (likely due to evaporation). I added 5uL of nuclease-free water to each sample in hopes of gleaning some data from them:

14
22
37
38
46
48
49
50
52
55
61
65
116
127
149
152
155
157
158

For the following samples, I used up the remainder of the sample for these qPCR reactions:

60F1
120
136

Standard curve was p18RK7 from 20161128.

All samples were run in duplicate.

Master mix calcs are here (Google Sheet): 20170322 – qPCR WSN1 Ava Samples 01

Plate layouts, cycling params, etc. can be seen in the corresponding qPCR Reports (see Results below).

Baseline threshold was manually set to 580, based on Lisa’s development of the withering syndrome qPCR assay.

 

Results:

All but the final plate look good (standard curve-wise). Will re-run last plate next week.

qPCR Report (PDF): Sam_2017-03-22 07-24-02_CC009827.pdf
qPCR Data File (CFX96): Sam_2017-03-22 07-24-02_CC009827.pcrd

qPCR Report (PDF): Sam_2017-03-22 08-54-50_CC009827.pdf
qPCR Data File (CFX96): Sam_2017-03-22 08-54-50_CC009827.pcrd

qPCR Report (PDF): Sam_2017-03-22 10-25-58_CC009827.pdf
qPCR Data File (CFX96): Sam_2017-03-22 10-25-58_CC009827.pcrd

qPCR Report (PDF): Sam_2017-03-22 11-54-57_CC009827.pdf
qPCR Data File (CFX96): Sam_2017-03-22 11-54-57_CC009827.pcrd

qPCR Report (PDF): Sam_2017-03-22 13-23-37_CC009827.pdf
qPCR Data File (CFX96): Sam_2017-03-22 13-23-37_CC009827.pcrd

qPCR Report (PDF): Sam_2017-03-22 14-51-55_CC009827.pdf
qPCR Data File (CFX96): Sam_2017-03-22 14-51-55_CC009827.pcrd

qPCR Report (PDF): Sam_2017-03-22 16-19-59_CC009827.pdf
qPCR Data File (CFX96): Sam_2017-03-22 16-19-59_CC009827.pcrd

qPCR Report (PDF): Sam_2017-03-23 06-54-02_CC009827.pdf
qPCR Data File (CFX96): Sam_2017-03-23 06-54-02_CC009827.pcrd

 

NOTE: This needs to be re-run, due to a wonky rep of one of the points of the standard curve.
qPCR Report (PDF): Sam_2017-03-23 08-24-59_CC009827.pdf
qPCR Data File (CFX96): Sam_2017-03-23 08-24-59_CC009827.pcrd

Assessing the correlation between OSM and ChemBL EC50 values (Data Analysis 1)

INTRODUCTION

The OSM competition spreadsheet contains a column labelled "Alternative EC50 from Chembl (uM)". While it is currently unclear how these values were acquired, their presence in the spreadsheet allows a brief analysis to determine whether they correlate with the desired modelling target, "Potency vs Parasite (uMol)". A correlation between these two activities would enable multitask regression modelling, which could improve performance on the test datasets.

GOAL 

Determine the correlation between OSM and ChemBL EC50 values within the provided competition dataset.

DATASET PREPARATION

Since multiple ChemBL EC50 values may be present within a single cell, all 359 ChemBL EC50 values were extracted from the competition dataset and split into separate columns using Excel's Text to Columns feature. This resulted in multiple value columns for each row (OSM molecule). These values were averaged to consolidate the multiple values into a single representative value in a new column called "Mean_AltEC50". Potency vs Parasite (uMol) EC50 values were then carefully inserted adjacent to their corresponding ChemBL EC50 data.
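The consolidation step above can also be sketched in a few lines of Python instead of Excel. The column name below matches the spreadsheet, but the rows and OSM codes are invented purely for illustration:

```python
# Pure-Python sketch of the Mean_AltEC50 consolidation; rows are toy data.
rows = [
    {"OSM Code": "OSM-S-A", "Alternative EC50 from Chembl (uM)": "0.5, 1.5"},
    {"OSM Code": "OSM-S-B", "Alternative EC50 from Chembl (uM)": "2.0"},
]

def mean_ec50(cell):
    """Average every numeric EC50 value found in a single spreadsheet cell."""
    values = [float(v) for v in cell.split(",") if v.strip()]
    return sum(values) / len(values)

for row in rows:
    row["Mean_AltEC50"] = mean_ec50(row["Alternative EC50 from Chembl (uM)"])
```

A scripted version of this step would also avoid the manual row alignment described above.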

DATA ANALYSIS METHODOLOGY

The OSM and ChemBL EC50 values were graphed with a scatterplot in Microsoft Excel. Linear, logarithmic, power, and exponential trendlines were fitted to this data. The R^2 values were used as a measure of the correlation between the OSM and ChemBL EC50 values.
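For reference, Excel's four trendline R^2 values can be reproduced outside of Excel by linearising the data before an ordinary least-squares fit, which is how Excel computes R^2 for its non-linear trendlines. A minimal sketch (the data passed in the test is toy data, not the competition values):

```python
import math

def linfit(x, y):
    """Ordinary least-squares fit y = a*x + b; returns (a, b, r_squared)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    a = sxy / sxx
    b = my - a * mx
    ss_res = sum((yi - (a * xi + b)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - my) ** 2 for yi in y)
    return a, b, 1 - ss_res / ss_tot

def trendline_r2(x, y):
    """R^2 for Excel's four trendline types via linearisation:
    logarithmic y = a*ln(x) + b, exponential y = a*e^(b*x), power y = a*x^b."""
    lx = [math.log(xi) for xi in x]
    ly = [math.log(yi) for yi in y]
    return {
        "Linear": linfit(x, y)[2],
        "Logarithmic": linfit(lx, y)[2],
        "Exponential": linfit(x, ly)[2],
        "Power": linfit(lx, ly)[2],
    }
```

Note that the power and exponential R^2 values are computed on the log-transformed data, so they are not directly comparable to the linear R^2 on the raw values.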

RESULTS  

The logarithmic, exponential, and linear trendlines show a poor correlation between OSM and ChemBL EC50 values (R^2 < 0.1). The power trendline fits somewhat better (R^2 = 0.187).

Trendline Type    R^2
Linear            0.02343
Logarithmic       0.02357
Exponential       0.09176
Power             0.18672

Figure: DA1Figure.png (scatterplot of OSM vs ChemBL EC50 values with fitted trendlines)

COMMENTS AND CONCLUSIONS 

The correlation between ChemBL and OSM EC50 values is non-linear. As such, this correlation could be utilised by multitask neural network models to potentially enhance their predictive performance compared to single task models. The performance of dual task models compared to single task models should be investigated in a follow-up experiment.

Future analyses should generate some form of identification that is compatible with Excel's VLOOKUP function instead of relying on sorting the entire dataset.
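In a scripted workflow the lookup suggested above can be done as a keyed join on the OSM codes, with no sorting required. The potency values below are taken from the test-set table in this notebook, while the Mean_AltEC50 values and the extra code are invented for illustration:

```python
# Hypothetical dictionaries keyed by OSM code; a join by key replaces
# VLOOKUP-plus-sorting and silently drops codes missing from either table.
potency = {"OSM-S-175": 0.3475, "OSM-S-201": 4.5956}
mean_alt_ec50 = {"OSM-S-175": 0.51, "OSM-S-999": 1.2}  # invented values

# Keep only the codes present in both tables.
merged = {code: (potency[code], mean_alt_ec50[code])
          for code in potency.keys() & mean_alt_ec50.keys()}
```

Joining by key also makes any unmatched molecules explicit, rather than relying on two sorted ranges staying in register.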

In Silico Model Prediction of Testset B and C EC50 values (Single Task Modelling Part 2)

GOAL 

To assess the prediction performance of the Progressive Neural Network model on the held-out "B" and "C" test sets.

DATASET PREPARATION

The molecules labelled "B" and "C" in the Ion Regulation Test Set column were combined to create a single, 37-molecule test dataset. An additional target column was created by transforming the associated "Potency vs Parasite (uMol)" values for these molecules by log10(EC50 + 1).
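The forward transform and its inverse (needed later to compare predictions back in uMol) are simple to sketch:

```python
import math

def to_log(ec50_umol):
    """log10(EC50 + 1) transform applied to the Potency vs Parasite values."""
    return math.log10(ec50_umol + 1)

def from_log(y):
    """Inverse transform, recovering an EC50 in uMol from a model output."""
    return 10 ** y - 1
```

The +1 offset keeps the transform defined at EC50 = 0 and maps it to 0, while compressing the large dynamic range of the assay values.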

MODELLING METHODOLOGY

A Progressive Neural Network model was constructed using the datasets described in Part 1 and the hyperparameters listed below. This model was used to predict the 37 log10(EC50 + 1) transformed "Potency vs Parasite (uMol)" values in the test set. The log10(x + 1) transformation was then reversed for all predictions to enable comparison with the true "Potency vs Parasite (uMol)" values of the test set.

Progressive neural network hyperparameters:

  • Layers: 2
  • Layer dimensions: 1000
  • Number of epochs: 50
  • Dropouts per layer: 0.25
  • Optimizer: Adam
  • Batch size: 100
  • Loss: Root Mean Square Error

RESULTS  

The Root Mean Squared Error calculated for the Progressive neural network model on the combined test set was 4.1340 uMol.

The true and predicted Potency vs Parasite (uMol) values are displayed below. 

OSM Code  Ion Regulation Test Set  Potency (uMol)  PGN_ST_predictions
OSM-S-367 A,B 8.1938 2.404632188
OSM-S-380 A,B 0.11 3.151741719
OSM-S-175 B 0.3475 7.300927026
OSM-S-201 B 4.5956 7.719285267
OSM-S-204 B 0.9018 5.808532719
OSM-S-218 B 0.1105 0.366073106
OSM-S-254 B 0.7744 1.420859794
OSM-S-272 B 0.1078 0.68316042
OSM-S-278 B 4.2154 5.616926461
OSM-S-279 B 0.314275 2.844591687
OSM-S-293 B 0.13 0.987342693
OSM-S-353 B 0.1137 1.776545003
OSM-S-366 B 0.4349 1.629969458
OSM-S-376 B 0.5767 1.354778073
OSM-S-377 B 0.01668 0.153477093
OSM-S-378 B 10 2.057914063
OSM-S-379 B 0.3292 2.85889783
OSM-S-381 B 0.02432 0.957832692
OSM-S-389 B 0.1408 2.532740452
OSM-S-390 B 0.074 1.758208853
OSM-S-363 C 10 2.540437817
OSM-S-364 C 10 0.619554596
OSM-S-368 C 2.239 1.436717336
OSM-S-369 C 0.251 0.985902954
OSM-S-370 C 1.995 3.386147645
OSM-S-371 C 0.372 4.859706705
OSM-S-372 C 10 7.033399774
OSM-S-373 C 10 14.63480484
OSM-S-374 C 10 9.961609289
OSM-S-375 C 10 1.156782548
OSM-S-382 C 10 10.05324279
OSM-S-383 C 0.135 1.212617192
OSM-S-384 C 0.928 1.202246959
OSM-S-385 C 8.586 2.308562764
OSM-S-386 C 4.801 4.868107745
OSM-S-387 C 10 1.059725732
OSM-S-388 C 10 14.21659369
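The RMSE quoted above can be recomputed from (true, predicted) pairs like those in the table. The three pairs below are copied from the table purely to illustrate the formula; they do not reproduce the full 4.1340 uMol figure, which covers all 37 molecules:

```python
import math

def rmse(true_vals, pred_vals):
    """Root mean squared error in the original uMol units."""
    sq_err = [(t - p) ** 2 for t, p in zip(true_vals, pred_vals)]
    return math.sqrt(sum(sq_err) / len(sq_err))

# First three test-set pairs from the table above.
true_vals = [8.1938, 0.11, 0.3475]
pred_vals = [2.404632188, 3.151741719, 7.300927026]
subset_rmse = rmse(true_vals, pred_vals)
```

Computing the metric in the untransformed units, as done here, matches how the 4.1340 uMol result was reported.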

COMMENTS AND CONCLUSIONS 

Many EC50 predictions from this initial modelling effort are in the wrong order of magnitude relative to the actual assay results, which indicates the need to reduce the RMSE well below 4 uM in order to produce a predictive model. This could be achieved by further model tuning (at the risk of overfitting the test set), by multitask/transfer learning on related assay activities to make better use of the limited data, and by dataset augmentation to expand the applicability domain of the in silico models and enhance prediction performance for the Series 4 compounds of the Test Sets.