# Attempted synthesis of 3-oxo-2-(4-iodophenyl)pentanenitrile

4-Iodophenylacetonitrile (250 mg, 1.03 mmol, 1 equiv.), ethyl propionate (102 mg, 0.876 mmol, 1.1 equiv.) and potassium tert-butoxide (289 mg, 2.58 mmol, 2.5 equiv.) were combined in THF (2.5 mL) in a round-bottom flask and stirred at room temperature. The reaction mixture turned dark red and warmed rapidly.
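As a quick cross-check of the quantities above, the following sketch converts the recorded masses to millimoles and equivalents. The molar masses are standard values supplied here, not figures taken from the notebook entry, and the helper names are hypothetical.

```python
# Molar masses in g/mol (standard values, assumed here; not from the entry).
MOLAR_MASS = {
    "4-iodophenylacetonitrile": 243.04,  # C8H6IN
    "ethyl propionate": 102.13,          # C5H10O2
    "potassium tert-butoxide": 112.21,   # C4H9KO
}

# Masses in mg as recorded in the entry.
MASS_MG = {
    "4-iodophenylacetonitrile": 250.0,
    "ethyl propionate": 102.0,
    "potassium tert-butoxide": 289.0,
}


def mmol(name: str) -> float:
    """Millimoles from mass in mg divided by molar mass in g/mol."""
    return MASS_MG[name] / MOLAR_MASS[name]


limiting = "4-iodophenylacetonitrile"
for name in MASS_MG:
    equiv = mmol(name) / mmol(limiting)
    print(f"{name}: {mmol(name):.3f} mmol, {equiv:.2f} equiv.")
```

With these molar masses, 102 mg of ethyl propionate works out to roughly 1.0 mmol, so the recorded mass, mmol and equivalents for that reagent may be worth rechecking against the original notes.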

The first run was stirred for 16 hours and gave a complex mixture of products by TLC. The second run was left for only 4 hours, with stirring for the first 30 min only, to limit the amount of dissolved oxygen introduced into the reaction system. This gave a clean crude TLC.

The reaction mixture was worked up by the addition of 1.0 M HCl (15 mL) to the reaction vessel. The acidified mixture was transferred to a separating funnel and the aqueous layer was extracted with DCM (3 × 20 mL). The combined organic layers were washed with brine, dried over anhydrous sodium sulfate, filtered and concentrated in vacuo to afford a red oil. TLC analysis (10:1 hexane:DCM as the eluent) indicated that the majority of the starting material had reacted.

NMR analysis revealed two main products. Our initial interpretation is that the two products are the keto and enol forms of the reaction product, based on characteristic peaks in the NMR spectra. Alternatively, the second product may be an oxygenated by-product arising from the small scale of the reaction.

# Methods of submitted model

A collaboration was formed with Davy Guan to share the data preparation for descriptor calculation and the model interpretation efforts. After curation, in which potency values for duplicate molecules were averaged, 340 unique molecules were collated into the training dataset. The SMILES structures were further curated in ChemAxon Standardizer by removing solvents and salts, neutralising any charged fragments, and adding explicit hydrogens. Three-dimensional geometries were initially constructed using the UFF/MMFF94S forcefield, then further optimised using PM7 methodologies in the gas phase. 1,825 two- and three-dimensional physicochemical descriptors were calculated for the optimised structures using the Mordred descriptor calculator package. 21 three-dimensional electronic descriptors were calculated using the CPCM continuum solvation model after further HF-3c geometry optimisation. Permutation feature importance was used to select the 50 physicochemical descriptors and 9 electronic descriptors determined to be most relevant to modelling and predicting PfATP4 potency. QSAR models mapping the 59 descriptors to the concentration values were developed in TPOT, a genetic algorithm-based method to optimise model hyperparameters. Six models were developed and ensembled by averaging their predictions.
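The duplicate-averaging step of the curation can be sketched as below. This is a minimal illustration, not the script used for the submission: the records are made up, and it assumes duplicates have already been reduced to identical SMILES strings by the standardisation step.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical (SMILES, potency) records; the values are illustrative only.
records = [
    ("CCO", 1.0),
    ("CCO", 3.0),
    ("c1ccccc1", 0.5),
]


def average_duplicates(rows):
    """Collapse repeated structures to one entry holding the mean potency."""
    grouped = defaultdict(list)
    for smiles, potency in rows:
        grouped[smiles].append(potency)
    return {smiles: mean(values) for smiles, values in grouped.items()}


curated = average_duplicates(records)
print(curated)
```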

Each of the models making up the final ensemble had a mean absolute error between 0.33 and 0.39 in 10-fold cross validation.
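For illustration, ensembling by averaging and scoring by mean absolute error can be sketched as follows. The three "models" and all numbers are hypothetical (the submission used six TPOT models), and the function names are not from the original code.

```python
from statistics import mean


def ensemble_average(per_model_predictions):
    """Average predictions across models, compound by compound."""
    return [mean(values) for values in zip(*per_model_predictions)]


def mean_absolute_error(y_true, y_pred):
    """Mean of the absolute differences between observed and predicted values."""
    return mean(abs(t - p) for t, p in zip(y_true, y_pred))


# Illustrative numbers only: three hypothetical models, four compounds.
predictions = [
    [5.0, 6.0, 7.0, 8.0],
    [5.5, 6.5, 6.5, 8.5],
    [4.5, 5.5, 7.5, 7.5],
]
observed = [5.0, 6.5, 7.0, 7.5]

combined = ensemble_average(predictions)
mae = mean_absolute_error(observed, combined)
print(combined, mae)
```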

# Making Final Predictions

To make final predictions, I will use a very simple ensemble method: my IC50 prediction and classification methods will be used in combination to make a final prediction on the compounds.

Note: One of the test compounds (OSM-LO-1) has a SMILES structure that can't be parsed by RDKit, so I will just make a hand-noted prediction of '99999' in its stead.

The final output of my method, using IC50 prediction and an LR classifier, is:

| ids | IC50 (nM) | Class | Active (IC50 Alone) | Active (Ensemble) |
| --- | --- | --- | --- | --- |
| OSM-LO-6 | 44.6769280216237 | 1 | Yes | Yes |
| OSM-LO-5 | 55.480472126021 | 1 | Yes | Yes |
| OSM-LO-2 | 76.4338544310826 | 1 | Yes | Yes |
| OSM-LO-14 | 101.638922178812 | 0 | Yes | No |
| OSM-S-692 | 106.087178493049 | 0 | Yes | No |
| OSM-LO-8 | 118.143574531465 | 1 | Yes | Yes |
| OSM-S-666 | 134.573385720318 | 1 | Yes | Yes |
| OSM-S-683 | 143.076282012967 | 1 | Yes | Yes |
| OSM-LO-10 | 226.819415668836 | 1 | Yes | No |
| OSM-LO-4 | 229.790935593512 | 1 | Yes | No |
| OSM-S-694 | 244.851037695996 | 0 | Yes | No |
| OSM-LO-9 | 248.86026691543 | 0 | Yes | No |
| OSM-S-680 | 257.773314068201 | 0 | Yes | No |
| OSM-LO-7 | 266.26974195272 | 1 | Yes | No |
| OSM-S-691 | 315.518528635974 | 0 | No | No |
| OSM-S-690 | 335.013695110697 | 1 | No | No |
| OSM-S-556 | 338.630837946452 | 1 | No | No |
| OMS-S-685 | 344.536706929361 | 0 | No | No |
| OSM-S-689 | 360.183488372355 | 0 | No | No |
| OSM-LO-11 | 397.118924578365 | 1 | No | No |
| OSM-S-662 | 426.769361853415 | 0 | No | No |
| OSM-S-668 | 453.922994482435 | 0 | No | No |
| OSM-S-693 | 456.151258271413 | 0 | No | No |
| OSM-S-669 | 465.064949085674 | 0 | No | No |
| OSM-S-672 | 565.109870023023 | 0 | No | No |
| OSM-S-687 | 579.163481827239 | 0 | No | No |
| OSM-S-670 | 601.420382543488 | 0 | No | No |
| OSM-S-678 | 614.919633090005 | 0 | No | No |
| OSM-S-673 | 668.740367852594 | 0 | No | No |
| OSM-S-676 | 788.741168089636 | 0 | No | No |
| OSM-S-675 | 896.707465441918 | 0 | No | No |
| OSM-LO-12 | 997.770340508849 | 0 | No | No |
| OSM-S-651 | 1947.51657703661 | 0 | No | No |
| OSM-LO-1 | 99999 | 0 | No | No |

Here, due to the uncertainty in the accuracy of the model, I set a cutoff of ~300 nM on the IC50 prediction alone for its contribution to the Ensemble method.
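The IC50-alone call can be sketched as a simple threshold. The function name is hypothetical and the cutoff is taken as exactly 300 nM for illustration. The rule that combines this call with the classifier output to give the Ensemble column is not spelled out in the text, so it is not reproduced here.

```python
IC50_CUTOFF_NM = 300.0  # approximate cutoff quoted in the text, assumed exact here


def active_by_ic50(ic50_nm: float, cutoff_nm: float = IC50_CUTOFF_NM) -> bool:
    """Call a compound active when its predicted IC50 is below the cutoff."""
    return ic50_nm < cutoff_nm


# A few predicted IC50 values (nM) copied from the results above.
predicted = {
    "OSM-LO-6": 44.6769280216237,
    "OSM-LO-7": 266.26974195272,
    "OSM-S-691": 315.518528635974,
    "OSM-LO-1": 99999.0,
}

calls = {name: active_by_ic50(value) for name, value in predicted.items()}
print(calls)
```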

Also, because it is uncertain how the final model will be evaluated, I've included all 3 metrics here. I would suggest relying on the Ensemble scoring as the final metric, resulting in a total of 6 predicted actives. Interestingly, there was discordance on 2 compounds, which don't have the core structure of other S4 compounds.

# Checking Classification Accuracy

Using the classifier_splits.py script, I looked at the ability to properly classify Series4 compounds as active/inactive (anything below 1 µM is active, everything above is inactive). ECFP4 (2048 bits) was used as input, with 3 different models. For each model, the calculation was run 100 times and the mean and standard deviation of the MCC values were computed:

| Model | Mean MCC | Std MCC |
| --- | --- | --- |
| LogisticRegression | 0.45 | 0.11 |
| LinearSVM | 0.41 | 0.10 |
| KNN (5 neighbors) | 0.35 | 0.09 |

All three have roughly the same performance, with LogisticRegression (lbfgs solver) being the best, but still only moderate.
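For reference, the MCC used above can be computed from the confusion matrix as in the sketch below. This is a plain-Python illustration with made-up labels, not the contents of classifier_splits.py; `sklearn.metrics.matthews_corrcoef` provides the same quantity. The helper names are hypothetical.

```python
import math
from statistics import mean, stdev


def mcc(y_true, y_pred):
    """Matthews correlation coefficient for binary 0/1 labels."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


def label_active(ic50_um, threshold_um=1.0):
    """1 (active) if the measured IC50 is below 1 uM, else 0 (inactive)."""
    return 1 if ic50_um < threshold_um else 0


def summarise(scores):
    """Mean and sample standard deviation of MCC values over repeated runs."""
    return mean(scores), stdev(scores)


# Illustrative IC50 values (uM) and predictions only.
y_true = [label_active(x) for x in [0.2, 0.8, 1.5, 3.0, 0.5, 10.0]]
y_pred = [1, 0, 0, 0, 1, 1]
print(round(mcc(y_true, y_pred), 3))
```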

# Checking In-Series Validity (Regression)

The first step is to see whether a neural network can accurately predict IC50 for S4 within the data itself. Using an 85/15 train/test split (10 times), and the 'regression-splits.py' code, we get the following output:

| Error(MUE) | FPTrainCoverage | FPTestCoverage | FPDistance | TrainSE | TestSE | DiffSE |
| --- | --- | --- | --- | --- | --- | --- |
| 0.406803206043688 | 0.7607421875 | 0.2978515625 | 1.51090081430421 | 283.157389740819 | 241.396673688237 | -41.7607160525819 |
| 0.625348493063008 | 0.75146484375 | 0.30419921875 | 1.50061470167768 | 284.266382033001 | 230.733396391119 | -53.532985641882 |
| 0.484961726471979 | 0.75732421875 | 0.3076171875 | 1.46847544306137 | 283.240459964728 | 237.660696266006 | -45.5797636987222 |
| 0.426486544983989 | 0.7353515625 | 0.408203125 | 1.52144791060531 | 279.195660887414 | 264.922063977684 | -14.2735969097294 |
| 0.503196972824924 | 0.7470703125 | 0.3583984375 | 1.5595922971518 | 281.9797972678 | 251.990136824408 | -29.9896604433914 |
| 0.48722154874268 | 0.75390625 | 0.34716796875 | 1.52362170328423 | 280.479354815024 | 259.130075717435 | -21.3492790975894 |
| 0.456902186811316 | 0.75 | 0.3115234375 | 1.5216212794691 | 281.094482407587 | 252.809307747338 | -28.2851746602484 |
| 0.389651567524616 | 0.74853515625 | 0.3212890625 | 1.35652142365003 | 282.158609097749 | 248.388753134443 | -33.7698559633056 |
| 0.489750917843627 | 0.75390625 | 0.34375 | 1.58219756455388 | 282.262265128849 | 247.462477468405 | -34.7997876604435 |
| 0.371573041126291 | 0.75244140625 | 0.318359375 | 1.26816736842709 | 283.39295100471 | 243.692955000444 | -39.6999960042661 |

Here, the error (MUE, mean unsigned error) is in log units of IC50 (log10(IC50), rather than the negative log used for pIC50). The additional columns are internal metrics for the predictive performance of a fingerprint (FP) training set in a neural network:

* FPTrainCoverage: fraction of 'on' bits in the training set (should be >0.75)

* FPTestCoverage: fraction of 'on' bits in the test set (measure of diversity)

* FPDistance: L2 norm between the average FPTrain and FPTest (should be ~<1 for accuracies of ~1 pIC50 unit or better)

* TrainSE/TestSE: measure of the Shannon entropy of the train/test sets for FPs

* DiffSE: difference in the Shannon entropies between the two sets
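The metrics listed above could be computed along the following lines. The precise definitions used in regression-splits.py are not given here, so this is one plausible reading under stated assumptions: coverage is taken as the fraction of bit positions set in at least one fingerprint, and the Shannon entropy is assumed to be the sum of per-bit binary entropies. All function names and the toy 4-bit fingerprints are hypothetical.

```python
import math


def coverage(fps):
    """Fraction of bit positions set in at least one fingerprint of the set."""
    n_bits = len(fps[0])
    return sum(any(fp[i] for fp in fps) for i in range(n_bits)) / n_bits


def mean_fp(fps):
    """Average fingerprint: per-bit frequency of 'on' across the set."""
    return [sum(column) / len(fps) for column in zip(*fps)]


def l2_distance(train, test):
    """L2 norm between the average train and average test fingerprints."""
    a, b = mean_fp(train), mean_fp(test)
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def bit_entropy(fps):
    """Sum of per-bit binary Shannon entropies (assumed definition)."""
    total = 0.0
    for p in mean_fp(fps):
        if 0.0 < p < 1.0:
            total -= p * math.log2(p) + (1 - p) * math.log2(1 - p)
    return total


# Toy 4-bit fingerprints for illustration only.
train = [[1, 0, 1, 0], [1, 1, 0, 0], [0, 1, 0, 0]]
test = [[1, 0, 0, 0]]

diff_se = bit_entropy(test) - bit_entropy(train)
print(coverage(train), coverage(test), l2_distance(train, test), diff_se)
```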

Overall, happy with the performance of this, as we seem to have _decent_ coverage of the input space (~0.75).
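To put the Error (MUE) column in context, an error in log10 units can be converted to a multiplicative (fold) error in IC50. The short calculation below uses the ten MUE values reported above; the variable names are illustrative.

```python
from statistics import mean

# Error (MUE) values from the ten train/test splits, in log10(IC50) units.
mue = [
    0.406803206043688, 0.625348493063008, 0.484961726471979,
    0.426486544983989, 0.503196972824924, 0.48722154874268,
    0.456902186811316, 0.389651567524616, 0.489750917843627,
    0.371573041126291,
]

mean_mue = mean(mue)
fold_error = 10 ** mean_mue  # a log10 error of x means a factor of 10**x
print(f"mean MUE = {mean_mue:.3f} log units, about {fold_error:.1f}-fold in IC50")
```

On these numbers the average error is a little under half a log unit, that is, predictions are typically within about a factor of three of the measured IC50.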

In this, we used ECFP4 from RDKit as our FPs (depth 4, 2048 bits), and a multi-layered NN (3 training layers with 2 extremely modest Dropout layers).


# Parameter Sweep for ECFP Depth and #bits

Next, a sweep of ECFP parameters was performed for both the LR and SVM methods, considering the following:

EC Depth = [4,5,6]

EC #Bits = [1024, 2048, 4096]

Again, 125 train/test splits were performed, and the distribution of MCC values was calculated.
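The sweep covers every combination of method, depth and bit count. A minimal sketch of enumerating that grid is shown below; the label format mirrors the `method_depth_bits` names used in the results, while the variable names are assumptions.

```python
from itertools import product

methods = ["lr", "svm"]
depths = [4, 5, 6]
n_bits = [1024, 2048, 4096]

# One label per (method, depth, bits) combination, e.g. "svm_4_2048".
grid = [f"{m}_{d}_{b}" for m, d, b in product(methods, depths, n_bits)]
print(len(grid), grid[:3])
```

Each label would then be evaluated over the repeated train/test splits and its MCC values collected.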

https://imgur.com/p5UAUFx

Here, the means of the MCC values for each model were calculated.

| Method_ECDepth_#Bits | Mean MCC |
| --- | --- |
| svm_6_1024 | 0.594463 |
| lr_6_1024 | 0.594463 |
| svm_5_1024 | 0.642099 |
| lr_5_1024 | 0.642099 |
| svm_4_1024 | 0.648357 |
| lr_4_1024 | 0.648357 |
| lr_6_2048 | 0.653864 |
| svm_6_2048 | 0.653864 |
| lr_5_2048 | 0.658755 |
| svm_5_2048 | 0.658755 |
| svm_6_4096 | 0.658897 |
| lr_6_4096 | 0.658897 |
| svm_5_4096 | 0.667104 |
| lr_5_4096 | 0.667104 |
| svm_4_4096 | 0.667121 |
| lr_4_4096 | 0.667121 |
| lr_4_2048 | 0.673339 |
| svm_4_2048 | 0.673339 |

The best methods were SVM and LR at depth 4, with 2048 bits. This gave an average MCC value of 0.67 +/- 0.1 for both.

# Initial Model Search

Picking Initial Methods

With the goal in mind of being able to classify potential S4 compounds, an initial search for a well suited classification method was undertaken. All compounds in the database that met the following criteria were used in the model search:

* Compounds with SMILES strings

* Compounds with an Ion Activity value of either 0 or 1

This resulted in 575 compounds being used. Next, the following sklearn models were used (with default settings) to classify compounds as either 0 or 1 (Ion Activity Assay), with RDKit ECFP4 (2048-bit) fingerprints as the inputs: KNN, Linear SVM, Random Forest, Naive Bayes, Decision Trees, and Logistic Regression. To determine which model was the most accurate, a train/test split (80/20) was done 125 times. For each split, a model was built for each method, and the Matthews correlation coefficient (MCC) was calculated as an unbiased and accurate measure of model accuracy. The distributions of these MCC scores for each model were then compared.
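The repeated 80/20 splitting can be sketched in plain Python as below. The function name, seed and rounding of the test-set size are assumptions for illustration; the original work may have used sklearn's own splitting utilities instead.

```python
import random


def repeated_splits(n_items, n_repeats=125, test_fraction=0.2, seed=0):
    """Yield (train_idx, test_idx) index lists for repeated random splits."""
    rng = random.Random(seed)
    n_test = round(n_items * test_fraction)
    indices = list(range(n_items))
    for _ in range(n_repeats):
        shuffled = indices[:]
        rng.shuffle(shuffled)
        yield shuffled[n_test:], shuffled[:n_test]


# 575 compounds, 125 repeats, 80/20 split as described above.
splits = list(repeated_splits(575))
train_idx, test_idx = splits[0]
print(len(splits), len(train_idx), len(test_idx))
```

For each split, every model would be fitted on the training indices and its MCC scored on the test indices, giving one list of 125 MCC values per method.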

https://imgur.com/y8uHYSi

Treating the MCC values as distributions, the Kolmogorov-Smirnov statistic was calculated to determine the p-values of distribution similarity for each method:

Method A, Method B, p-value
mcc_knn,mcc_knn,1.0
mcc_knn,mcc_svm,0.13700610573284444
mcc_knn,mcc_rf,0.007449442574861611
mcc_knn,mcc_nb,8.296026497590731e-38
mcc_knn,mcc_dt,3.5280572995108e-21
mcc_knn,mcc_lr,0.987342261870452
mcc_svm,mcc_knn,0.13700610573284444
mcc_svm,mcc_svm,1.0
mcc_svm,mcc_rf,4.409900257709484e-05
mcc_svm,mcc_nb,8.500551823859001e-41
mcc_svm,mcc_dt,3.535605015038742e-25
mcc_svm,mcc_lr,0.18293778552780215
mcc_rf,mcc_knn,0.007449442574861611
mcc_rf,mcc_svm,4.409900257709484e-05
mcc_rf,mcc_rf,1.0
mcc_rf,mcc_nb,1.677584074335309e-31
mcc_rf,mcc_dt,1.4852492791766038e-16
mcc_rf,mcc_lr,0.03647438799031367
mcc_nb,mcc_knn,8.296026497590731e-38
mcc_nb,mcc_svm,8.500551823859001e-41
mcc_nb,mcc_rf,1.677584074335309e-31
mcc_nb,mcc_nb,1.0
mcc_nb,mcc_dt,1.096954798445088e-14
mcc_nb,mcc_lr,8.296026497590731e-38
mcc_dt,mcc_knn,3.5280572995108e-21
mcc_dt,mcc_svm,3.535605015038742e-25
mcc_dt,mcc_rf,1.4852492791766038e-16
mcc_dt,mcc_nb,1.096954798445088e-14
mcc_dt,mcc_dt,1.0
mcc_dt,mcc_lr,1.2305157079847292e-20
mcc_lr,mcc_knn,0.987342261870452
mcc_lr,mcc_svm,0.18293778552780215
mcc_lr,mcc_rf,0.03647438799031367
mcc_lr,mcc_nb,8.296026497590731e-38
mcc_lr,mcc_dt,1.2305157079847292e-20
mcc_lr,mcc_lr,1.0
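The statistic behind the pairwise comparison above is the largest gap between the two empirical cumulative distributions. A plain-Python sketch is given below with made-up samples; it computes only the statistic, whereas `scipy.stats.ks_2samp` returns both the statistic and the p-value of the kind tabulated here. The function name is hypothetical.

```python
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic: max gap between empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    d = 0.0
    for x in sorted(set(a) | set(b)):
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        d = max(d, abs(cdf_a - cdf_b))
    return d


# Illustrative MCC samples only.
print(ks_statistic([0.60, 0.65, 0.70, 0.72], [0.30, 0.35, 0.62, 0.40]))
```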

The Linear SVM and Logistic Regression methods were best, with average MCC values of 0.67 +/- 0.11 and 0.64 +/- 0.11 respectively. Their MCC distributions differed significantly (p < 0.05) from those of Random Forest, Naive Bayes and Decision Trees, but not from one another (p = 0.18) or from KNN (p = 0.14 and 0.99 respectively).

Moving forward, we will explore Linear SVMs and LR as our base methods, and explore some light parameter searching to determine if we can improve the performance.

# Cleaning data for series 4 comp.

8-9-2019

Data Processing

For the purpose of supplying data for building the ML model, the data set for ION Regulation DATA was downloaded from http://tinyurl.com/OSM-Series4CompData as a .csv on Friday August 9, 2019.

Ran the attached Python script to keep Potency vs Parasite (uMol), Ion Regulation Activity, Ion Regulation Test Set and Smiles. All data rows containing NaNs were dropped.
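The attached script is not reproduced here. As a rough sketch of the same cleaning step using only the standard library, assuming the .csv headers match the column names quoted above exactly and that missing values appear as empty cells or "NaN":

```python
import csv
import io

KEEP = [
    "Potency vs Parasite (uMol)",
    "Ion Regulation Activity",
    "Ion Regulation Test Set",
    "Smiles",
]
MISSING = {"", "nan", "na"}  # assumed spellings of missing values


def clean_rows(csv_text):
    """Keep the four columns of interest and drop rows with any missing value."""
    kept = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        values = {col: (row.get(col) or "").strip() for col in KEEP}
        if all(v.lower() not in MISSING for v in values.values()):
            kept.append(values)
    return kept


# Tiny made-up example; the "Notes" column and all values are illustrative.
example = (
    "Smiles,Potency vs Parasite (uMol),Ion Regulation Activity,"
    "Ion Regulation Test Set,Notes\n"
    "CCO,0.5,1,A,x\n"
    "CCN,,0,B,y\n"
    "CCC,2.0,NaN,A,z\n"
)
cleaned = clean_rows(example)
print(cleaned)
```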

The attached output file contains the relevant data to be used in our model building.

# Synthesis of 2-Chloro-3-(4-chlorophenyl)-[1,2,4]triazolo[4,3-a]pyrazine

2-chloro-3-(4-chlorophenyl)-[1,2,4]triazolo[4,3-a]pyrazine was synthesised from the previously synthesised 2-chloro-6-(2-hydrazinyl)(4-chlorobenzylidene)pyrazine intermediate.

Reaction Scheme

Reaction 2 scheme.png

Risk assessment

4chlorobenzaldehyde reaction 2 risk assessment.pdf

Reagents

| Reagent | mass/g | moles/mmol | equivalents |
| --- | --- | --- | --- |
| 2-chloro-6-(2-hydrazinyl)(4-chlorobenzylidene)pyrazine | 0.0201 | 0.062 | 1.0 |
| diacetoxyiodobenzene | 0.0310 | 0.096 | 1.5 |

Procedure

A round-bottomed flask was charged with 2-chloro-6-(2-hydrazinyl)(4-chlorobenzylidene)pyrazine (0.0212 g, 0.067 mmol), diacetoxyiodobenzene (0.0250 g, 0.078 mmol) and dichloromethane (10 mL). The resulting solution was stirred at room temperature and monitored by TLC (7:3 ethyl acetate:light petroleum ether) until completion. The product was then isolated by column chromatography using the same solvent system (Rf = 0.4) and dried via vacuum filtration, yielding a white powder (0.0181 g, 85.7%). The product was characterised via 1H and 13C NMR, infrared and melting point analysis.

Analytical Data

4chlorobenzaldehyde reaction2 H NMR spectrum.pdf

4chlorobenzaldehyde reaction2 IR spectrum.pdf

4chlorobenzaldehyde reaction2 Mass spectrum.pdf

4chlorobenzaldehyde reaction2 C NMR spectrum.pdf

Mpt = 261-265 °C

InChI keys

2-chloro-6-(2-hydrazinyl)(4-chlorobenzylidene)pyrazine IREXIIXDQDGVMX-UHFFFAOYSA-N

diacetoxyiodobenzene  ZBIKORITPGTTGI-UHFFFAOYSA-N

2-chloro-3-(4-chlorophenyl)-[1,2,4]triazolo[4,3-a]pyrazine VCWFDFYRPOTNNX-UHFFFAOYSA-N