I tried a ligand-centric QSAR approach to model activity for Series 4. I took the S4 compounds on the April 7, 2018 Master List and separated them into two .csv files: one containing compounds with measured activities and one containing unmeasured compounds.
For both data sets, I used the SMILES strings to calculate 1D and 2D molecular descriptors with PaDEL (http://www.yapcwsoft.com/dd/padeldescriptor/), which is freely available and calculates 1,444 1D and 2D descriptors.

It was not obvious how to assemble a single activity value per compound, since the compounds were measured in different assays and many had censored values such as >10 or >50. For a value reported as >X, I entered 2X.

I then tried various regression methods in Weka (https://www.cs.waikato.ac.nz/~ml/weka/), also freely available, focusing on the more interesting machine learning methods. 20% of the data was held out as a test set. Random forest performed by far the best, using 136 trees with a maximum depth of 16 and a maximum of 6 features considered per split.
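The censored-value handling described above (substituting 2X for a reported >X) can be sketched as a small parsing function. The function name and example values here are hypothetical, not taken from the actual Master List:

```python
def parse_activity(value):
    """Convert an activity string to a float, doubling censored '>X' values."""
    text = str(value).strip()
    if text.startswith(">"):
        # A value reported as >X is entered as 2X, per the scheme above
        return 2.0 * float(text[1:])
    return float(text)

# Toy examples of the two kinds of reported values
activities = [">10", "3.2", ">50", "0.8"]
print([parse_activity(v) for v in activities])  # [20.0, 3.2, 100.0, 0.8]
```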
Statistics for the predicted vs measured activities on the held-out test set:
Correlation coefficient 0.7178
Kendall's tau 0.4993
Spearman's rho 0.647
Mean absolute error 7.1934
Root mean squared error 12.3386
Relative absolute error 78.7046 %
Root relative squared error 70.3855 %
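The modelling above was done in Weka, but for readers working in Python, an equivalent scikit-learn setup with the same reported hyperparameters (136 trees, max depth 16, 6 features per split, 20% holdout) might look like the following. The descriptor matrix here is random placeholder data, not real PaDEL output:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))             # placeholder for the 1,444 PaDEL descriptors
y = rng.normal(loc=10, scale=5, size=100)  # placeholder activity values

# 20% of the data held out as a test set, as in the write-up
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Hyperparameters mirror the Weka random forest settings reported above
model = RandomForestRegressor(
    n_estimators=136, max_depth=16, max_features=6, random_state=0)
model.fit(X_train, y_train)
preds = model.predict(X_test)
```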
Using xgboost (extreme gradient boosted trees, https://github.com/dmlc/xgboost) from Python, I got better results (120 estimators, max depth = 3, learning rate = 0.1, subsample = 0.9).
Kendall's tau: 0.49
Spearman's rho: 0.62
Mean squared error: 49.7.
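The rank-correlation metrics reported above can be computed with SciPy and scikit-learn from paired measured/predicted activities. The values below are toy numbers, not the real test-set predictions:

```python
from scipy.stats import kendalltau, spearmanr
from sklearn.metrics import mean_squared_error

# Toy measured vs predicted activities (one discordant pair)
measured  = [1.0, 3.5, 2.2, 8.0, 5.1]
predicted = [1.4, 2.8, 3.0, 7.1, 6.0]

tau, _ = kendalltau(measured, predicted)   # rank agreement over all pairs
rho, _ = spearmanr(measured, predicted)    # correlation of the rank orders
mse = mean_squared_error(measured, predicted)
print(round(tau, 2), round(rho, 2), round(mse, 2))  # 0.8 0.9 0.58
```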
I then trained xgboost on the total set of measured compounds (training + test). The model was saved, along with the predictions made for the training and test sets. I applied this final model to the set of untested compounds. The most promising compounds are, in order: OSM-S-486, OSM-S-433, OSM-S-536, OSM-S-538, OSM-S-204.