Hypothesis: the pChEMBL property can be used to filter for active drug-target interactions
Start date: 2017-04-17
End date: 2017-04-17
Description: If we wish to integrate binding affinity data with other data sets, we need access to that data. ChEMBL is a CC-BY-SA data source that provides literature-reported data and a SPARQL endpoint to access it. The predicate for the pChEMBL value can be used to filter and list only active (pChEMBL>=5) interactions.
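To make the threshold concrete: pChEMBL is the negative base-10 logarithm of a molar potency or affinity value (IC50, Ki, Kd and similar), so pChEMBL>=5 corresponds to 10 µM or better. The following Python sketch only illustrates that arithmetic; the function names are hypothetical and not part of ChEMBL.

```python
import math


def pchembl_from_nanomolar(value_nm: float) -> float:
    """Return -log10 of a concentration in mol/L, given the value in nM.

    1 nM is 1e-9 mol/L, so -log10(value_nm * 1e-9) equals 9 - log10(value_nm).
    """
    if value_nm <= 0:
        raise ValueError("concentration must be positive")
    return 9.0 - math.log10(value_nm)


def is_active(pchembl: float, threshold: float = 5.0) -> bool:
    """Apply the activity cutoff used in this notebook entry (pChEMBL >= 5)."""
    return pchembl >= threshold


if __name__ == "__main__":
    # Illustrative concentrations only, not values taken from ChEMBL.
    for nm in (100000.0, 10000.0, 100.0):
        p = pchembl_from_nanomolar(nm)
        print(f"{nm:>9.1f} nM -> pChEMBL {p:.2f}, active: {is_active(p)}")
```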
Methods
- Create a SPARQL query that lists all binding affinities
- Create a SPARQL query that only selects those with a high pChEMBL value
Report
The example query for Gleevec activity data was used as a starting point. All SPARQL queries below use the following prefixes:
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
PREFIX dc: <http://purl.org/dc/elements/1.1/>
PREFIX dcterms: <http://purl.org/dc/terms/>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX cco: <http://rdf.ebi.ac.uk/terms/chembl#>
PREFIX chembl_molecule: <http://rdf.ebi.ac.uk/resource/chembl/molecule/>
Then, the predicate IRI for the pChEMBL value was looked up with a general SPARQL query that lists all predicates:
SELECT DISTINCT ?pred WHERE { [] ?pred [] }
The predicate for the pChEMBL value turned out to be http://rdf.ebi.ac.uk/terms/chembl#pChembl. Combining it with the example query gives the following query, which counts the activities with a defined pChEMBL value:
SELECT (COUNT(?activity) AS ?count) WHERE {
  ?activity a cco:Activity ;
            cco:hasMolecule ?mol ;
            cco:hasAssay ?assay ;
            cco:pChembl ?pchembl .
  ?assay cco:hasTarget ?target .
  ?target cco:hasTargetComponent ?targetcmpt .
  ?targetcmpt cco:targetCmptXref ?uniprot .
  ?uniprot a cco:UniprotRef .
}
This returns (today) 2303708 as activity count. A FILTER clause can be added to count just the actives, returning 1404136 active drug-target interactions:
SELECT (COUNT(?activity) AS ?count) WHERE {
  ?activity a cco:Activity ;
            cco:hasMolecule ?mol ;
            cco:hasAssay ?assay ;
            cco:pChembl ?pchembl .
  ?assay cco:hasTarget ?target .
  ?target cco:hasTargetComponent ?targetcmpt .
  ?targetcmpt cco:targetCmptXref ?uniprot .
  ?uniprot a cco:UniprotRef .
  FILTER(?pchembl>=5)
}
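Since the two counting queries differ only in the FILTER clause, the threshold can be treated as a parameter. The Python sketch below builds the query text for an arbitrary cutoff; the helper name count_query is hypothetical, the graph pattern is copied from the queries above, and the projection is written in the SPARQL 1.1 form (COUNT(...) AS ?count). It only produces a string and does not contact the endpoint.

```python
PREFIX = "PREFIX cco: <http://rdf.ebi.ac.uk/terms/chembl#>"

# Graph pattern shared by the counting queries in this entry.
PATTERN = """\
  ?activity a cco:Activity ;
            cco:hasMolecule ?mol ;
            cco:hasAssay ?assay ;
            cco:pChembl ?pchembl .
  ?assay cco:hasTarget ?target .
  ?target cco:hasTargetComponent ?targetcmpt .
  ?targetcmpt cco:targetCmptXref ?uniprot .
  ?uniprot a cco:UniprotRef ."""


def count_query(min_pchembl=None):
    """Return a counting query, optionally restricted to pChEMBL >= min_pchembl."""
    lines = [PREFIX, "SELECT (COUNT(?activity) AS ?count) WHERE {", PATTERN]
    if min_pchembl is not None:
        # float() guards against anything that is not a plain number.
        lines.append(f"  FILTER(?pchembl >= {float(min_pchembl)})")
    lines.append("}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(count_query())   # all activities with a pChEMBL value
    print(count_query(5))  # only the actives
```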
To return the actual interactions as three columns, the following query is used:
SELECT ?mol ?uniprot ?pchembl WHERE {
  ?activity a cco:Activity ;
            cco:hasMolecule ?mol ;
            cco:hasAssay ?assay ;
            cco:pChembl ?pchembl .
  ?assay cco:hasTarget ?target .
  ?target cco:hasTargetComponent ?targetcmpt .
  ?targetcmpt cco:targetCmptXref ?uniprot .
  ?uniprot a cco:UniprotRef .
  FILTER(?pchembl>=5)
}
If you insist on getting just old-fashioned identifiers, use a bit of substr() magic to strip the IRI prefixes:
SELECT (substr(str(?mol),47) as ?molShort)
       (substr(str(?uniprot),33) as ?uniprotShort)
       ?pchembl
WHERE {
  ?activity a cco:Activity ;
            cco:hasMolecule ?mol ;
            cco:hasAssay ?assay ;
            cco:pChembl ?pchembl .
  ?assay cco:hasTarget ?target .
  ?target cco:hasTargetComponent ?targetcmpt .
  ?targetcmpt cco:targetCmptXref ?uniprot .
  ?uniprot a cco:UniprotRef .
  FILTER(?pchembl>=5)
}
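The offsets 47 and 33 are not arbitrary: SPARQL's substr() is 1-based, so the start position is the namespace length plus one. The Python sketch below shows that relation. The molecule namespace is the chembl_molecule prefix listed earlier; the UniProt namespace is an assumption (it is not stated in this entry) that happens to be consistent with the offset 33. The helper names are hypothetical.

```python
MOLECULE_NS = "http://rdf.ebi.ac.uk/resource/chembl/molecule/"
# Assumption: UniProt cross-references use this namespace.
UNIPROT_NS = "http://purl.uniprot.org/uniprot/"


def substr_start(namespace: str) -> int:
    """1-based start index to pass to SPARQL substr() to drop the namespace."""
    return len(namespace) + 1


def short_id(iri: str, namespace: str) -> str:
    """Client-side equivalent: strip a known namespace from an IRI."""
    if not iri.startswith(namespace):
        raise ValueError(f"{iri!r} does not start with {namespace!r}")
    return iri[len(namespace):]


if __name__ == "__main__":
    print(substr_start(MOLECULE_NS), substr_start(UNIPROT_NS))
    # Example identifiers, used here only for illustration.
    print(short_id(MOLECULE_NS + "CHEMBL941", MOLECULE_NS))
    print(short_id(UNIPROT_NS + "P00519", UNIPROT_NS))
```

Stripping the namespace on the client side, as short_id does, avoids hard-coded offsets that would silently break if a namespace ever changed.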
Finally, to download all the data, curl was used with this last query saved as getData.rq:
curl -H "Accept: text/csv" --data-urlencode query@getData.rq \
  -G https://www.ebi.ac.uk/rdf/services/chembl/sparql -o data.csv
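For use inside a script, the same GET request can be sketched with the Python standard library. This is an untested approximation of the curl call, not a verified client for the endpoint: the function names are hypothetical, while the endpoint URL, the Accept header and the file names are taken from the command above.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen

ENDPOINT = "https://www.ebi.ac.uk/rdf/services/chembl/sparql"


def build_request(query: str, endpoint: str = ENDPOINT) -> Request:
    """Build a GET request with the query URL-encoded, asking for CSV."""
    url = endpoint + "?" + urlencode({"query": query})
    return Request(url, headers={"Accept": "text/csv"})


def download(query_file: str = "getData.rq", out_file: str = "data.csv") -> None:
    """Read the query from a file, send it, and save the response body."""
    with open(query_file, encoding="utf-8") as fh:
        query = fh.read()
    with urlopen(build_request(query)) as resp, open(out_file, "wb") as out:
        out.write(resp.read())


if __name__ == "__main__":
    # Only show the request URL here; call download() to actually fetch data.
    print(build_request("SELECT * WHERE { ?s ?p ?o } LIMIT 1").full_url)
```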
Conclusion:
The last command returns the drug-target interactions with a pChEMBL value greater than or equal to five. However, the SPARQL endpoint caps the number of returned rows at 1000. At this moment I am not sure how to overcome that limit.
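One approach that is commonly tried against row caps like this is paging with ORDER BY, LIMIT and OFFSET. The Python sketch below shows the idea only; it has not been run against the ChEMBL endpoint, and the server may be slow on, or refuse, deep offsets over roughly 1.4 million rows. The names paged_query, fetch_all and run_query are hypothetical, and run_query stands for any caller-supplied function that sends a query and returns a list of rows.

```python
def paged_query(base_query, page, page_size=1000,
                order_by="?mol ?uniprot ?pchembl"):
    """Append ORDER BY / LIMIT / OFFSET to a SELECT query for one page.

    A stable ORDER BY is needed so that consecutive pages do not overlap.
    """
    if page < 0 or page_size <= 0:
        raise ValueError("page must be >= 0 and page_size > 0")
    return (f"{base_query.rstrip()}\n"
            f"ORDER BY {order_by}\n"
            f"LIMIT {page_size}\n"
            f"OFFSET {page * page_size}")


def fetch_all(run_query, base_query, page_size=1000):
    """Request page after page until a page comes back shorter than page_size."""
    rows = []
    page = 0
    while True:
        batch = run_query(paged_query(base_query, page, page_size))
        rows.extend(batch)
        if len(batch) < page_size:
            return rows
        page += 1
```

If the total happens to be an exact multiple of the page size, this loop issues one extra request that returns an empty page before stopping.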
References
- Willighagen, E. L., Waagmeester, A., Spjuth, O., Ansell, P., Williams, A. J., Tkachenko, V., Hastings, J., Chen, B., Wild, D. J., May 2013. The ChEMBL database as linked open data. Journal of Cheminformatics 5 (1), 23. http://dx.doi.org/10.1186/1758-2946-5-23
- Willighagen, E., Alvarsson, J., Andersson, A., Eklund, M., Lampa, S., Lapins, M., Spjuth, O., Wikberg, J., 2011. Linking the resource description framework to cheminformatics and proteochemometrics. Journal of Biomedical Semantics 2 (Suppl 1), S6. http://www.jbiomedsem.com/content/2/S1/S6
- Gaulton, A., Hersey, A., Nowotka, M., Bento, A. P., Chambers, J., Mendez, D., Mutowo, P., Atkinson, F., Bellis, L. J., Cibrián-Uhalte, E., Davies, M., Dedman, N., Karlsson, A., Magariños, M. P., Overington, J. P., Papadatos, G., Smit, I., Leach, A. R., Jan. 2017. The ChEMBL database in 2017. Nucleic Acids Research 45 (D1), D945-D954. http://dx.doi.org/10.1093/nar/gkw1074
- Jupp, S., Malone, J., Bolleman, J., Brandizi, M., Davies, M., Garcia, L., Gaulton, A., Gehant, S., Laibe, C., Redaschi, N., Wimalaratne, S. M., Martin, M., Le Novere, N., Parkinson, H., Birney, E., Jenkinson, A. M., Jan. 2014. The EBI RDF platform: linked open data for the life sciences. Bioinformatics 30 (9), 1338-1339. http://dx.doi.org/10.1093/bioinformatics/btt765
With a stricter pChEMBL threshold the count drops further: with >6 it reduces to about 815k, with >7 to about 469k, and with >9 to about 212k activities.