EW9: retrieving active drug-target interactions from ChEMBL

Hypothesis: the pChEMBL property can be used to filter for active drug-target interactions

Start date: 2017-04-17 End date: 2017-04-17

Description: If we wish to integrate binding affinity data with other data sets, we need access to that data. ChEMBL is a CC-BY-SA data source that provides literature-reported data and a SPARQL endpoint to access it. The predicate for the pChEMBL value can be used to filter for, and list, only active (pChEMBL>=5) interactions.
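
For context on the threshold: ChEMBL defines pChEMBL as the negative base-10 logarithm of a molar potency or affinity value (IC50, EC50, Ki, Kd and similar), so pChEMBL>=5 corresponds to 10 µM or better. A minimal Python sketch of that conversion follows. The function names are hypothetical and the input is assumed to be in nM:

```python
import math


def pchembl_from_nm(value_nm: float) -> float:
    """Convert a potency/affinity value given in nM to a pChEMBL-style value.

    -log10(value in M) equals 9 - log10(value in nM).
    """
    if value_nm <= 0:
        raise ValueError("value must be positive")
    return 9.0 - math.log10(value_nm)


def is_active(pchembl: float, threshold: float = 5.0) -> bool:
    """Apply the activity cut-off used in this entry (pChEMBL >= 5)."""
    return pchembl >= threshold
```

With this definition, 10000 nM (10 µM) sits exactly on the cut-off, and weaker values fall below it.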

Methods

  1. Create a SPARQL query that lists all binding affinities
  2. Create a SPARQL query that only selects those with a high pChEMBL value

Report

The example query for activity data for Gleevec was used as a starting point. All SPARQL queries below use the following prefixes:

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
PREFIX dc: <http://purl.org/dc/elements/1.1/>
PREFIX dcterms: <http://purl.org/dc/terms/>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX cco: <http://rdf.ebi.ac.uk/terms/chembl#>
PREFIX chembl_molecule: <http://rdf.ebi.ac.uk/resource/chembl/molecule/>

Then, the predicate IRI for the pChEMBL value was looked up with a general SPARQL query that lists all predicates in use:

SELECT DISTINCT ?pred
 WHERE {
  [] ?pred []
 }

The predicate for the pChEMBL value turned out to be http://rdf.ebi.ac.uk/terms/chembl#pChembl. Combining it with the example query gives the following query, which counts the number of activities with a defined pChEMBL value:

SELECT (COUNT(?activity) AS ?count)
 WHERE {
  ?activity a cco:Activity ;
  cco:hasMolecule ?mol ;
  cco:hasAssay ?assay ;
  cco:pChembl ?pchembl .
  ?assay cco:hasTarget ?target .
  ?target cco:hasTargetComponent ?targetcmpt .
  ?targetcmpt cco:targetCmptXref ?uniprot .
  ?uniprot a cco:UniprotRef .
 }

This returns (today) 2303708 as activity count. A FILTER clause can be added to count just the actives, returning 1404136 active drug-target interactions:

SELECT (COUNT(?activity) AS ?count)
 WHERE {
  ?activity a cco:Activity ;
  cco:hasMolecule ?mol ;
  cco:hasAssay ?assay ;
  cco:pChembl ?pchembl .
  ?assay cco:hasTarget ?target .
  ?target cco:hasTargetComponent ?targetcmpt .
  ?targetcmpt cco:targetCmptXref ?uniprot .
  ?uniprot a cco:UniprotRef .
  FILTER(?pchembl>=5)
 }

To return the actual interactions as three columns, the following query is used:

SELECT ?mol ?uniprot ?pchembl
 WHERE {
  ?activity a cco:Activity ;
  cco:hasMolecule ?mol ;
  cco:hasAssay ?assay ;
  cco:pChembl ?pchembl .
  ?assay cco:hasTarget ?target .
  ?target cco:hasTargetComponent ?targetcmpt .
  ?targetcmpt cco:targetCmptXref ?uniprot .
  ?uniprot a cco:UniprotRef .
  FILTER(?pchembl>=5)
 }

If you insist on getting plain old-fashioned identifiers instead of IRIs, use a bit of substr() magic to strip the IRI prefixes:

SELECT (substr(str(?mol),47) as ?molShort)
 (substr(str(?uniprot),33) as ?uniprotShort)
 ?pchembl
 WHERE {
  ?activity a cco:Activity ;
  cco:hasMolecule ?mol ;
  cco:hasAssay ?assay ;
  cco:pChembl ?pchembl .
  ?assay cco:hasTarget ?target .
  ?target cco:hasTargetComponent ?targetcmpt .
  ?targetcmpt cco:targetCmptXref ?uniprot .
  ?uniprot a cco:UniprotRef .
  FILTER(?pchembl>=5)
 }
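
The offsets 47 and 33 are simply the length of the IRI prefix plus one, because SPARQL substr() counts from 1. A small Python sketch to double-check them follows. The molecule prefix is taken from the prefix list above, whereas the UniProt IRI base is an assumption, and the helper names are hypothetical:

```python
# Prefix from the PREFIX list used in this entry.
MOLECULE_PREFIX = "http://rdf.ebi.ac.uk/resource/chembl/molecule/"
# Assumed base of the UniProt IRIs returned by the query.
UNIPROT_PREFIX = "http://purl.uniprot.org/uniprot/"


def sparql_substr_start(prefix: str) -> int:
    """Return the 1-based start index to pass to SPARQL substr()."""
    return len(prefix) + 1


def strip_prefix(iri: str, prefix: str) -> str:
    """Mimic substr(str(?x), start) using Python's 0-based slicing."""
    start = sparql_substr_start(prefix)
    return iri[start - 1:]
```

If the IRI scheme ever changes, recomputing the offsets this way is less error-prone than counting characters by hand.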

Finally, to download all the data, curl was used with this last query saved as getData.rq:

curl -H "Accept: text/csv" --data-urlencode query@getData.rq \
 -G https://www.ebi.ac.uk/rdf/services/chembl/sparql -o data.csv
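
For use from a script, the same request can be put together with the Python standard library. This is a sketch mirroring the curl call above, not something that was run for this entry. The function name build_request is hypothetical, and the request is only built here, not sent:

```python
from urllib.parse import urlencode
from urllib.request import Request

# Endpoint URL as used in the curl command.
ENDPOINT = "https://www.ebi.ac.uk/rdf/services/chembl/sparql"


def build_request(query: str, endpoint: str = ENDPOINT) -> Request:
    """Build (but do not send) a GET request asking for CSV results."""
    url = endpoint + "?" + urlencode({"query": query})
    return Request(url, headers={"Accept": "text/csv"})


# Sending it needs network access, so it is left commented out here:
# from urllib.request import urlopen
# with open("getData.rq") as rq, open("data.csv", "wb") as out:
#     with urlopen(build_request(rq.read())) as response:
#         out.write(response.read())
```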

Conclusion:

The last command returns the drug-target interactions with a pChEMBL value greater than or equal to five. However, the SPARQL endpoint limits the number of returned rows to 1000. At this moment I am not sure how to overcome that limit.
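
One commonly used way around a row cap on SPARQL endpoints is to page through the results with LIMIT and OFFSET over a stable ORDER BY. Whether the ChEMBL endpoint honours this was not tested here, so the following Python sketch rests on that assumption. The function name, page size and sort variables are hypothetical choices. It only builds the query strings, each of which could then be sent like getData.rq:

```python
from typing import Iterator


def paged_queries(
    base_query: str,
    total_rows: int,
    page_size: int = 1000,
    order_by: str = "?mol ?uniprot ?pchembl",
) -> Iterator[str]:
    """Yield copies of base_query with ORDER BY / LIMIT / OFFSET appended.

    Assumes base_query ends with the closing brace of its WHERE clause and
    that the endpoint accepts these solution modifiers (untested).
    """
    if page_size <= 0:
        raise ValueError("page_size must be positive")
    for offset in range(0, total_rows, page_size):
        yield (
            f"{base_query.rstrip()}\n"
            f"ORDER BY {order_by}\n"
            f"LIMIT {page_size} OFFSET {offset}"
        )
```

For the 1404136 active interactions counted above, a page size of 1000 would mean 1405 separate requests, so a short pause between requests would be the polite thing to do.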

References

  • Willighagen, E. L., Waagmeester, A., Spjuth, O., Ansell, P., Williams, A. J., Tkachenko, V., Hastings, J., Chen, B., Wild, D. J., May 2013. The ChEMBL database as linked open data. Journal of Cheminformatics 5 (1), 23+. http://dx.doi.org/10.1186/1758-2946-5-23
  • Willighagen, E., Alvarsson, J., Andersson, A., Eklund, M., Lampa, S., Lapins, M., Spjuth, O., Wikberg, J., 2011. Linking the resource description framework to cheminformatics and proteochemometrics. Journal of Biomedical Semantics 2 (Suppl 1), S6+. http://www.jbiomedsem.com/content/2/S1/S6
  • Gaulton, A., Hersey, A., Nowotka, M., Bento, A. P., Chambers, J., Mendez, D., Mutowo, P., Atkinson, F., Bellis, L. J., Cibrián-Uhalte, E., Davies, M., Dedman, N., Karlsson, A., Magariños, M. P., Overington, J. P., Papadatos, G., Smit, I., Leach, A. R., Jan. 2017. The ChEMBL database in 2017. Nucleic Acids Research 45 (D1), D945-D954. http://dx.doi.org/10.1093/nar/gkw1074
  • Jupp, S., Malone, J., Bolleman, J., Brandizi, M., Davies, M., Garcia, L., Gaulton, A., Gehant, S., Laibe, C., Redaschi, N., Wimalaratne, S. M., Martin, M., Le Novere, N., Parkinson, H., Birney, E., Jenkinson, A. M., Jan. 2014. The EBI RDF platform: linked open data for the life sciences. Bioinformatics 30 (9), 1338-1339. http://dx.doi.org/10.1093/bioinformatics/btt765