Hypothesis: Homo sapiens pathways in WikiPathways have DataNode’s with labels containing IUPAC names which can be tagged as type Metabolite.
Start date: 2014-09-01 End date: 2014-09-01
Description: WikiPathways entries in GPML have DataNode objects and Label objects. It was found before [here] that metabolites can be encoded in pathways is Label objects and therefore not machine-readable as Metabolite-type DataNode and unable to have a database identifier. As such, these metabolites are unusable for pathway analysis of metabolomics data.
By processing these GPML files (they are XML-based) and iterating over all Label’s we can attempt to convert this label into chemical structure with OPSIN. This goes under the assumption that if OPSIN can parse the label into a structure, it is one. This label will be recorded along with the pathway identifier for manual inspection. For each structure it will also look up a ChemSpider identifier.
- Download the GPML files from WikiPathways
- Get a working Bioclipse development version (hard) with the OPSIN, InChI, and ChemSpider extensions
- A Groovy script to iterate over the GPML, find <Label> elements
- Each <Label> is parsed with OPSIN and if successful, generate an InChI
- Use the InChIs to find ChemSpider identifiers
- Output all as a text file and open metabolites in a Structure table
Similar to the experiment for Anopheles gambiae only curated pathways were analyzed, some 266 in total, downloaded from WikiPathways.org on August 24. The previous Groovy script was updated to point to the human pathways, but also to output the results in a file, rather than STDOUT. The new script was uploaded to myExperiment.org.
The script found 42 Labels that are possibly metabolites. The full list was uploaded to Gist. Again, labels were found which could not be linked to a single ChemSpider ID. For example, “5b-Pregnane-3,20-dione” which will results in these ChemSpider search hits: 21427590, 389575, 21232692, 21239075, 21237402. The result file also shows a few labels with new lines.
One metabolite was manually confirmed in WP1449: Imidazoquinolin. Interestingly, the Label was visually “connected” with “(anti-viral compounds)” which have a ChEBI identifier and could be converted to a DataNode of type Metabolite too:
Most work, however, needs to be done in the Tryptophan metabolism pathway (WP465); many metabolites are not properly made machine readable.
Human pathways also include metabolites encoded in GPML <Label> elements, even in the curated subset.