# EW8: replacing CAS registry numbers with free ChEBI identifiers

Hypothesis: CAS registry numbers can be replaced with free ChEBI identifiers.

Start date: 2014-12-07 End date: YYYY-MM-DD

Description: CAS registry numbers are non-free identifiers for chemical substances. BridgeDb does not have identifier mappings for them, because creating such (large) mapping databases is not legally allowed without explicit, non-transferable approval. Because these mapping databases are lacking, PathVisio cannot map experimental data to <gpml:DataNode>s in WikiPathways that carry such identifiers.

The goal of this experiment is to replace CAS registry numbers with ChEBI identifiers, for which many more mappings are available in the identifier mapping files provided by BridgeDb.

Methods

• generate a list of <gpml:DataNode>s with CAS registry data sources on WikiPathways
• for each, verify the chemical identity with the CAS reference database common-chemistry.org
• based on this chemical identity look up the matching ChEBI entry
• replace the identifier in WikiPathways (e.g. via GPML editing)
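
The first step of this list could be sketched in Python as follows. This is only a minimal sketch, not a script I actually ran: it assumes the CAS data source is recorded as Database="CAS" on the <Xref> element, and the small example pathway (including its namespace URI and the CAS number for glycine) is made up for illustration. Tags are matched on their local name, so the sketch does not depend on the exact GPML namespace.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip an XML namespace: '{http://...}DataNode' becomes 'DataNode'."""
    return tag.rsplit("}", 1)[-1]


def find_cas_datanodes(gpml_text, database="CAS"):
    """Return (GraphId, TextLabel, ID) for each DataNode whose Xref uses `database`."""
    root = ET.fromstring(gpml_text)
    hits = []
    for element in root.iter():
        if local_name(element.tag) != "DataNode":
            continue
        for child in element:
            if local_name(child.tag) == "Xref" and child.get("Database") == database:
                hits.append(
                    (element.get("GraphId"), element.get("TextLabel"), child.get("ID"))
                )
    return hits


# Hypothetical two-node pathway, for illustration only.
EXAMPLE = """<Pathway xmlns="http://pathvisio.org/GPML/2013a" Name="Example">
  <DataNode TextLabel="Acetyl-CoA" GraphId="c7c" Type="Metabolite">
    <Xref Database="ChEBI" ID="CHEBI:15351" />
  </DataNode>
  <DataNode TextLabel="Glycine" GraphId="ff7" Type="Metabolite">
    <Xref Database="CAS" ID="56-40-6" />
  </DataNode>
</Pathway>"""

if __name__ == "__main__":
    for graph_id, label, cas_number in find_cas_datanodes(EXAMPLE):
        print(graph_id, label, cas_number)
```

The resulting list still needs the manual verification and ChEBI lookup described in the other steps; the sketch only collects the candidates.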

Report

$DETAILS OF ACTUALLY PERFORMING THINGS

Conclusion: $CONCLUSIONS

# EW7: converting metabolite Labels into DataNodes in WikiPathways GPML

Hypothesis: The GPML format has sufficient information to convert a metabolite encoded as a Label into a DataNode with an identifier.

Start date: 2014-09-04 End date: 2014-09-06

Description:

The GPML format is used by WikiPathways to internally store pathways. The format is human-readable, which allows missing information to be added by hand. In particular, metabolites encoded as <Label> elements can be converted into <DataNode> elements. Lists of <Label> elements that are candidates for conversion are outlined in other experiments, such as EW6.

For example:

 <Label TextLabel="Acetyl-CoA" GraphId="c7c">
  <Graphics CenterX="150.0" CenterY="640.0" Width="90.33333333333333" Height="19.0" ZOrder="28672" FillColor="ffffff" FontWeight="Bold" FontSize="12" Valign="Middle" />
 </Label>

This can be converted into:

 <DataNode TextLabel="Acetyl-CoA" GraphId="c7c" Type="Metabolite">
  <Graphics CenterX="150.0" CenterY="640.0" Width="90.33333333333333" Height="19.0" ZOrder="28672" FillColor="ffffff" FontWeight="Bold" FontSize="12" Valign="Middle" />
  <Xref Database="ChEBI" ID="CHEBI:15351" />
 </DataNode>

Methods

• Open a WikiPathways page in the MediaWiki edit mode
• Cut out the one or more <Label> elements to be converted
• Change the start and end tags from Label to DataNode
• Add the Type="Metabolite" attribute (with value)
• Add an <Xref> child element, preferably with an identifier for that metabolite
• Place the new <DataNode> elements just above the first <Interaction> element
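
I did these edits by hand, but the same steps could be scripted. The following Python sketch is one possible way to do it, not the method I used: the function name label_to_datanode and the tiny example pathway are hypothetical, and the example omits the namespace declaration that real GPML files carry (the helper keeps whatever namespace the <Label> had).

```python
import xml.etree.ElementTree as ET


def split_tag(tag):
    """Split '{namespace}Name' into ('{namespace}', 'Name'); namespace may be ''."""
    if tag.startswith("{"):
        namespace, name = tag[1:].split("}", 1)
        return "{" + namespace + "}", name
    return "", tag


def label_to_datanode(root, graph_id, database, identifier):
    """Convert the <Label> with the given GraphId into a Metabolite <DataNode>."""
    for label in list(root):
        namespace, name = split_tag(label.tag)
        if name == "Label" and label.get("GraphId") == graph_id:
            break
    else:
        raise ValueError("no Label with GraphId " + graph_id)

    root.remove(label)
    label.tag = namespace + "DataNode"      # GraphId and Graphics are left untouched
    label.set("Type", "Metabolite")
    xref = ET.SubElement(label, namespace + "Xref")
    xref.set("Database", database)
    xref.set("ID", identifier)

    # Re-insert just above the first <Interaction>, or at the end if there is none.
    position = next(
        (i for i, child in enumerate(root) if split_tag(child.tag)[1] == "Interaction"),
        len(root),
    )
    root.insert(position, label)
    return label


# Hypothetical minimal pathway, for illustration only.
EXAMPLE = """<Pathway Name="Example">
  <Interaction GraphId="id1" />
  <Label TextLabel="Acetyl-CoA" GraphId="c7c">
    <Graphics CenterX="150.0" CenterY="640.0" Width="90.33333333333333" Height="19.0" />
  </Label>
</Pathway>"""

if __name__ == "__main__":
    pathway = ET.fromstring(EXAMPLE)
    label_to_datanode(pathway, "c7c", "ChEBI", "CHEBI:15351")
    print(ET.tostring(pathway, encoding="unicode"))
```

The output should still be checked by eye before saving it back to WikiPathways, since the sketch does no schema validation of its own.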

Report

Many pathways have been updated using this approach in the past, but I had not previously written up the method I used. In the past few days, these are example pathways updated this way:

When there are many <Label> elements to be converted, I commonly use a plain text editor and its “replace” functionality.

It should be noted that graph identifiers do not get changed, so that links between elements in the GPML are preserved.

Conclusion: This method requires experience with manually editing XML files; the risk is that you break the GPML file, though the WikiPathways interface does validate the file against the GPML XML Schema before saving.

# EW6: Finding nodes in Rattus norvegicus pathways with IUPAC names

Hypothesis: Rattus norvegicus pathways in WikiPathways have DataNodes with labels containing IUPAC names which can be tagged as type Metabolite.

Start date: 2014-09-05 End date: 2014-09-05

Description:

WikiPathways entries in GPML have DataNode objects and Label objects. It was found before [here, here] that metabolites can be encoded in pathways as Label objects; they are then not machine-readable as Metabolite-type DataNodes and cannot carry a database identifier. As such, these metabolites are unusable for pathway analysis of metabolomics data.

By processing these GPML files (they are XML-based) and iterating over all Labels, we can attempt to convert each label into a chemical structure with OPSIN. This rests on the assumption that if OPSIN can parse the label into a structure, the label is a chemical name. Such a label is recorded along with the pathway identifier for manual inspection. For each structure, a ChemSpider identifier is also looked up.

Methods

Unchanged protocol.

• Get a working Bioclipse development version (hard) with the OPSIN, InChI, and ChemSpider extensions
• A Groovy script to iterate over the GPML, find <Label> elements
• Each <Label> is parsed with OPSIN and if successful, generate an InChI
• Use the InChIs to find ChemSpider identifiers
• Output all as a text file and open metabolites in a Structure table
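
The core loop of this protocol can be illustrated outside Bioclipse. The Python sketch below is not the Groovy script used here: OPSIN is replaced by a hypothetical toy_parser that only knows two names (with the InChIKeys reported for them in EW4), and the example pathway with its "Nucleus" label is made up. Any callable that returns a structure, or None when the name cannot be parsed, can be plugged in instead.

```python
import xml.etree.ElementTree as ET


def candidate_metabolites(pathway_id, gpml_text, parse_name):
    """Return (pathway, GraphId, label, structure) for Labels that parse as a name.

    parse_name is a stand-in for OPSIN: it takes the label text and returns a
    structure representation, or None when the text is not a chemical name.
    """
    root = ET.fromstring(gpml_text)
    found = []
    for element in root.iter():
        if element.tag.rsplit("}", 1)[-1] != "Label":   # ignore the namespace
            continue
        text = element.get("TextLabel", "")
        structure = parse_name(text)
        if structure is not None:
            found.append((pathway_id, element.get("GraphId"), text, structure))
    return found


# Hypothetical stand-in for OPSIN plus InChI generation.
KNOWN_NAMES = {
    "serine": "MTCFGRXMJLQNBG-UHFFFAOYSA-N",
    "glycine": "DHMQDGOQFOQNFH-UHFFFAOYSA-N",
}


def toy_parser(name):
    return KNOWN_NAMES.get(name.strip().lower())


# Hypothetical pathway fragment, for illustration only.
EXAMPLE = """<Pathway Name="Example">
  <Label TextLabel="Serine" GraphId="b93" />
  <Label TextLabel="Nucleus" GraphId="a01" />
  <Label TextLabel="Glycine" GraphId="ff7" />
</Pathway>"""

if __name__ == "__main__":
    for row in candidate_metabolites("WP1230", EXAMPLE, toy_parser):
        print(" -> ".join(row))
```

Passing the parser in as an argument keeps the XML handling separate from the name-to-structure step, which is the part that depends on OPSIN in the real protocol.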

Report

As in the experiments for Anopheles gambiae and Homo sapiens, only curated pathways were analyzed, 143 in total, downloaded from WikiPathways.org on August 24. The Groovy script used is detailed in this experiment.

The script found 47 Labels that are possibly metabolites in 8 different rat pathways. The full list was uploaded to Gist.

Conclusion: Rat pathways also include metabolites encoded in GPML <Label> elements.

# EW5: Finding nodes in Homo sapiens pathways with IUPAC names

Hypothesis: Homo sapiens pathways in WikiPathways have DataNodes with labels containing IUPAC names which can be tagged as type Metabolite.

Start date: 2014-09-01 End date: 2014-09-01

Description: WikiPathways entries in GPML have DataNode objects and Label objects. It was found before [here] that metabolites can be encoded in pathways as Label objects; they are then not machine-readable as Metabolite-type DataNodes and cannot carry a database identifier. As such, these metabolites are unusable for pathway analysis of metabolomics data.

By processing these GPML files (they are XML-based) and iterating over all Labels, we can attempt to convert each label into a chemical structure with OPSIN. This rests on the assumption that if OPSIN can parse the label into a structure, the label is a chemical name. Such a label is recorded along with the pathway identifier for manual inspection. For each structure, a ChemSpider identifier is also looked up.

Methods

• Get a working Bioclipse development version (hard) with the OPSIN, InChI, and ChemSpider extensions
• A Groovy script to iterate over the GPML, find <Label> elements
• Each <Label> is parsed with OPSIN and if successful, generate an InChI
• Use the InChIs to find ChemSpider identifiers
• Output all as a text file and open metabolites in a Structure table

Report

As in the experiment for Anopheles gambiae, only curated pathways were analyzed, some 266 in total, downloaded from WikiPathways.org on August 24. The previous Groovy script was updated to point to the human pathways, and also to write the results to a file rather than to STDOUT. The new script was uploaded to myExperiment.org.

The script found 42 Labels that are possibly metabolites. The full list was uploaded to Gist. Again, labels were found which could not be linked to a single ChemSpider ID. For example, “5b-Pregnane-3,20-dione” results in these ChemSpider search hits: 21427590, 389575, 21232692, 21239075, 21237402. The result file also shows a few labels containing line breaks.

One metabolite was manually confirmed in WP1449: Imidazoquinolin. Interestingly, the Label was visually “connected” with “(anti-viral compounds)”, which has a ChEBI identifier and could also be converted to a DataNode of type Metabolite:

Most work, however, needs to be done in the Tryptophan metabolism pathway (WP465); many of its metabolites are not properly machine-readable.

Conclusion:

Human pathways also include metabolites encoded in GPML <Label> elements, even in the curated subset.

# EW4: Finding nodes in Anopheles gambiae pathways with IUPAC names

Hypothesis: Anopheles gambiae pathways in WikiPathways have DataNodes with labels containing IUPAC names which can be tagged as type Metabolite.

Start date: 2014-08-24 End date: 2014-08-24

Description: WikiPathways entries in GPML have DataNode objects and Label objects. It was found before [not published] that metabolites can be encoded in pathways as Label objects; they are then not machine-readable as Metabolite-type DataNodes and cannot carry a database identifier. As such, these metabolites are unusable for pathway analysis of metabolomics data.

By processing these GPML files (they are XML-based) and iterating over all Labels, we can attempt to convert each label into a chemical structure with OPSIN. This rests on the assumption that if OPSIN can parse the label into a structure, the label is a chemical name. Such a label is recorded along with the pathway identifier for manual inspection. For each structure, a ChemSpider identifier is also looked up.

Methods

• Get a working Bioclipse development version (hard) with the OPSIN, InChI, and ChemSpider extensions
• A Groovy script to iterate over the GPML, find <Label> elements
• Each <Label> is parsed with OPSIN and if successful, generate an InChI
• Use the InChIs to find ChemSpider identifiers
• Output all as a text file and open metabolites in a Structure table

Report

Twelve WikiPathways for Anopheles gambiae were downloaded as part of the analysis collection. In the future, uncurated pathways can also be included; these are expected to contain more metabolites not annotated as Metabolite type. A custom Groovy script for Bioclipse was used, based on a previous similar script available from myExperiment.org. The updated script has been made available on myExperiment.org too. The results of running this script are visible in the above screenshot.

Key calls to Bioclipse managers used in this script, in addition to using the Groovy XmlParser, are:

• cdk.createMoleculeList()
• opsin.parseIUPACName(name)
• inchi.generate(molecule)
• chemspider.resolve(inchiKey)

Four metabolites were found, in one pathway (WP1230):

Ag_One_Carbon_Metabolism_WP1230_68447.gpml: node b93 -> Serine -> MTCFGRXMJLQNBG-UHFFFAOYSA-N -> CSID: [597]
Ag_One_Carbon_Metabolism_WP1230_68447.gpml: node ff7 -> Glycine -> DHMQDGOQFOQNFH-UHFFFAOYSA-N -> CSID: [730]
Ag_One_Carbon_Metabolism_WP1230_68447.gpml: node c8c -> Deoxythymidine monophosphate -> WVNRRNJFRREKAR-UHFFFAOYSA-N -> CSID: [315142]
Ag_One_Carbon_Metabolism_WP1230_68447.gpml: node a47 -> Deoxyuridine monophosphate -> JSRLJPSBLDHEIO-UHFFFAOYSA-N -> CSID: [21537275, 668, 21230588]

Three metabolites have a single ChemSpider identifier, whereas one has three ChemSpider identifiers.
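
With only four result lines this tally is easy to do by eye, but for the longer result files of the later experiments a small parser helps. The Python sketch below is an assumption about how one might post-process the output, not part of the Bioclipse script; the regular expression simply follows the line format shown above.

```python
import re

# file: node <GraphId> -> <label> -> <InChIKey> -> CSID: [id, id, ...]
RESULT_LINE = re.compile(
    r"^(?P<file>\S+): node (?P<node>\S+) -> (?P<label>.+) -> "
    r"(?P<inchikey>\S+) -> CSID: \[(?P<csids>[^\]]*)\]$"
)


def parse_result_line(line):
    """Parse one result line into a dict; raise ValueError if it does not match."""
    match = RESULT_LINE.match(line.strip())
    if match is None:
        raise ValueError("unrecognized result line: " + line)
    record = match.groupdict()
    record["csids"] = [int(part) for part in record["csids"].split(",") if part.strip()]
    return record


RESULTS = """\
Ag_One_Carbon_Metabolism_WP1230_68447.gpml: node b93 -> Serine -> MTCFGRXMJLQNBG-UHFFFAOYSA-N -> CSID: [597]
Ag_One_Carbon_Metabolism_WP1230_68447.gpml: node ff7 -> Glycine -> DHMQDGOQFOQNFH-UHFFFAOYSA-N -> CSID: [730]
Ag_One_Carbon_Metabolism_WP1230_68447.gpml: node c8c -> Deoxythymidine monophosphate -> WVNRRNJFRREKAR-UHFFFAOYSA-N -> CSID: [315142]
Ag_One_Carbon_Metabolism_WP1230_68447.gpml: node a47 -> Deoxyuridine monophosphate -> JSRLJPSBLDHEIO-UHFFFAOYSA-N -> CSID: [21537275, 668, 21230588]
"""

records = [parse_result_line(line) for line in RESULTS.splitlines()]
# Labels that map to more than one ChemSpider identifier need manual inspection.
ambiguous = [record["label"] for record in records if len(record["csids"]) > 1]

if __name__ == "__main__":
    print(len(records), "metabolites,", len(ambiguous), "with several ChemSpider hits")
    for label in ambiguous:
        print("check manually:", label)
```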

Visual inspection of WP1230 (revision 68447) confirms our hypothesis:

Conclusion: Anopheles gambiae pathways indeed also include metabolites encoded in GPML <Label> elements.