# EW3: Exposing more Jena functionality in Bioclipse

Hypothesis: Jena functionality for triple set comparison can be exposed via Bioclipse script

Start date: 2014-08-20 End date: 2014-08-20

Description: Bioclipse in the development branch mostly uses Jena for handling RDF data. The “rdf” manager already exposes various features of Jena. Here we wish to expose the Jena functionality to make unions, intersections, and differences of two triple stores (“Model”s in Jena terms) and to use the experimental shortest path method from the OntTools class.

Methods

• define additional methods in the IRDFManager interface with the proper code
• write implementations of these methods in the RDFManager class
• publish the patches

Report: While I still had a development environment from this step, Bioclipse no longer properly booted. Between that experiment and this one various things happened:

1. Bioclipse had a new target platform
2. I moved to a 64bit operating system
3. I only reinstalled Java8

For resetting the target platform the normal protocol was used, though I had to repeat it a few times to get it fully working. As usual, I first had to ask Arvid in Uppsala before it really started working (#overlyhonestmethods). There may have been a confounding issue with not having the proper javax.xml.soap version in my installation; Arvid’s suggestion was to manually remove the java.xml plugin from the target platform, via the Content tab.

A further issue was found with Java 8, which has a different provider for the JavaScript scripting extension. As a result, the Bioclipse JavaScript console did not start. Apparently, my Java 8_11 installation in Eclipse does not provide any scripting environment (tested by asking the ScriptEngineManager for all engines; none were reported). Because the implementation is provided by Oracle as a separate jar, nashorn.jar, containing the open source Nashorn JavaScript engine that is developed in OpenJDK and replaces the Mozilla Rhino engine of earlier Java versions, I could include this jar in the Bioclipse plugin, solving these issues. Along with a few other patches, these tweaks are available in this branch on GitHub. These patches are not pushed for inclusion in the Bioclipse development branch.

The test suite was not extended and not run as “JUnit Plug-in Test” using Eclipse, because my development environment is not able to properly run these at this moment. Instead, the functionality was tested using the rdf manager from the JavaScript console with this script:

store = rdf.createInMemoryStore();

rdf.addObjectProperty(store,
  "http://example.com/#subject",
  "http://example.com/#predicate",
  "http://example.com/#object"
);
rdf.addObjectProperty(store,
  "http://example.com/#subject",
  "http://example.com/#predicate",
  "http://example.com/#object2"
);
secondStore = rdf.createInMemoryStore();
rdf.addObjectProperty(secondStore,
  "http://example.com/#subject",
  "http://example.com/#predicate",
  "http://example.com/#object"
);
rdf.addDataProperty(secondStore,
  "http://example.com/#subject",
  "http://example.com/#predicate",
  "someDataObject"
);

unionStore = rdf.union(store, secondStore);
diffStore = rdf.difference(store, secondStore);
intersectStore = rdf.intersection(store, secondStore);
rdf.asTurtle(diffStore);

This showed the expected results, with one exception: the Jena code makes default triples more visible. That is, converting the store to Turtle shows two triples, even though the store holds roughly 39 additional triples from the RDF and RDF Schema specifications. Weirdly, when making a union of store and secondStore, the number of triples increases to about 150, and converting this to Turtle does serialize all those RDF and RDF Schema triples. I have been unable to work around this behavior.
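For reference, the set semantics that the script above exercises can be sketched in plain JavaScript, outside Bioclipse. This is only an illustration under stated assumptions: triples are modelled as simple [subject, predicate, object] arrays, the helper names are hypothetical, and nothing here is the Jena or rdf manager implementation (in particular, it does not reproduce the extra RDF and RDF Schema triples).

```javascript
// Illustrative stand-in for a triple store: an array of
// [subject, predicate, object] arrays. Not the Jena Model API.
const tripleKey = (t) => JSON.stringify(t);

// All triples that occur in a, in b, or in both (duplicates removed).
function union(a, b) {
  const seen = new Map();
  for (const t of [...a, ...b]) seen.set(tripleKey(t), t);
  return [...seen.values()];
}

// Triples that occur in a but not in b.
function difference(a, b) {
  const inB = new Set(b.map(tripleKey));
  return a.filter((t) => !inB.has(tripleKey(t)));
}

// Triples that occur in both a and b.
function intersection(a, b) {
  const inB = new Set(b.map(tripleKey));
  return a.filter((t) => inB.has(tripleKey(t)));
}

const S = "http://example.com/#subject";
const P = "http://example.com/#predicate";
const store = [
  [S, P, "http://example.com/#object"],
  [S, P, "http://example.com/#object2"],
];
const secondStore = [
  [S, P, "http://example.com/#object"],
  [S, P, "someDataObject"], // a literal in the original script
];

const unionStore = union(store, secondStore);
const diffStore = difference(store, secondStore);
const intersectStore = intersection(store, secondStore);
console.log(unionStore.length, diffStore.length, intersectStore.length);
```

With these two stores, the difference holds only the triple ending in #object2 and the intersection only the triple ending in #object, which matches the two explicit statements one would hope to see in the Turtle output.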

The above three methods are easily wrapped, but the shortest path functionality requires an additional step: the OntTools return value is a toolkit-specific type (Path) and the rdf manager was designed to convert this to a java.util.List of Strings. This functionality too was tested via the JavaScript console:

store = rdf.createInMemoryStore();

rdf.addObjectProperty(store,
  "http://example.com/#subject",
  "http://example.com/#predicate",
  "http://example.com/#subject2"
);
rdf.addObjectProperty(store,
  "http://example.com/#subject2",
  "http://example.com/#predicate",
  "http://example.com/#subject3"
);

rdf.shortestPath(store,
  "http://example.com/#subject",
  "http://example.com/#subject3"
);

The manager currently provides two variants of this shortestPath() method: the version exemplified above and one that takes a fourth parameter, a String representation of a URI matching the only predicate that may be part of the path. Both methods were found to work as expected based on the above code. No application to larger data sets has been tried.
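To make the behaviour of the two variants concrete, here is a minimal breadth-first search sketch in plain JavaScript. It is not the Jena algorithm nor the manager code: the triple representation, the function signature, and the choice to return the node IRIs along the path (rather than whatever strings the rdf manager actually produces) are all assumptions made for illustration.

```javascript
// Breadth-first search over [subject, predicate, object] triples.
// Returns the node IRIs along a shortest path from start to end,
// or null when no path exists. onlyPredicate is optional; when given,
// only triples with that predicate may be part of the path.
function shortestPath(triples, start, end, onlyPredicate) {
  const previous = new Map([[start, null]]); // node -> node we came from
  const queue = [start];
  while (queue.length > 0) {
    const node = queue.shift();
    if (node === end) {
      // Walk back from end to start to rebuild the path.
      const path = [];
      for (let n = end; n !== null; n = previous.get(n)) path.unshift(n);
      return path;
    }
    for (const [s, p, o] of triples) {
      if (s !== node || previous.has(o)) continue;
      if (onlyPredicate !== undefined && p !== onlyPredicate) continue;
      previous.set(o, node);
      queue.push(o);
    }
  }
  return null;
}

const PRED = "http://example.com/#predicate";
const triples = [
  ["http://example.com/#subject", PRED, "http://example.com/#subject2"],
  ["http://example.com/#subject2", PRED, "http://example.com/#subject3"],
];

const path = shortestPath(
  triples,
  "http://example.com/#subject",
  "http://example.com/#subject3"
);
console.log(path);

// Restricting the path to a predicate that is not used (a made-up IRI)
// leaves no route between the two resources.
const restricted = shortestPath(
  triples,
  "http://example.com/#subject",
  "http://example.com/#subject3",
  "http://example.com/#otherPredicate"
);
console.log(restricted);
```

The scan over all triples per visited node keeps the sketch short; for larger data sets an index from subject to outgoing triples would be the obvious refinement.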

Resulting patches have been provided as a pull request.

Conclusion: Exposing the additional functionality yielded a more functional rdf manager with interesting new features.

# EW2: Validating and Reviewing RDF for Open PHACTS

Hypothesis: The proprietary RDF is valid and uses common ontologies

Start date: 2014-07-23 End date: 2014-07-23

Description:

For obvious reasons, this experiment will not disclose all details. It will outline, however, the steps I undertook to do the validation and evaluation.

Methods:

• determine format
• validate basic syntax
• inspect triple structure
• inspect all used ontologies
• match results against Open PHACTS habits

Report:

The provided RDF document is in the RDF/XML format. It validates as well-formed XML, with xmllint (Debian:libxml2-utils):

xmllint --noout file.rdf

The document is not linked to a DTD or XML Schema (as is common with RDF/XML). Parsing the library with rapper (Debian:raptor2-utils) does not find problems either:

cat file.rdf | rapper -t -q - . > /dev/null

Using the --count option, 72 triples are found in the sample RDF. The RDF was converted into Turtle with:

rapper -o turtle file.rdf > file.ttl

This resulted in a file with 101 lines.

Manual inspection of the Turtle file shows that it has nine resources of five different types. Resources are not formally typed using rdf:type, but the type is clear from the resource IRI. Most properties are provided as literals, including identifiers. The latter could use identifiers.org style identifiers, or RDF IRIs provided by upstream databases. The structure looks reasonable, with one type at the center, pointing to the four other types with four different predicates.
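The check for missing rdf:type statements was done by hand here, but it could be automated. The following plain JavaScript sketch shows the idea on hypothetical sample data; the triples, IRIs, and function name are made up for illustration and are not taken from the reviewed document.

```javascript
const RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type";

// Returns the subjects that never appear with an rdf:type statement.
// Triples are [subject, predicate, object] arrays.
function untypedSubjects(triples) {
  const subjects = new Set(triples.map(([s]) => s));
  for (const [s, p] of triples) {
    if (p === RDF_TYPE) subjects.delete(s);
  }
  return [...subjects].sort();
}

// Hypothetical sample data, not from the proprietary RDF.
const sample = [
  ["http://example.com/compound/1", "http://purl.org/dc/terms/title", "aspirin"],
  ["http://example.com/assay/7", RDF_TYPE, "http://example.com/onto#Assay"],
  ["http://example.com/assay/7", "http://purl.org/dc/terms/title", "binding assay"],
];

const untyped = untypedSubjects(sample);
console.log(untyped);
```

In this sample only the compound resource is reported, because the assay resource carries an explicit rdf:type statement.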

The document mostly uses a custom, undocumented ontology, where term IRIs have human-readable forms. Common ontologies used include Dublin Core and BIBO. Ontologies were looked up at the BioPortal project page (http://bioportal.bioontology.org/projects/Open_PHACTS); the BioAssay Ontology, ChEBI, and QUDT ontologies, as found there, are not used.

The expected VoID descriptions with provenance information are missing (see the Dataset Descriptions for the Open Pharmacological Space specification).

Conclusion:

The RDF is in good shape, but can be improved. It is valid and human readable. It should, however, make more use of ontologies already in use within Open PHACTS. Importantly, the data should be complemented with VoID descriptions.