Proposal Name: URIs, CVs, Namespace usage
Proposal Type: Improvement
Editor(s): Alan Ruttenberg
Status: Notes, Beginning Implementation
Input from: Matthias Samwald, Andrea Splendiani, Peter Karp, Gary Bader, Jonathan Rees, Emek Demir, Nicolas Le Novere
- Summary
- Desiderata
- Requirements
- Definitions
- Implementation
- Worked Examples
- Open Issues
- Expected growth and plan for growth
- Backward Compatibility
- Discuss the proposal
Summary
Regularize the use of URIs, controlled vocabularies (CVs), and namespaces in order to better facilitate data integration. This proposal is made relative to the DX proposal, and so necessarily doesn't handle the complete issue. However, it should be an improvement to the current spec without being burdensome.
Presentation probably does a better job than this proposal currently:
http://mumble.net/~alanr/cshl/URICV.htm
Desiderata
Some way to easily determine which CVs a provider has chosen to fill specific CV valued fields with (Peter Karp)
Xref/CV term URIs constructed in a standard way, e.g.
http://xref.biopax.org/xref/<db name goes here>#<id goes here> (unless the provider already has a published URI scheme)
Wiki-based registry of db/cv names so we all use the same name for the same db.
unification/relation moved from subtype of Xref to subproperty of XREF. UNIFICATION-XREF inverseFunctional captures meaning.
xref/cv get extra property to point to user friendly web page, if provider wants to provide it
use SKOS to embed CVs in BioPAX when providers wish to. Explain how to ignore them if you want to.
a way to handle biopax-level1:ID, biopax-level2:ID integration problems
Some properties become annotation properties.
Build it to work with DX - compromise where necessary, but give us experience for the future.
Providers put their content in their own namespace.
meta: Try to use keywords from
http://www.ietf.org/rfc/rfc2119.txt in writing the proposal
Some Principles (to enable provider-defined content without making it too hard to ignore it for those who would)
No new URIs in biopax-level2 namespace.
New classes and properties may be created - but may be ignored by some by ignoring the class definitions, and by ignoring non-biopax rdf:types.
Segregate any new classes in separate files, and use owl:import (define some standard annotation properties to put on these ontologies to identify the sorts of things in them).
Any instance types which are not in biopax-level2 namespace may be ignored. Every instance in primary file must have (at least) one type which is in the biopax-level2 namespace (and it must be serialized so that the biopax class is used as the rdfxml tag for the instance).
XML diehards can ignore: imports, any triples whose property isn't in the biopax-level2 namespace, all class definitions, and any rdf:type triples whose object isn't in the biopax-level2 namespace. (supply a sample program that filters these out)
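The sample filter called for above could look roughly like the following. This is only a sketch, not part of the proposal: it assumes the document has already been parsed into (subject, property, object) triples of plain URI strings, and the namespace URI in `BIOPAX_L2` and the function name `filter_for_xml_consumer` are assumptions chosen for illustration.

```python
from typing import Iterable, List, Tuple

Triple = Tuple[str, str, str]  # (subject, property, object) as plain strings

# Assumed namespace URI for biopax-level2; adjust to the release actually imported.
BIOPAX_L2 = "http://www.biopax.org/release/biopax-level2.owl#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
OWL_CLASS = "http://www.w3.org/2002/07/owl#Class"
OWL_IMPORTS = "http://www.w3.org/2002/07/owl#imports"


def filter_for_xml_consumer(triples: Iterable[Triple]) -> List[Triple]:
    """Keep only the triples a BioPAX-as-XML consumer needs to look at."""
    triples = list(triples)
    # Subjects declared as owl:Class are class definitions; drop everything about them.
    defined_classes = {s for s, p, o in triples if p == RDF_TYPE and o == OWL_CLASS}
    kept = []
    for s, p, o in triples:
        if s in defined_classes:
            continue  # part of a class definition
        if p == OWL_IMPORTS:
            continue  # imports may be ignored
        if p == RDF_TYPE:
            # keep only rdf:type triples whose object is a biopax-level2 class
            if o.startswith(BIOPAX_L2):
                kept.append((s, p, o))
            continue
        if p.startswith(BIOPAX_L2):
            kept.append((s, p, o))  # property is in the biopax-level2 namespace
    return kept
```

Because every instance in the primary file must carry at least one biopax-level2 type, each instance that survives this filter still has a type the consumer understands.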
Requirements
Minimal burden on those who still parse BioPAX using XML
Criteria for determining equality of external references, to ease data integration.
Enabling provider definition of classes, properties and hierarchically organized terms
Increased clarity on the meaning of rdf:id for external references
Explicit instructions on the use of namespaces in BioPAX documents.
Works with loosely coordinated providers.
A mechanism for providers to define which sources of terms are used as values of which properties.
Be able to define "home grown" controlled vocabularies using SKOS.
Definitions
BioPAX DX: An effort to design an intermediate release to provide for the short-term needs of data exchange between data providers.
BioPAX DX instance document: General practice in the past has been to have a BioPAX document, which either includes the BioPAX specification, or owl:imports it, and otherwise contains instances of BioPAX classes. In this proposal we will allow other OWL documents to be imported into this file, and so, to distinguish it from the imported documents, we name it the BioPAX DX instance document.
BioPAX OWL specification document: The OWL document which contains class definitions for BioPAX. Available for owl:importing from
http://www.biopax.org/release/
BioPAX DX definition document: An OWL document which may contain additional classes, properties, or instances (particularly instances of skos:Concept) which may be imported into a BioPAX DX document.
Term: The general word we will use for things that are defined externally to BioPAX, such as references to external database entries or controlled vocabularies from different sources, or for elements of controlled vocabularies defined by a BioPAX provider. skos:Concept
BioPAX DX provider document: An OWL file, with content that follows the BioPAX DX proposal. Either a BioPAX DX definition document, or BioPAX DX instance document.
Namespace: An initial substring of a URI. See
http://www.w3.org/TR/REC-xml-names/.
BioPAX Namespaces: Any namespace that starts with "http://www.biopax.org/"
BioPAX-as-XML consumer: A (typical?) BioPAX DX user, who processes OWL as if it were XML.
BioPAX skos subset: A subset of the SKOS ontology used in BioPAX
Implementation
BioPAX DX provider documents MUST NOT include definitions of instances, classes, or properties with rdf:IDs which are in a BioPAX namespace. Note: A consequence of this is that, unlike previous versions of BioPAX, the contents of the BioPAX OWL specification document may not be included in the BioPAX DX instance document. (5)
BioPAX DX instance documents MUST NOT include definitions of classes or properties. (1)
BioPAX DX instance documents MUST owl:import the BioPAX OWL specification document (5)
BioPAX DX instance documents MAY owl:import BioPAX DX definition documents. (3)
BioPAX-as-XML consumers MAY process only the BioPAX DX instance document, ignoring all documents that are owl:imported from it. (1)
BioPAX-as-XML consumers MAY ignore all property values in a BioPAX DX instance document which are the value of a property which is not from a BioPAX Namespace, or defined in the OWL specification. (1)
BioPAX-as-XML consumers MAY ignore rdf:type property values which name classes whose names are not defined in a BioPAX Namespace. (1)
The rdf:id for xref individuals is a URL, generally constructed in the following way (but see DatabaseRegistry for full details): (2,4,6)
The string "http://xref.biopax.org/xref/", followed by the canonical database name, followed by "#", followed by the database identifier. Here is an example.
<biopax-level2:xref rdf:about="http://xref.biopax.org/xref/entrez#7157">
<biopax-level2:DB rdf:datatype="&xsd;string">entrez</biopax-level2:DB>
<biopax-level2:ID rdf:datatype="&xsd;string">7157</biopax-level2:ID>
</biopax-level2:xref>
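As a rough illustration of the construction rule, a provider might build these URIs with a small helper like the one below. The function name `xref_uri` is hypothetical, and the percent-encoding of characters that are unsafe in a URI is an assumption of this sketch; DatabaseRegistry remains the authority on canonical names and on any special cases.

```python
from urllib.parse import quote

XREF_BASE = "http://xref.biopax.org/xref/"


def xref_uri(db_name: str, db_id: str) -> str:
    """Build an xref URI from a canonical database name and a database identifier."""
    if not db_name or not db_id:
        raise ValueError("both a database name and an identifier are required")
    # Assumption: percent-encode both parts so the result is always a valid URI.
    return f"{XREF_BASE}{quote(db_name, safe='')}#{quote(db_id, safe='')}"


print(xref_uri("entrez", "7157"))
```

Two providers who use the same canonical name and identifier then arrive at the same URI, which is what makes equality of external references easy to test.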
The class unificationXref is removed. (consequence of 2,4 and the fact that a relationshipXref and unificationXref might have same DB and ID)
A new property, UNIFICATION-XREF, inverseFunctional, subProperty of XREF is introduced. Briefly, the relationship which was previously expressed as individual XREF unificationXref is now expressed as individual UNIFICATION-XREF xref. (previous and 2)
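To make the effect of inverseFunctional concrete: two individuals that share a UNIFICATION-XREF value denote the same thing, so an integrator can merge them. The sketch below assumes the UNIFICATION-XREF triples have already been extracted as (entity, xref URI) pairs; the function name and the entity identifiers in the usage lines are invented for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple


def unify(pairs: Iterable[Tuple[str, str]]) -> List[Set[str]]:
    """Group entities that share at least one UNIFICATION-XREF value."""
    parent: Dict[str, str] = {}

    def find(x: str) -> str:
        # union-find lookup with path halving
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    first_entity_for_xref: Dict[str, str] = {}
    for entity, xref in pairs:
        root = find(entity)
        if xref in first_entity_for_xref:
            # same xref seen before: inverse functionality makes the entities equal
            other = find(first_entity_for_xref[xref])
            if other != root:
                parent[root] = other
        else:
            first_entity_for_xref[xref] = entity

    groups: Dict[str, Set[str]] = defaultdict(set)
    for entity in list(parent):
        groups[find(entity)].add(entity)
    return list(groups.values())


groups = unify([
    ("providerA:p53", "http://xref.biopax.org/xref/entrez#7157"),
    ("providerB:TP53", "http://xref.biopax.org/xref/entrez#7157"),
    ("providerB:MDM2", "http://xref.biopax.org/xref/entrez#4193"),
])
print(groups)
```

An OWL reasoner would draw the same conclusion from the property declaration alone; the explicit grouping is only for consumers that do not run one.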
Requirement 7 Peter Karp's example: "For field X of class C, we use GO terms as our controlled vocabulary".
Solution:
<owl:Ontology rdf:about="">
<TERM-USAGE rdf:resource="#terms1"/>
</owl:Ontology>
<termUsage rdf:ID="terms1">
<USES-TERMS-FROM rdf:datatype="&xsd;string">cco</USES-TERMS-FROM>
<FOR-PROPERTY rdf:resource="#CELLULAR-LOCATION"/>
<ON-CLASS rdf:resource="#physicalEntityParticipant"/>
</termUsage>
The termUsage instance contains the association relating property (X=CELLULAR-LOCATION), class (C=physicalEntityParticipant), and controlled vocabulary (GO=cco). The property TERM-USAGE, an annotation on the ontology, relates the ontology to the term usage specification.
Here are the details.
A class termUsage is defined: "Used to associate a vocabulary with a property and class that it is used for. A triple of Property, Class, and term source"
An annotation property TERM-USAGE is defined, to relate ontologies to instances of class termUsage.
An annotation property FOR-PROPERTY is defined, to relate termUsage instances to properties.
An annotation property USES-TERMS-FROM is defined, to relate termUsage instances to the names of term sources. Term source strings are filled with the canonical names of sources as listed on DatabaseRegistry.
An annotation property ON-CLASS is defined, to relate termUsage instances to classes.
If a provider intends to have a property filled from more than one source, multiple USES-TERMS-FROM properties MUST be placed on a single instance of termUsage. A termUsage MUST have a single FOR-PROPERTY value. If the terms can be used by multiple classes, then a termUsage MAY have multiple ON-CLASS properties. If the ON-CLASS property is not filled, the terms are intended to be used wherever the property can be used.
Each value of TERM-USAGE announces the provider's intent to fill values of a property with terms from a particular source, but carries no inferences. However, a BioPAX validator SHOULD flag cases where a property has values which do not come from one of the announced sources.
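The validator check described here might be sketched as follows. Everything in the block is illustrative: the `TermUsage` dataclass and `flag_unannounced` are hypothetical names, and the sketch assumes that the source of each value (for example, the database name embedded in its xref URI) has already been determined by the caller.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, List, Tuple


@dataclass(frozen=True)
class TermUsage:
    for_property: str                          # single FOR-PROPERTY value
    uses_terms_from: FrozenSet[str]            # one or more USES-TERMS-FROM names
    on_classes: FrozenSet[str] = frozenset()   # empty: wherever the property is used


def flag_unannounced(
    usages: Iterable[TermUsage],
    values: Iterable[Tuple[str, str, str]],
) -> List[Tuple[str, str, str]]:
    """Return the (class, property, source) values not covered by any announcement."""
    usages = list(usages)
    flagged = []
    for cls, prop, source in values:
        applicable = [
            u for u in usages
            if u.for_property == prop and (not u.on_classes or cls in u.on_classes)
        ]
        if not applicable:
            continue  # nothing was announced for this property on this class
        allowed = set().union(*(u.uses_terms_from for u in applicable))
        if source not in allowed:
            flagged.append((cls, prop, source))
    return flagged


usage = TermUsage("CELLULAR-LOCATION", frozenset({"cco"}),
                  frozenset({"physicalEntityParticipant"}))
print(flag_unannounced([usage], [
    ("physicalEntityParticipant", "CELLULAR-LOCATION", "cco"),
    ("physicalEntityParticipant", "CELLULAR-LOCATION", "homegrown"),
]))
```

The sketch is lenient in one respect: if several termUsage instances apply to the same property, it accepts a value from any of them rather than reporting the duplication.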
Note: Since all of these properties are annotation properties, their domain and range can't be specified in OWL-DL; however, we have the freedom to have the effective domain and range be anything, including classes and properties.
Peter Karp's examples:
For field Y of class B, we don't use any controlled vocabulary
If you simply use a set of strings, then turn it into a flat list using the method described below for creating a CV, give it a unique name, and register it in DatabaseRegistry, then use the above method.
For field Z of class A, we use our own home-brewed controlled vocabulary
Name and register the vocabulary on DatabaseRegistry and then use the above method.
[[ To be continued ]]
Worked Examples
INOH worked /Inoh to add additional properties.
Open Issues
Expected growth and plan for growth
None. It is expected that subsequent to DX there will be a new proposal, though some elements may be borrowed from this proposal.