Late 2005 Proposal for BioPAX Development/Discuss

DanCorwin, 4/10

I have posted a note on Self-Annotating Identifiers - a new way to model BioPAX vocabulary that could speed up and simplify the first two steps of the 4-part conversion process below. Step one would more extensively document BioPAX ontology IDs in ways relevant to suppliers. Step two would create a web API to accept and validate supplier-provided triples of them.

As I see it, a good workplan for speeding up DB conversions should look something like the following. All of it rests ON TOP OF any required upgrades to the 2.0 ontology:

1) DX will solicit stakeholder support and create initial BioPAX SAIDs for 2.x

2) If they do, I can supply a few JSPs to capture triples, with simple validation

3) Once they work, suppliers can write and call triple-uploading utility scripts

4) As they do, SW templates can map logs of such triples into proper BioPAX RDF
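As a rough illustration of the "simple validation" in step 2, here is a minimal sketch in Python rather than JSP. Everything in it is an assumption made for illustration: the `Triple` shape, the `validate` function, and the `KNOWN_PREDICATES` set, which is only a placeholder for the SAID list that step 1 would produce.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    subject: str    # supplier's own record ID
    predicate: str  # vocabulary term (placeholder for a SAID)
    obj: str        # value, or the ID of another record


# Placeholder vocabulary; the real list would come from step 1.
KNOWN_PREDICATES = {"type", "name", "left", "right"}


def validate(triple: Triple) -> list[str]:
    """Return a list of problems; an empty list means the triple passed."""
    problems = []
    if not triple.subject.strip():
        problems.append("empty subject")
    if triple.predicate not in KNOWN_PREDICATES:
        problems.append(f"unknown predicate: {triple.predicate}")
    if not triple.obj.strip():
        problems.append("empty object")
    return problems
```

Returning a list of problems instead of raising on the first one would let an upload page report every issue with a triple in a single response.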

These four tasks comprise a distributed resource conversion effort. It is not (yet) vast in scale, but like any other such project, to succeed it needs requirements, workplans, staff, design docs, release schedules, hosting, programming, QA, tech writing, and management. The goal should be a data conversion web site to provide ongoing support to BioPAX suppliers.

Stakeholders in the DX plan, especially consumers, should collectively sponsor this. Today, without it, each supplier is expected to convert data alone, planning and doing all the steps with no organized central help. Nobody should wonder why this process keeps failing. You get what you pay for.

By leading parallel efforts on the tasks outlined above, consumers could soon start getting RDF files of supplier triples that are BioPAX 2.x compliant. These would generate feedback and examples that pave the way for many more. The trick is to view bulk data conversion as the industrial-scale project that it properly is, and back it with tools and resources tailored to the task. The DXTrackWorkplan goals won't succeed unless that also occurs.

---

Hi Dan - the main issue, as you put it, is resources. I would argue that the current process for conversion is nearly as efficient as possible given the minimal resources currently available. This is because the data providers are the best at converting their data to other formats once a mapping is defined. The problems for conversion are:

  1. resources

  2. defining a mapping from the private data model to BioPAX (this requires BioPAX knowledge and can be efficiently done by knowledgeable, active BioPAX community members)

  3. writing a converter (there are many ways to do this and the choice is almost 100% dependent on the ability/knowledge of the developers assigned to the task - e.g. 50% of the database groups so far only have the ability to work with basic Perl or Python scripts and can't use XML-based tools, and basically none have been able to use Java, which is the language with the most libraries for working with OWL/XML)

  4. validating the data after conversion - this requires knowledge of BioPAX and can be done by the community (as has already been done)

  5. making it available for users to download via FTP or web. To put this into perspective, this part can take up to 6 months because the decision makers are busy.

It appears that the above suggestion addresses the conversion, but that is usually the least time consuming part of the process and easiest to do if a developer is available. Sure, as libraries for BioPAX mature, there will be more incentive for database groups to use these libraries rather than rolling their own solution. Right now, they can roll their own solution in one quarter of the time.

---

Hi Gary - 5/6 Thanks for the reply, and a good write-up on the current workflow. But far from being near-optimum, IMHO, it is a good checklist of broken steps in the process. Today, the middle step (3) is sandwiched between two others that BioPAX dictates. That virtually guarantees many suppliers will do nothing on it.

Step 3 needs to be split so that each provider deals with two simpler "converters". Let's call them CONVERTER-3A and CONVERTER-3B. Built by different people, they are the technical core of what Self-Annotating Identifiers proposes.
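One way to picture the split, as a sketch only: assume CONVERTER-3A is the supplier-side script that flattens private records into plain triples, and CONVERTER-3B is the centrally maintained piece that renders those triples as RDF. That division of labor is my reading of the 4-part workplan, not a spec. The function names, the record layout, and both `example.org` namespaces below are hypothetical, and the output is generic N-Triples rather than validated BioPAX.

```python
from typing import Iterable, Iterator, Tuple

Triple = Tuple[str, str, str]

# Hypothetical namespaces, for illustration only.
DATA_NS = "http://example.org/supplier/"
VOCAB_NS = "http://example.org/vocab#"


def converter_3a(records: Iterable[dict]) -> Iterator[Triple]:
    """Supplier side: flatten private records into plain triples."""
    for record in records:
        subject = record["id"]
        for key, value in record.items():
            if key != "id":
                yield (subject, key, str(value))


def converter_3b(triples: Iterable[Triple]) -> Iterator[str]:
    """Central side: render each triple as one N-Triples line (literal object)."""
    for subject, predicate, obj in triples:
        # Escape backslashes, quotes, and newlines for an N-Triples literal.
        literal = (
            obj.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")
        )
        yield f'<{DATA_NS}{subject}> <{VOCAB_NS}{predicate}> "{literal}" .'


records = [{"id": "rxn1", "name": "hexokinase reaction"}]
for line in converter_3b(converter_3a(records)):
    print(line)
```

The point of the sketch is that 3A needs no RDF or OWL knowledge at all, so it stays within reach of a group working with basic scripts, while 3B can be written once and shared.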

Splitting Step 3 also hugely simplifies Step 2. It puts the data provider in charge of mapping the data, and BioPAX.org in charge of releasing decent docs/tools to help. BOTH sides win on resource issues.

The big win, however, is at Step 5. With better data conversion workplans, able to publicly show incremental progress from many sources, decision makers get more incentives to commit their own time/people to helping, more credit for what they deliver, and lower risks. BioPAX.org no longer micromanages every part of the work, so they become more willing to act as our partners.

last edited 2006-06-06 13:04:39 by DanCorwin