Open Text Mining Interface

Nature magazine is suggesting a new “standard” for use by publishers (and others) to expose content to open text mining applications, and the March 23rd issue is available for testing. On Nascent, Nature’s blog on web technology and science, Timo Hannay presents an interesting look at their proposed OTMI (Open Text Mining Interface) standard, along with a bit of documentation. The illustration for this post is a snippet of a screen grab from the XML file of an article using the OTMI syntax. From the Nascent post:

Every now and then a scientist contacts Nature asking for a machine-readable copy of our content (i.e., the XML) to use in text-mining research. We’re usually happy to oblige, but there has to be a better way for everyone concerned, not least the poor researcher, who might have to contact any number of publishers and deal with many different content formats to conduct their work. Much better, surely, to have a common format in which all publishers can issue their content for text-mining and indexing purposes.

The Open Text Mining Interface (OTMI) is a suggestion from Nature about how we might achieve that. As described in my earlier post, I presented a brief summary of the idea at the Bio IT World conference in Boston. We’ve since been sharing the idea with some other publishers. This post is intended to provide a few written details and an update.

Our initial demo uses the 23 March issue of Nature (by happy coincidence, a wonderful special issue on the future of scientific computing). Embedded in the HTML of the abstract and full-text file for each article is a tag like this:

<link rel="OTMI" type="application/atom+xml" href="../otmi/otmi-nature04614.xml" />

which points to an OTMI file — a machine-readable representation of the text.
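For anyone who wants to try the discovery step, here is a minimal sketch in Python of how a client might find that link tag in an article page and resolve it to a full URL. It uses only the standard library. The names `OTMILinkFinder` and `find_otmi_urls` and the `www.example.org` page address are my own assumptions for illustration; they are not part of Nature's proposal, and the sketch only locates the OTMI file without saying anything about what is inside it.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class OTMILinkFinder(HTMLParser):
    """Collect the href of every <link rel="OTMI"> tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # Self-closing tags such as <link ... /> are also routed here.
        if tag != "link":
            return
        attr = dict(attrs)
        rel = (attr.get("rel") or "").lower()
        if rel == "otmi" and attr.get("href"):
            self.hrefs.append(attr["href"])


def find_otmi_urls(html, page_url):
    """Return absolute URLs of the OTMI files advertised by a page."""
    finder = OTMILinkFinder()
    finder.feed(html)
    # The href in the post is relative, so resolve it against the page URL.
    return [urljoin(page_url, href) for href in finder.hrefs]


# Hypothetical article page; only the link tag itself comes from the post.
sample_page = """
<html><head>
<link rel="OTMI" type="application/atom+xml" href="../otmi/otmi-nature04614.xml" />
</head><body>Abstract</body></html>
"""

otmi_urls = find_otmi_urls(
    sample_page, "http://www.example.org/journal/abs/nature04614.html"
)
print(otmi_urls)
```

Resolving against the page URL matters because the same relative href would point somewhere different from the abstract page and the full-text page if they lived in different directories.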

Yes, this is pretty cool but also a bit chilling. The basic idea here (as I understand it) is a mechanism that enables machines to “read” and “understand” (that is, “mine and/or index”) content while, at the same time, making sure that humans can’t simply grab the content and derive understanding from it (i.e., read it). Looked at another way, isn’t this a form of DRM?

If you read the comments following the Nascent post, one reader makes an interesting connection to this work:

http://www.starshiptitanic.com/novel/