In Out, In Out, Shake It All About

In a very abstract sense, text analysis can be divided into three main tasks: load some text, process it, and export the result. Out of the box, GATE (both the GUI and the API) provides excellent support for loading documents and processing them, but until now we haven't provided many options when it comes to exporting processed documents.

Traditionally GATE has provided two methods of exporting processed documents: a lossless XML format that can be reloaded into GATE but is rather verbose, or the "save preserving format" option, which essentially outputs XML representing the original document (i.e. the annotations in the Original markups set) plus the annotations generated by your application. Neither of these options was particularly useful if you wanted to pass the output on to some other process and, without a standard export API, this left people having to write custom processing resources just to export their results.

To improve the support for exporting documents, recent nightly builds of GATE now include a common export API in the gate.DocumentExporter class. Before we go any further, it is worth mentioning that this code is in a nightly build and so is subject to change before the next release of GATE. Having said that, I have now used it to implement exporters for a number of different formats, so I don't expect the API to change drastically.

If you are a GATE user, rather than a software developer, then all you need to know is that an exporter is very similar to the existing idea of document formats. This means that exporters are CREOLE resources, and so new exporters are made available by loading a plugin. Once an exporter has been loaded it will be added to the "Save as..." menu of both documents and corpora. By default, exporters for GATE XML and Inline XML (i.e. the old "Save preserving format") are provided even when no plugins have been loaded.

If you are a developer and want to make use of an existing exporter, then hopefully the API should be easy to use. For example, to get hold of the exporter for GATE XML and to write a document to a file, the following two lines will suffice:
DocumentExporter exporter =
   DocumentExporter.getInstance("gate.corpora.export.GateXMLExporter");

exporter.export(document, file);
There is also a three-argument form of the export method that takes a FeatureMap that can be used to configure an exporter. For example, the annotation types the Inline XML exporter saves are configured this way. The possible configuration options for an exporter should be contained in its documentation, but possibly the easiest way to see how it can be configured is to try it from the GUI.
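
As a rough illustration of how an exporter might consume such options, here is a small self-contained sketch that does not depend on GATE. A plain java.util.Map stands in for the FeatureMap, and the "annotationTypes" key, the class name and the default values are all assumptions made for the example rather than the documented options of any real exporter.

```java
import java.util.Collection;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Stand-in sketch: none of these names come from GATE itself.
class ExportOptionsSketch {

  // Hypothetical option key; check the exporter's documentation for real ones.
  static final String TYPES_KEY = "annotationTypes";

  // Returns the annotation types to export: the configured ones if the
  // option is present and is a collection, otherwise the supplied defaults.
  static Set<String> typesToExport(Map<?, ?> options, Set<String> defaults) {
    Object configured = (options == null) ? null : options.get(TYPES_KEY);
    if (!(configured instanceof Collection)) {
      return defaults;
    }
    Set<String> types = new LinkedHashSet<>();
    for (Object type : (Collection<?>) configured) {
      types.add(String.valueOf(type));
    }
    return types;
  }

  public static void main(String[] args) {
    Map<String, Object> options = Map.of(TYPES_KEY, List.of("Person", "Location"));
    // Configured types win over the defaults
    System.out.println(typesToExport(options, Set.of("Token")));
    // No options at all: fall back to the defaults
    System.out.println(typesToExport(null, Set.of("Token")));
  }
}
```

Falling back to defaults when an option is missing means callers of the two-argument form and callers who pass an empty map get the same behaviour.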

If you are a developer and want to add a new export format to GATE, then this is fairly straightforward; if you already know how to produce other GATE resources then it should be really easy. Essentially you need to extend gate.DocumentExporter and provide an implementation of its one abstract method. A simple example showing an exporter for GATE XML is given below:
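package gate.corpora.export;

import gate.Document;
import gate.DocumentExporter;
import gate.FeatureMap;
import gate.corpora.DocumentStaxUtils;
import gate.creole.metadata.AutoInstance;
import gate.creole.metadata.CreoleResource;
import java.io.IOException;
import java.io.OutputStream;
import javax.xml.stream.XMLStreamException;

@CreoleResource(name = "GATE XML Exporter",
   tool = true, autoinstances = @AutoInstance, icon = "GATEXML")
public class GateXMLExporter extends DocumentExporter {

  public GateXMLExporter() {
    // format name, file extension, MIME type
    super("GATE XML", "xml", "text/xml");
  }

  @Override
  public void export(Document doc, OutputStream out, FeatureMap options)
          throws IOException {
    try {
      DocumentStaxUtils.writeDocument(doc, out, "");
    } catch(XMLStreamException e) {
      // rethrow as IOException, the only checked exception export declares
      throw new IOException(e);
    }
  }
}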
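
The general shape of such a class, one abstract stream-based method with file handling layered on top, can be sketched without GATE on the classpath. What follows is a simplified stand-in and not the real gate.DocumentExporter: the names SketchExporter and UpperCaseExporter, the use of a String in place of a Document, and the way the file-based methods delegate to the stream-based one are all assumptions made for illustration.

```java
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Collections;
import java.util.Locale;
import java.util.Map;

// Simplified stand-in for an exporter base class; not GATE's own code.
abstract class SketchExporter {

  // The single method a concrete exporter has to implement.
  abstract void export(String doc, OutputStream out, Map<String, Object> options)
      throws IOException;

  // Convenience form: export to a file with no options.
  void export(String doc, File file) throws IOException {
    export(doc, file, Collections.emptyMap());
  }

  // Convenience form: opens the file and delegates to the stream method.
  void export(String doc, File file, Map<String, Object> options)
      throws IOException {
    try (OutputStream out = new FileOutputStream(file)) {
      export(doc, out, options);
    }
  }
}

// Hypothetical concrete exporter that writes the text in upper case.
class UpperCaseExporter extends SketchExporter {
  @Override
  void export(String doc, OutputStream out, Map<String, Object> options)
      throws IOException {
    out.write(doc.toUpperCase(Locale.ROOT).getBytes(StandardCharsets.UTF_8));
  }
}
```

With this layout a new format only has to say how a document becomes bytes; opening and closing files is handled once in the base class.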
As I said earlier, this API is still a work in progress and won't be frozen until the next release of GATE, but the current nightly build now contains export support for Fast Infoset compressed XML (I've talked about this before), JSON inspired by the format Twitter uses, and HTML5 Microdata (an updated version of the code I discussed before). A number of other exporters are also under development and will hopefully be made available shortly.

If you use GATE, hopefully you will find this new support useful. Please do send us any feedback you have so we can improve the support before the next release, when the API will be frozen.
