Code from an English Coffee Drinker: February 2011

More Ice In Your Tea?

I really shouldn't blog when I'm angry or annoyed as I tend to rant a little more than I intend! In retrospect I was a little harsh in my last post -- anyone who freely gives their time to developing free software shouldn't have to put up with me disparaging their work.

So as penance I've now tracked down the source of the weird class loading bug I highlighted and have submitted a detailed bug report, including a proposed fix, to the IcedTea netx project (netx is the name of the open-source Web Start replacement). You can monitor the progress of the bug through their public bug tracker. If I had the right permissions it's such a simple fix that I'd be happy to do it myself, but you have to earn the respect of project maintainers before getting the right to commit code.

Update, 23rd February: it's now been fixed in the main code tree, although it will take a while before it makes it into an Ubuntu update.

Why You Shouldn't Drink The IcedTea

I'm all for supporting open-source software but there are limits. I've recently switched to using Ubuntu on my main machine at home and have run into two bugs in the same piece of open-source software.

If you are a regular reader of this blog then you are probably aware that I do most of my software development using Java. A default install of Ubuntu (10.10) includes the OpenJDK based IcedTea version of Java 6. This is a version of Java that is covered by an open-source licence -- in contrast to the Sun/Oracle version of Java, for which you can read the source but which was not covered by an open-source licence (it's now "mostly" covered by GPL v2 with the classpath exception). I've never really understood the philosophical argument behind IcedTea and the need for a clean room implementation of Java, although Oracle's recent attack on Android provides some explanation. Anyway, given that it was the default installation of Java I was willing to give it a try. Within minutes though I'd found two show-stopping bugs and so have switched back to using the reliable Sun/Oracle release of Java 6.

The first bug is visual and one that I knew existed in earlier versions of IcedTea but which I hoped had been fixed by now. In essence the ImageIO JPEG reader in IcedTea doesn't properly handle JPEG images with embedded colour profiles. What you end up with is an image that looks like a photographic negative rather than the image you tried to load. This bug basically means that you can't use IcedTea for any application that allows users to load arbitrary JPEG files. For me this means I can't recommend it for running Convert4Frame, TagME, PhotoGrid or 3DAssembler. Also I can't use IcedTea to run the Tomcat server in which I host my cookbook web app. What is really annoying about this bug is that it was originally in the main Sun/Oracle distribution, reported all the way back in 2003, but was fixed in Java 5 update 4; you can read all about it in the bug report. If the open-source version can't fix a bug that is around eight years old then why do they even bother?

The second bug is a little stranger but no less annoying. The documentation for the method ClassLoader.loadClass(String name) states that either it returns the resulting Class object or throws a ClassNotFoundException if (wait for it) the class was not found. That all seems perfectly logical to me. Unfortunately there appears to be at least one situation in which IcedTea returns null instead of throwing an exception when the class cannot be found.

I distribute a lot of the open-source Java software that I develop in my spare time via Web Start and once I had Ubuntu up and running I thought I'd check Java by launching 3DAssembler. Unfortunately it failed to load and gave me a rather strange NullPointerException. After a bit of digging around (the version of the app on my website doesn't match my development version and hence the line numbers were out) I eventually tracked the problem back to this try/catch block.

try {
  Class rmClass = Assemble3D.class.getClassLoader().loadClass("org.jdesktop.swinghelper.debug.CheckThreadViolationRepaintManager");
  RepaintManager.setCurrentManager((RepaintManager)rmClass.getConstructor().newInstance());
  System.err.println("EDT Debug Mode Is Active");
}
catch (ClassNotFoundException e) {
  // the debug classes from SwingHelper are not available
}

This code tries to load a class, via reflection, that catches EDT violations (painting Swing components from the wrong thread) and that I only use during development to aid in debugging. I load the class via reflection so that when I distribute the application I can simply leave out the JAR file containing the debug class and everything will continue to work -- the class isn't found so an exception is thrown, caught and ignored and the application continues on. The problem with IcedTea is that when running as a Web Start application the call to loadClass in line 2 returns null instead of throwing a ClassNotFoundException. This means that the catch block isn't triggered; instead the next line dereferences the null and the resulting NullPointerException is thrown all the way out of the main method, killing the application. It seems to only be a Web Start issue as running my development copy locally under IcedTea doesn't cause loadClass to return null. Of course I can fix this problem by changing the catch block to trap all exceptions, but the point is I shouldn't have to!
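For anyone hitting the same problem, here is a minimal sketch of a more defensive way to do the optional load. It treats a null return from loadClass the same as a ClassNotFoundException. The class name OptionalClassLoading and the method tryLoad are made up for this illustration, and Optional needs Java 8 or later, so this is not the code that 3DAssembler actually uses.

```java
import java.util.Optional;

final class OptionalClassLoading {

    /**
     * Tries to load a class by name. Both a ClassNotFoundException and a
     * null return value (which the ClassLoader contract does not allow,
     * but which a misbehaving loader may produce) count as "not available".
     */
    static Optional<Class<?>> tryLoad(ClassLoader loader, String className) {
        try {
            // ofNullable turns an unexpected null into an empty Optional
            return Optional.ofNullable(loader.loadClass(className));
        } catch (ClassNotFoundException e) {
            // the normal, documented way of signalling a missing class
            return Optional.empty();
        }
    }

    public static void main(String[] args) {
        ClassLoader loader = OptionalClassLoading.class.getClassLoader();
        Optional<Class<?>> debugClass = tryLoad(loader,
                "org.jdesktop.swinghelper.debug.CheckThreadViolationRepaintManager");
        System.err.println(debugClass.isPresent()
                ? "Debug classes found"
                : "Debug classes not available");
    }
}
```

The caller then only instantiates the repaint manager when the Optional is present, so a loader that wrongly returns null can no longer cause a NullPointerException.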

As I said at the beginning of this post I'm all for open-source software, but I believe there are cases where developers who give their time freely to projects should think more about the merits of the project and if it is really needed. The "official" Oracle release of Java is now, for all intents and purposes, under an open-source licence for the development of desktop applications (mobile and embedded uses are a different kettle of fish). Given this, is there really any need for a clean room implementation, especially when that implementation is so buggy as to render it useless in many situations?

What's Actually Worth Reading?

Another day, another GATE processing resource -- as you can tell I've been busy tidying up the PRs that I've developed recently. One of the reasons for this spurt of cleaning and documenting code is that a project I'm currently working on is ending soon and the information extraction pipeline we have developed needs to be fully documented. Being able to just point to multiple sections of the GATE user guide for more details on each PR in the application makes the documentation much easier to write. Of course that means that the PRs have to actually have documentation in the user guide!

I won't go into details about the project I'm currently working on with The National Archives (if you want the details then there was a press release and the head of the GATE group, i.e. my boss, has blogged about it); suffice it to say that it involves processing millions of web pages drawn from hundreds of different web sites.

We can extract an awful lot of information from the web pages we are processing, so much so in fact that it can be difficult to search through the information. We have multiple tools to help with searching but one thing we quickly realised is that it would be nice to ignore information extracted from boilerplate content. Most web pages contain text that isn't really part of the content: headers, menus, navigation links etc. These sections can contain entities that we might extract but it is highly unlikely that they will be relevant to the main content of the page. For this reason it would be nice to be able to exclude these in some way when searching through the extracted information.

The approach we chose was to keep everything extracted using the IE pipeline but to also determine the sections of the document that were actually content. This allows us to search for entities within content. It also means that if our ability to determine what is useful content and what isn't is flawed in any way we have still extracted the entities appearing in other parts of the document.
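To make the idea concrete, here is a rough sketch of searching for entities within content. It uses plain character offsets rather than real GATE annotations; the ContentFilter and Span names are invented for this example and are not part of GATE or the new PR.

```java
import java.util.ArrayList;
import java.util.List;

final class ContentFilter {

    /** A labelled character span [start, end); a hypothetical stand-in for an annotation. */
    static final class Span {
        final int start;
        final int end;
        final String label;

        Span(int start, int end, String label) {
            this.start = start;
            this.end = end;
            this.label = label;
        }

        /** True if the other span lies completely inside this one. */
        boolean contains(Span other) {
            return start <= other.start && other.end <= end;
        }
    }

    /**
     * Returns only those entities that fall inside at least one content span.
     * The input list is left untouched, so nothing that was extracted is lost.
     */
    static List<Span> withinContent(List<Span> entities, List<Span> content) {
        List<Span> kept = new ArrayList<>();
        for (Span entity : entities) {
            for (Span section : content) {
                if (section.contains(entity)) {
                    kept.add(entity);
                    break; // one enclosing content section is enough
                }
            }
        }
        return kept;
    }
}
```

Because the filtering happens at search time, entities found in menus and footers are still there if the content detection turns out to be wrong for a particular page.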

Rather than implementing a content detection system from scratch I decided to base the PR on an existing Java library called boilerpipe. The boilerpipe library contains a number of different algorithms for detecting content, most of which are available through the new GATE PR. Some features are not available because it is currently not possible to map them directly to a GATE document.

To give you a better idea of what the new PR does here is a screen shot of a web page loaded into both a browser and GATE. In the GATE window you can see the pink sections that have been marked as content (click on the image for a larger easier to read version).

Whilst this kind of approach is never going to be perfect it seems, from initial testing, that it does indeed help to filter out erroneous results when searching through information extracted from large web based corpora.

If you want to try it out yourself then it's already in the main GATE svn repository and the nightly builds. Details of how to configure the PR can be found in the relevant section of the GATE user guide.

Numbers Have Real Value

So here is a question for you...

What do the following numbers all have in common? 3^2, 2³, 101, 3.3e3, 1/4, 9^1/2, 4x10^3, 5.5*4^5, thirty one, three hundred, four thousand one hundred and two, 3 million, and fünfundzwanzig.

The answer is that they can all be recognized, annotated and converted to a real number representation (a Java Double) by a new GATE PR that has just been released and that I've just finished documenting for the user guide. You may never have really thought about this before but it turns out that there are so many ways of writing numbers in text that recognising them is actually really quite difficult. If you also want to know the value of the number you have recognised then this adds an extra layer of complexity especially when the number is written out in words rather than digits.
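To give a flavour of why numbers written in words are awkward, here is a small sketch that handles a few of the English examples above. It is not how the Numbers Tagger is implemented; the NumberWords class and its parse method are invented for this post, and it assumes well-formed input made up of English number words, "and", and plain digits.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

final class NumberWords {

    private static final Map<String, Long> UNITS = new HashMap<>();
    private static final Map<String, Long> MULTIPLIERS = new HashMap<>();

    static {
        String[] small = {"zero", "one", "two", "three", "four", "five", "six",
            "seven", "eight", "nine", "ten", "eleven", "twelve", "thirteen",
            "fourteen", "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"};
        for (int i = 0; i < small.length; i++) {
            UNITS.put(small[i], (long) i);
        }
        String[] tens = {"twenty", "thirty", "forty", "fifty", "sixty",
            "seventy", "eighty", "ninety"};
        for (int i = 0; i < tens.length; i++) {
            UNITS.put(tens[i], (i + 2) * 10L);
        }
        MULTIPLIERS.put("thousand", 1_000L);
        MULTIPLIERS.put("million", 1_000_000L);
        MULTIPLIERS.put("billion", 1_000_000_000L);
    }

    /** Converts text such as "four thousand one hundred and two" or "3 million" to a double. */
    static double parse(String text) {
        double total = 0;   // sum of the groups already closed by a multiplier
        double current = 0; // the group currently being built (always below 1000 for words)
        for (String token : text.toLowerCase(Locale.ROOT).trim().split("[\\s-]+")) {
            if (token.equals("and")) {
                continue; // "one hundred and two": the "and" carries no value
            } else if (UNITS.containsKey(token)) {
                current += UNITS.get(token);
            } else if (token.equals("hundred")) {
                current = (current == 0 ? 1 : current) * 100;
            } else if (MULTIPLIERS.containsKey(token)) {
                total += (current == 0 ? 1 : current) * MULTIPLIERS.get(token);
                current = 0;
            } else {
                // anything else must be digits; throws NumberFormatException otherwise
                current += Double.parseDouble(token);
            }
        }
        return total + current;
    }
}
```

Even this toy version needs special cases for "and", "hundred" and the larger multipliers, and it knows nothing about exponents, fractions or other languages.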

The PR actually started life back in 2009 for recognising numbers in patent documents as a precursor to recognising and normalizing measurements but since then has seen lots of development to extend the range of numbers that can be recognised. This new version is being used on a number of projects both to recognise numbers simply for the sake of finding numbers but also to help find drug doses, government spending and lots of generic measurements.

Requests for code to recognise numbers and determine their value have cropped up a number of times on the GATE mailing list and whilst we had been using this code internally for a while we knew that there were issues with it and it had never been tidied up or documented to the extent where we would be happy to show it to other people! Having discovered yet-another-bug in the code a fortnight ago I decided to take the time to rewrite large chunks of the code in order to fix most of the outstanding issues and to increase the range of numbers we could recognise. Hopefully this has led to a more useful PR. If you'd like to try it out then you can find this PR in the Tagger_Numbers plugin within the main GATE svn repository and it's in the nightly builds as well.

The plugin actually contains two PRs: Numbers Tagger and Roman Numerals Tagger. As you can guess by the name this second PR annotates Roman numerals appearing in documents. As with the main PR this also calculates the numeric value of the Roman numerals. I'm guessing that this PR is probably less useful than the main Numbers Tagger but we have found it to be helpful in the past when trying to recognise document sections, tables, figures etc. which can often be labelled with Roman numerals instead of Arabic numbers, e.g. Section VI, Table IV, Figure IIIa. If you are interested in the Roman Numerals Tagger then you can find more details in the user guide.
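Working out the value of a Roman numeral is the easy part, as this small sketch shows. It is only an illustration and not the code from the plugin; the RomanNumerals class is invented here, and it assumes the input is a well-formed numeral with no trailing letters (so "III" rather than "IIIa").

```java
import java.util.Locale;

final class RomanNumerals {

    /** Value of a single Roman numeral symbol. */
    private static int valueOf(char symbol) {
        switch (symbol) {
            case 'I': return 1;
            case 'V': return 5;
            case 'X': return 10;
            case 'L': return 50;
            case 'C': return 100;
            case 'D': return 500;
            case 'M': return 1000;
            default:
                throw new IllegalArgumentException("Not a Roman numeral symbol: " + symbol);
        }
    }

    /**
     * Converts a Roman numeral to its integer value. A symbol that is
     * smaller than the one following it is subtracted (the I in IV),
     * otherwise it is added. No check is made that the numeral is well formed.
     */
    static int toInt(String roman) {
        String symbols = roman.toUpperCase(Locale.ROOT);
        int total = 0;
        for (int i = 0; i < symbols.length(); i++) {
            int value = valueOf(symbols.charAt(i));
            boolean subtractive = i + 1 < symbols.length()
                    && value < valueOf(symbols.charAt(i + 1));
            total += subtractive ? -value : value;
        }
        return total;
    }
}
```

The hard part in real documents is deciding whether a token is a Roman numeral at all, for example telling the numeral I from the pronoun, and splitting off suffixes such as the "a" in Figure IIIa.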