Browser Detection: How Not To Do It

Web browsers used to behave so differently from one another that it was common to write code to detect which browser type and version was being used to display the page so that appropriate code could be run. Fortunately, modern browsers tend to support web standards better and so apart from a few CSS tweaks it is unusual to come across browser detection code (the one exception to this rule being the expense system I have to use at work which doesn't seem to like Firefox at all). So I was a little surprised the other day when I suddenly found myself bounced to a an unsupported browser page on a shopping site that I'd been browsing around a few days before. So being the inquisitive kind I had a dig around and came across this gem.
if (navigator.userAgent.match(/Firefox\/[12]/)) {
   window.location.href = "unsupported.html";
}
For those of you who don't speak JavaScript, this essentially checks (using a regular expression) for the presence of either "Firefox/1" or "Firefox/2" in the useragent string, which identifies the make and version of the web browser requesting the page.

Now I like living on the bleeding edge of browser development and so I run the nightly builds of Firefox. Given the recent change in the release cycle of Firefox, the version number has climbed quite rapidly and the nightly build now sends the following useragent string (you can find out where your browser sends using this website):

Mozilla/5.0 (X11; Linux x86_64; rv:10.0a1) Gecko/20111012 Firefox/10.0a1

So from reading this you should be able to see that I'm actually running version 10.0a1 of Firefox. It should also be clear why the check of my browser resulted in me being bounced to the unsupported browser page: Firefox/10.0a1 contains the string "Firefox/1". This is a good example of why you really shouldn't write your own browser detection code, especially as there are a number of well written and up to date scripts out there that correctly extract the make and version number and which are easy to use. And if you subscribe to the not-invented-here school of thought, then at least make sure you actually implement a sensible solution!

Postvorta: Providing Intelligent Blog Search

The eagle-eyed amongst you may have noticed that about a month ago the search box in the sidebar of this blog changed. I used to use the standard Google search gadget but I now use a gadget powered by Postvorta.

Postvorta was built specifically to enable intelligent searching of blogs. How do I know this you ask? Well I spent the past year building Postvorta in my spare time. The initial motivation was a number of conversations with fellow bloggers about the inadequacies of the Google search gadget and coupled with the fact that my job involves processing natural language documents (I work as part of the GATE group at the University of Sheffield) I thought I was in a position to provide something better.

It is difficult to know exactly how the standard Google search gadget works, but as far as I can tell (both from personal experimentation and from talking to others) it appears to only index the main content of each post. For example, it certainly doesn't index the labels associated with posts. This means that while you can view all posts with a given label you can't search for them using the search gadget. Postvorta, however, indexes all the important content from your blog posts: title, article, labels, and comments. Importantly it does not index the pages you see when you view the blog in a web browser, instead it access the underlying data (via the Google Data APIs) which means that it can ignore the repeated information in the blog template. For example, many blogs contain a gadget which lists recent post titles, these shouldn't be indexed with each post as that makes it much more difficult to search for the actual post. A search can also be restricted by date and/or by the people who commented on a post. I've tried to provide as much flexibility as possible while keeping the full interface relatively simple.

Fortunately when building Postvora I didn't have to start from scratch. One advantage of working in a research group that makes their software available under an open-source license is that I can make use of software I use at work in my own projects. In this case the main indexing and search facilities behind Postvorta are built upon GATE Mímir. I've talked about Mímir before on this blog and if you've read that post then you shouldn't be surprised that as well as searching for words, like Google and the other search engines, you can use Postvorta to search your blogs semantically, i.e. for things. So you can search for any posts containing, for example, the name of a person without knowing what the name was in advance. If you are new to Mímir then Postvorta provides a comprehensive description of the query syntax which becomes available when you choose to use it through the search interface (by default searches are treated as a simple bag-of-words just as with other search engines).

Feel free to have a play with Postvorta through the search gadget on this blog. I'm also using it on my main blog where there are a lot more posts to search through. Postvorta is currently being run as a closed beta (while I evaluate performance, reliability etc.) but if you like what you see then you can register your interest and I'll try and index your blog as soon as possible -- note that currently Postvorta only supports Blogger blogs, although WordPress support should be coming soon.

Let me know what you think.

Blogger's Lightbox Returns To Haunt Our Blogs!

Blogger have now reintroduced lightbox to all blogs. While they claim to have fixed a lot of the bugs/issues that were reported before, they have still turned lightbox on by default. Fortunately we don't need to apply a hacky workaround this time.

If you don't want to use lightbox then you can turn it off for each blog you write through your dashboard. Simply select “No” next to Lightbox in the Settings > Posts and Comments section (new interface) or the Settings > Formatting section (old interface). I still think this feature should be opt-in rather than opt-out but at least they are allowing us to opt-out, so I guess we should be grateful.

Don't Look Down

And now for something completely different -- a posting not at all related to Blogger!

I've recently been spending quite a bit of my free time playing around with GATE Mímir (the reasons for which will become clear in a later post). As I've mentioned before, Mímir is a multi-paradigm indexing and retrieval system which allows us to combine text, annotations and knowledge base data in a single index. Text within Mímir is indexed using MG4J and by default is processed (at both indexing and search time) by a DowncaseTermProcessor which ensures that searches are case insensitive. Unfortunately while case insensitive searching is great there are other common problems when searching text collections, one of which can be nicely illustrated just from the name Mímir.

Whilst the name of the system, Mímir, contains an accented character, most people when searching would probably not go to the bother of figuring out how to enter the accented i and would instead try searching for Mimir. But just as Mímir and Mimir are visually different, so are they different when stored in an MG4J index. In other words if we search using the unaccented version we won't get any results! Whilst Mímir is a slightly unusual case I'm sure that we can all agree that a search for cafe should also bring back documents which mention café.

Now for latin alphabets I could come up with a mapping that would reduce most accented characters down to an unaccented version, but it would be time consuming to build and wouldn't handle the different ways in which accented characters can be encoded using Unicode. So I had a bit of a hunt around and discovered a simple, and I think, elegant way of converting accented characters to their unaccented forms courtesy of a blog posting by Guillaume Laforge. Creating a custom MG4J term processor using this code was trivial and so I now have a way of ensuring that accented characters don't cause me any problems. The one issue was getting Mímir to use the new term processor.

I deploy Mímir in a Tomcat instance by building a WAR file and while I could simply add a JAR file containing my custom term processor to the WEB-INF/lib folder before creating the WAR I'd prefer not to have to. If i included my code within the Mímir WAR then anytime I wanted to make a change would require rebuilding the WAR and redeploying which seems to be more work than necessary. Fortunately the Mímir config file allows you to specify GATE plugins that should be loaded when the web app is started. So it is trivial to create a GATE plugin which references a JAR containing my custom term processor. Unfortunately when I tried this, MG4J threw a ClassNotFoundException. The problem is that Java never looks down.

On the right you can see the ClassLoader hierarchy that is created when Mímir is deployed in Tomcat -- I've added little icons to show which are created by the Java runtime environment, which by Tomcat and which by the GATE embedded within Mímir. As you can see the GATE classloader, which is responsible for loading the plugin containing my custom term processor, is right at the bottom of the hierarchy. The MG4J libraries in the Mímir WEB-INF/lib folder which is the responsibility of the Web App classloader. Each classloader only knows about it's parent and not about any children and when asked to load a class first asks it's parent classloader and only if the class cannot be loaded does it then try loading it itself. The problem I was facing was that when MG4J tried to load my custom term processor it did so by asking the Web App classloader and as it is loaded by a child classloader the class couldn't be found and hence a ClassNotFoundException was thrown. Rather than giving up and simply adding the term processor to the WEB-INF/lib folder I decided to see if I could find a way of injecting the term processor into the right classloader.

Now before we go any further I should point out that one of my collegues has described what follows as evil, and I have to say I agree with him. That said it works and given the way Mímir works I can't see any problems arising, but I wouldn't suggest this as a general solution to the class loading problem described above for reasons I'll detail later. However....

Each class in Java knows which ClassLoader instance was responsible for creating it and we can use this information to forcibly inject code into the right place, using the following method.

private static void codeInjector(Class<?> c1, Class<?> c2)
{
  try
  {
    // Get the class loader which loaded MG4J
    ClassLoader loader = c2.getClassLoader();

    if (!loader.equals(c1.getClassLoader()))
    {
      //Assuming we aren't running inside the MG4J class loader...

      //Get an input stream we can use to read the byte definition of this class				
      InputStream inp = c1.getClassLoader().getResourceAsStream(c1.getName().replace('.', '/') + ".class");

      if (inp != null)
      {
        //If we could get an input stream then...

        //read the class definition into a byte array
        byte[] buf = new byte[1024 * 100]; //assume that the class is no larger than 100KB, this one is only 3.5KB
        int n = inp.read(buf);
        inp.close();

        //get the defineClass method
        Method method = ClassLoader.class.getDeclaredMethod("defineClass", String.class, byte[].class, int.class, int.class);

        //defineClass is protected so we have to make it public before we can call it
        method.setAccessible(true);

        try
        {
          //call defineClass to inject ourselves into the MG4J class loader
          method.invoke(loader, null, buf, 0, n);
        }
        finally
        {
          //set the defineClass method back to being protected
          method.setAccessible(false);
        }
      }
    }
  }
  catch (Exception e)
  {
    //hmm, something has gone badly wrong so throw the exception
    throw new UndeclaredThrowableException(e, "Unable to inject " + c1.getName() + " into the same class loader as " + c2.getName());
  }
}

In essence this method injects the definition of class c1 into the classloader responsible for class c2. So in my term processor I call it from a static initializer as follows:

static
{
  codeInjector(NormalizingTermProcessor.class,
    it.unimi.dsi.mg4j.index.Index.class);
}

So how does this all work. Well hopefully the code comments will help but...
  1. Firstly we check that the classes are defined by different classloaders (lines 6-8)
  2. Then we convert the class name into the path to the class file and try and open a stream to read from that file (line 13). If we can't read the class file then it means we have already injected the class which is why the classloader can't find the class file.
  3. We then read the class file into a byte array (lines 20-22)
  4. To inject code into a classloader we need to use the defineClass method, which unfortunately is protected. So we retrieve a handle to the method and remove the protected restriction (lines 25-29)
  5. We now call deifneClass on the classloader we want to know about the class passing in the bytes we read in from the original class file (line 33)
  6. Finally we put the protected restriction back so we leave things as they were when we found them (line 38)
Now there are a couple of things to be aware of which could trip you up if you try and do something similar:
  1. If there is a security manager in place then you may find that you can't call the defineClass method even when the protected restriction is removed.
  2. This code will result in the same class being defined in two classloaders (which after all was the whole point) but instances of the class cannot be shared between the classloaders. If you try to you will get an exception (can't actually remember which one).
Neither of these seem to be an issue with loading custom MG4J term processors into Mímir, so this seems to be a nice, albeit evil, way of allowing me to add functionality without having to add to the Mímir WAR file. Success!