Archive for November, 2006

Java Shared Mime Info

Wednesday, November 8th, 2006

jmimeinfo has moved; please update your links!

I’ve been looking around for a Java library that does a similar job to the Unix file command. I need one because I’ll shortly be allowing the public to upload files to a website of mine. When a file is uploaded via HTTP it is encapsulated in a multipart message that includes a MIME type. However, which MIME type ends up in the multipart message depends entirely on the uploader’s browser and/or operating system. So if a user uploads a Flash movie but their system doesn’t know about the Flash movie format, the mime-type will likely fall back to a default such as application/octet-stream, which basically tells me that it is a bunch of binary stuff - not much use.

Windows also provides a number of headaches (just for a change) in this area. Because every program installed on the machine is allowed to set mime-types at the OS level, there is often some confusion: depending on which programs a user has installed, the same file may be encapsulated with a different mime-type. For example, on a Windows system I recently looked at there was a file called video.mpg. The *.mpg extension was associated with the movie/quicktime mime-type, and upon looking at the file in a hex editor it turned out to be a RIFF-based AVI file - as a colleague of mine would say, ‘genius’.

So this is how Windows deals with things; how do other OSes do it? Well, I’m glad you asked. Linux, or more correctly freedesktop.org, which produces standards and software for numerous OSes (Linux, BSD, Solaris, HP-UX, AIX etc.), has a standard for this called ‘Shared Mime Info’. Shared Mime Info has a specification and a reference implementation (in ANSI C). It uses what are colloquially known as magic numbers, together with globbing, to determine a file’s type.

Magic numbers, put simply, work by looking for known byte sequences at known offsets in a file. For example, a Java class file starts with the hex bytes ‘CA FE BA BE’, and an Access database file starts with the string ‘\x00\x01\x00\x00Standard Jet DB’. To pick a more complex example, ELF executables start with ‘\177ELF’, then have a byte with the value 1 at offset 5, and then a little-endian 16-bit value at offset 16 (2 for an executable).
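To make the idea concrete, here is a minimal sketch of magic matching in plain Java. It is not jmimeinfo's code: the class name MagicSketch, the helpers matchesAt and little16At, and the returned labels are all made up for illustration, and the labels are descriptions rather than official mime-types. The ELF rule assumes the header layout as I understand it (byte value 1 at offset 5, little-endian 16-bit value 2 at offset 16).

```java
import java.nio.charset.StandardCharsets;

class MagicSketch {

    // Hypothetical helper: true if `expected` appears in `data` starting at `offset`.
    static boolean matchesAt(byte[] data, int offset, byte[] expected) {
        if (offset < 0 || data.length - offset < expected.length) {
            return false;
        }
        for (int i = 0; i < expected.length; i++) {
            if (data[offset + i] != expected[i]) {
                return false;
            }
        }
        return true;
    }

    // Hypothetical helper: unsigned little-endian 16-bit value at `offset`,
    // or -1 if the buffer is too short.
    static int little16At(byte[] data, int offset) {
        if (offset < 0 || offset + 2 > data.length) {
            return -1;
        }
        return (data[offset] & 0xFF) | ((data[offset + 1] & 0xFF) << 8);
    }

    // Returns a descriptive label (not an official mime-type), or "unknown".
    static String sniff(byte[] head) {
        byte[] javaClass = {(byte) 0xCA, (byte) 0xFE, (byte) 0xBA, (byte) 0xBE};
        byte[] jetDb = "\000\001\000\000Standard Jet DB".getBytes(StandardCharsets.ISO_8859_1);
        byte[] elf = "\177ELF".getBytes(StandardCharsets.ISO_8859_1);

        if (matchesAt(head, 0, javaClass)) {
            return "Java class file";
        }
        if (matchesAt(head, 0, jetDb)) {
            return "Access database";
        }
        // Nested rule: all three conditions must hold (assumed ELF layout, see above).
        if (matchesAt(head, 0, elf)
                && matchesAt(head, 5, new byte[] {1})
                && little16At(head, 16) == 2) {
            return "ELF executable";
        }
        return "unknown";
    }

    public static void main(String[] args) {
        // The first bytes of a made-up class file header.
        byte[] classHeader = {(byte) 0xCA, (byte) 0xFE, (byte) 0xBA, (byte) 0xBE, 0, 0, 0, 49};
        System.out.println(sniff(classHeader));
    }
}
```

In practice you would feed sniff the first few dozen bytes read from the uploaded file rather than a hand-built array.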

Globbing is just a simple form of regular expression matched against a file path. For example, the glob for a Word document would be *.doc and the glob for a GNU makefile would be [Mm]akefile.
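As a rough illustration of glob matching (again, not how jmimeinfo does it), the sketch below leans on the glob support in java.nio.file. That API needs Java 7 or later, so it is only a stand-in for the idea. The class name GlobSketch, the method matchesGlob and the sample paths are hypothetical.

```java
import java.nio.file.FileSystems;
import java.nio.file.Path;
import java.nio.file.PathMatcher;
import java.nio.file.Paths;

class GlobSketch {

    // Hypothetical helper: matches the glob against the file name only,
    // ignoring any directories in front of it.
    static boolean matchesGlob(String glob, String filePath) {
        PathMatcher matcher = FileSystems.getDefault().getPathMatcher("glob:" + glob);
        Path name = Paths.get(filePath).getFileName();
        return name != null && matcher.matches(name);
    }

    public static void main(String[] args) {
        System.out.println(matchesGlob("*.doc", "/home/me/report.doc"));
        System.out.println(matchesGlob("[Mm]akefile", "/home/me/src/Makefile"));
        System.out.println(matchesGlob("[Mm]akefile", "/home/me/src/notes.txt"));
    }
}
```

Matching on the file name alone keeps a pattern like *.doc from being confused by directory separators.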

Shared Mime Info also goes one step further, providing a plain English (or any one of 30 other languages) description of each file format as well as suggesting applications that can deal with the file.

Anyway, Java doesn’t have an implementation of Shared Mime Info… until now. I timeboxed myself to 2 hours over 2 days to put this together. It is a basic implementation of Shared Mime Info for Java: the code is currently messy and it doesn’t cover the whole specification yet, but it does provide a full implementation of magic- and glob-based mime identification, and it’s licensed under the GPL. It requires Java SE 5 or higher.

The source and compiled jars can be downloaded here:

jmimeinfo-0.0.1.tar.gz (08-Nov-2006)

jmimeinfo-0.0.2.tar.gz (12-Nov-2006)
Fixed a concurrency issue: I (stupidly) assumed that Matchers in ORO are thread-safe.

jmimeinfo-0.0.3.tar.gz (20-Apr-2007)
Fixed a problem with RandomAccessFiles not being closed under certain conditions in net.hedges.mimeinfo.magic.Rule. Thanks to Ryan and KewlCat for pointing that out (see comments below).

To determine the mime type of a file, use the following code:


File file = new File("/path/to/my/file.ext");
MimeInfo mimeInfo = new MimeInfo();
System.out.println("mime-type: " + mimeInfo.getMimeType(file));

To test the library, run the following from the expanded archive:


java -cp jmimeinfo-0.0.1.jar:lib/jakarta-oro-2.0.8.jar net.hedges.mimeinfo.ui.Test /path/to/file.ext

All comments and feedback welcome as well as patches and bug reports :)


My Excellent Host

Tuesday, November 7th, 2006

I use LayeredTech as my unmanaged hosting provider. My hobby site, The Humor Archives, is hosted there on a Linux box. Recently I had problems, possibly of my own making (I’m not sure yet), that required a hard reboot (i.e. have you tried turning it off and on again?). So I submitted a ticket, thinking it would be a few days before it was actioned… 17 minutes later came a message from their tech gals:

Hello,

In progress

Thanks

Rebecka
Layeredtech Technologies - Support 1

Now that is service! Also bear in mind that it is 2 in the morning where they are. I’m very pleased with them, and have already recommended them to people… I wonder if they have a referral scheme.