Java Shared Mime Info

jmimeinfo has moved, please update your links!

I’ve been looking around for a Java library that does a similar job to the Unix file command. I need such a library because I’ll shortly be allowing the public to upload files to a website of mine. When a file is uploaded via HTTP it is encapsulated as a multipart message, which includes the MIME type. However, the MIME type that gets put in the multipart message depends entirely on the uploader’s browser and/or operating system. So if the user uploads a Flash movie file but their system is unaware of the Flash movie format, the MIME type will likely be set to a default such as application/octet-stream, which basically tells me that it is a bunch of binary stuff - not much use.

Windows also provides a number of headaches (just for a change) in this area: because every program installed on the machine is allowed to set mime-types at the OS level, there is often some confusion. So depending on what programs a user has installed, the same file may be encapsulated with a different mime-type. For example, on a Windows system I recently looked at there was a file called video.mpg; the *.mpg extension was associated with the movie/quicktime mime-type, and upon looking at the file in a hex editor it was actually a RIFF-based AVI file - as a colleague of mine would say, ‘genius’.

So this is how Windows deals with things; how do other OSes do it? Well, I’m glad you asked. Linux, or more correctly freedesktop.org, which produces standards and software for numerous OSes (Linux, BSD, Solaris, HP-UX, AIX etc.), has a standard for this called ‘Shared Mime Info’. Shared Mime Info has a specification and a reference implementation (in ANSI C). It uses what are colloquially known as magic numbers, together with globbing, to determine a file’s type.

Magic numbers, put simply, work by looking for particular byte sequences at known positions in a file. For example, a Java class file starts with the hex bytes ‘CA FE BA BE’, and an Access database file starts with the string ‘\x00\x01\x00\x00Standard Jet DB’. To pick a more complex example, ELF executables start with ‘\177ELF’, then 5 bytes in have the byte 1, then a further 16 bytes in have a little-endian byte 16.
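To make the idea concrete, here is a minimal sketch of a magic number check in plain Java. It is not the jmimeinfo code: the class name MagicSketch and the method matchesAt are made up for illustration, and it only compares a fixed byte sequence at a given offset, whereas the real magic rules are richer than that.

```java
import java.util.Arrays;

class MagicSketch {
    // First four bytes of a Java class file: CA FE BA BE.
    static final byte[] JAVA_CLASS_MAGIC = {(byte) 0xCA, (byte) 0xFE, (byte) 0xBA, (byte) 0xBE};
    // First four bytes of an ELF executable: octal 177 (0x7F) followed by "ELF".
    static final byte[] ELF_MAGIC = {0x7F, 'E', 'L', 'F'};

    /** Returns true if the bytes of 'magic' appear in 'data' starting at 'offset'. */
    static boolean matchesAt(byte[] data, int offset, byte[] magic) {
        if (offset < 0 || data.length - offset < magic.length) {
            return false; // not enough bytes left to hold the pattern
        }
        for (int i = 0; i < magic.length; i++) {
            if (data[offset + i] != magic[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Hypothetical sample: the start of a class file (magic plus a version number).
        byte[] sample = {(byte) 0xCA, (byte) 0xFE, (byte) 0xBA, (byte) 0xBE, 0, 0, 0, 52};
        System.out.println("sample bytes: " + Arrays.toString(sample));
        System.out.println("looks like a class file: " + matchesAt(sample, 0, JAVA_CLASS_MAGIC));
        System.out.println("looks like an ELF file: " + matchesAt(sample, 0, ELF_MAGIC));
    }
}
```

In practice you would read the first few dozen bytes of the uploaded file into the array and try each known pattern in turn.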

Globbing is just a simple form of pattern matching, similar to but much simpler than regular expressions, applied to a file path. For example, the glob for a Word document would be *.doc and the glob for a GNU makefile would be [Mm]akefile.
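As a rough illustration of glob matching, the sketch below uses the glob support in java.nio.file, which only exists in Java 7 and later, so it is not how jmimeinfo itself does it. The class name GlobSketch and the method matchesGlob are invented for this example.

```java
import java.nio.file.FileSystems;
import java.nio.file.Path;
import java.nio.file.PathMatcher;
import java.nio.file.Paths;

class GlobSketch {
    /** Returns true if the last element of 'path' (the file name) matches 'glob'. */
    static boolean matchesGlob(String path, String glob) {
        PathMatcher matcher = FileSystems.getDefault().getPathMatcher("glob:" + glob);
        // Match against the file name only, since '*' does not cross directory separators.
        Path name = Paths.get(path).getFileName();
        return name != null && matcher.matches(name);
    }

    public static void main(String[] args) {
        System.out.println(matchesGlob("letters/report.doc", "*.doc"));
        System.out.println(matchesGlob("src/Makefile", "[Mm]akefile"));
        System.out.println(matchesGlob("notes.txt", "*.doc"));
    }
}
```

A glob only looks at the name, so it is cheap but easy to fool; that is why it is used alongside the magic number check rather than instead of it.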

Shared Mime Info also goes one step further, providing a plain English (or any one of 30 other languages) description of each file format as well as suggesting applications that can deal with the file.

Anyway, so Java doesn’t have an implementation of Shared Mime Info… until now. I timeboxed myself to 2 hours over 2 days to put this together. It is a basic implementation of Shared Mime Info for Java: the code is currently a bit messy and it isn’t a full implementation of the specification yet, but it does provide a full implementation of magic- and glob-based MIME identification. It’s licensed under the GPL and requires Java SE 5 or higher.

The source and compiled jars can be downloaded here:

jmimeinfo-0.0.1.tar.gz (08-Nov-2006)

jmimeinfo-0.0.2.tar.gz (12-Nov-2006)
Fixed a concurrency issue: I (stupidly) assumed that Matchers in ORO are thread-safe.

jmimeinfo-0.0.3.tar.gz (20-Apr-2007)
Fixed a problem with RandomAccessFiles not being closed under certain conditions in net.hedges.mimeinfo.magic.Rule - thanks to Ryan and KewlCat for pointing that out (see comments below).
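For anyone curious, the general shape of that kind of fix is to close the file in a finally block so that it is released on every return path. The sketch below is not the actual Rule code; the class CloseSketch and the method readHeader are hypothetical and simply read the first few bytes of a file.

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;

class CloseSketch {
    /** Reads up to 'length' bytes from the start of 'file', always closing it afterwards. */
    static byte[] readHeader(File file, int length) throws IOException {
        RandomAccessFile raf = new RandomAccessFile(file, "r");
        try {
            // Never ask for more bytes than the file actually holds.
            int n = (int) Math.min((long) length, raf.length());
            byte[] header = new byte[n];
            raf.readFully(header);
            return header;
        } finally {
            // Runs on normal return and on exceptions, so the file can be deleted later.
            raf.close();
        }
    }

    public static void main(String[] args) throws IOException {
        File file = new File(args[0]);
        System.out.println("read " + readHeader(file, 16).length + " bytes from " + file);
    }
}
```

Without the finally block, an early return on a successful match leaves the file open, which on some systems prevents it from being deleted.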

To determine the MIME type of a file, use the following code:


File file = new File("/path/to/my/file.ext");
MimeInfo mimeInfo = new MimeInfo();
System.out.println("mime-type: " + mimeInfo.getMimeType(file));

To test the library run the following from the expanded archive:


java -cp jmimeinfo-0.0.1.jar:lib/jakarta-oro-2.0.8.jar net.hedges.mimeinfo.ui.Test /path/to/file.ext

All comments and feedback welcome as well as patches and bug reports :)


19 Responses to “Java Shared Mime Info”

  1. marvi Says:

    What about jmimemagic?

    That project could use some help actually:

    http://sourceforge.net/projects/jmimemagic/

    Btw, I had to register to be able to post a comment. The captcha stuff doesn’t seem to work. There is no image.

  2. andy Says:

    Yes, I looked at jmimemagic and Sun’s activation framework too. However neither support many file formats and neither is based on a standard with as much support as Shared Mime Info. The Shared Mime Info database supports several hundred file formats and 30 natural languages and is growing.

    BTW captcha stuff worked just now when I tested it. The image takes a little while to load, slow server I’m afraid.

  3. Hani Suleiman Says:

    One has to wonder, why the ORO dependency if you’re using JDK 5?

    Otherwise, good idea!

  4. andy Says:

    @Hani

    Java does not support globbing, which is different from the Perl5-like regular expressions supported by java.util.regex.Pattern since Java 1.4 - unless I’m missing something, which is entirely possible.

  5. Rod Says:

    I have been using JMimeMagic, but it seems to have been abandoned. There are a couple of nice things about it. First, it is easy to prune the magic rules file down to only the mime types that you are concerned with, and reorder it, which greatly speeds up an attempted match. Secondly, it is easy to add your own magic rules for esoteric file types.

    Good luck.

  6. andy Says:

    @Rod

    Yes, jmimemagic doesn’t seem to have had an update for a while. I’m planning on adding the database manipulation features, so hopefully that will be useful to you. However, there are a few other things to do first, like being able to return the file meta info for each type.

    If anyone gets there before me with the database manipulation please feel free to send in the patches.

    You can currently create any database you like as you described by editing the XML master file and running the database compile program from freedesktop.org.

  7. Joshua Says:

    Great work! I’d love to see a URL or inputstream method as well for getting mime types on externally hosted files.

  8. arimus Says:

    Actually jMimeMagic isn’t abandoned, although it was slow going for a while. I just released 0.1.0 a couple of months ago, which added some hinting for extensions to greatly speed up matching and now supports custom matchers.

    And yeah, I know folks want to see more matchers out of the box. I’m gonna have to dig in again and fix my converter for the unix magic file and get a more comprehensive set of matchers out of the box. Would love to see matcher contributions and suggestions get fed into jMimeMagic.

    Cheers

  9. Ryan Says:

    Thanks for creating and publishing the jmimeinfo package. I am using it and it is very helpful. I think I found a bug. Perhaps it’s just in the way I am using it, but it seems that within Rule.java inside match(File file), RandomAccessFile raf does not get closed when there is a successful match. This prevents me from deleting files when I find an available but inappropriate mime type. As a result I added a try/finally block to the function. Let me know what you think.

    My match function:
    public boolean match(File file) throws IOException {
    RandomAccessFile raf = new RandomAccessFile(file, "r");
    try {
    for (int offset = startOffset; offset

  10. KewlCat Says:

    Same problem here : files get parsed and identified but not closed properly, so any further action on the file is prohibited :-/
    I tried to add the two missing “raf.close()” before both returns in Rule.java (lines 133 and 140) but when I recompiled (using Eclipse and the build script) and then tried to use my own version, there was a NullPointerException : net.hedges.mimeinfo.magic.MagicFile.parseMagic(MagicFile.java:51).
    Apparently it needs a MIME-Magic that’s not provided in the archive :-(

    Could somebody please create and provide a clean jmimeinfo-0.0.3, or a way to make it ?

  11. KewlCat Says:

    Wooops… Sorry about that. I _had_ to place the directory named “jmimeinfo/” inside the Eclipse workspace in order to make Ant work as intended. It works ! :-)

  12. andy Says:

    Thanks Ryan and KewlCat, the problem has been fixed and the new version 0.0.3 is available above.

  13. emmanuel Says:

    Hello,

    jmimeinfo works very well, and it is a good tool, but not for PSD files, fingerprint is not correct and jmimeinfo does not recognize PSD files… How to update mime magic file ?

  14. emmanuel Says:

    for psd files : 2 bugs…
    first :
    in MagicFile.java function readLine()
    byte[] buff = new byte[valueLength];
    in.readFully(buff);
    line.setValue(buff);

    b = in.read();
    if (b == '&') {
    buff = new byte[valueLength]; // this must be added otherwise value is replaced with mask buffer
    in.readFully(buff);
    line.setMask(buff);
    b = in.read();
    }
    second:
    in Rule.java function Mask()
    private void mask(byte[] sample){
    for(int i = 0; i

  15. Renzo Says:

    Hi, I’m just looking at jmimeinfo for detecting mime type from blobs as fetched from an own db. Thus I have memory byte arrays, not files. Can I use it anyway ?
    Also does the GPL licence imply that we must make all other proprietary application jars available in source ?
    Thanks.

  16. andy Says:

    Emmanuel, I’ll look into this when I get a minute. Thanks for posting the issue and solution.

  17. andy Says:

    Renzo,

    It shouldn’t be hard to make the necessary change to the code. If you could send me the changes that would be great! I’ll probably be adding this feature soon but can’t guarantee a timescale.

    As for the license issues you should consult with your legal people.

  18. arimus Says:

    Sorry, been afk due to some job switching ;) Yeah, if you provide me with any patches you guys have arimus AT users.sourceforge.net, I can merge in the changes for folks. I only ask that you give me the okay to dual-license any contributions…I may switch to another license (Apache perhaps) at some point here.

    As for licensing, I am currently licensing under the LGPL, which is not a viral license and does not require you to release your code at all. It only requires modifications to jmimemagic itself be opened up to those you distribute to.

    jMimeMagic can handle both files and byte[]. Currently streams are not supported, but that wouldn’t be too hard to add.

  19. Matt Says:

    First - Nice library!

    Second - I have users that create Excel files and then later down the road save them with a .csv extension. I know…I know…why they do this is beyond this question. When I use version 0.0.6 on this type of file it reports it as a text/csv. Opening the file in a text editor clearly shows that this is not a csv file. Can you give a reason why it is returning “text/csv” instead of “application/vnd.ms-excel”?

    Thanks!

    Matt
