Java Shared Mime Info
jmimeinfo has moved; please update your links!
I’ve been looking around for a Java library that does a similar job to the Unix file command. I need such a library because I’ll shortly be allowing the public to upload files to a website of mine. When a file is uploaded via HTTP it is encapsulated in a multipart message which includes the mime type. However, what mime type gets put in the multipart is entirely dependent upon the uploader’s browser and/or operating system. So if the user uploads a flash movie file but their system is unaware of the flash movie format, the mime-type will likely be set to a default such as application/octet-stream, which basically tells me that it is a bunch of binary stuff - not much use.
Windows also provides a number of headaches (just for a change) in this area. Because every program installed on the machine is allowed to set mime-types at the OS level, there is often some confusion. So depending on what programs a user has installed, the same file may be encapsulated with a different mime-type. For example, on a Windows system I recently looked at there was a file called video.mpg; the *.mpg extension was associated with the movie/quicktime mime-type, and upon looking at the file in a hex editor it was actually a RIFF-based AVI file - as a colleague of mine would say, ‘genius’.
So this is how Windows deals with things; how do other OSes do it? Well, I’m glad you asked. Linux, or more correctly freedesktop.org, which produces standards and software for numerous OSes (Linux, BSD, Solaris, HP-UX, AIX etc.), has a standard for this called ‘Shared Mime Info’. Shared Mime Info has a specification and a reference implementation (in ANSI C). Shared Mime Info uses what are colloquially known as magic numbers, together with globbing, to determine a file’s type.
Magic numbers, put simply, work by looking for known byte sequences at known positions in a file. For example, a Java class file starts with the hex bytes ‘CA FE BA BE’, an Access database file starts with the string ‘\x00\x01\x00\x00Standard Jet DB’ and, to pick a more complex example, ELF executables start with ‘\177ELF’, then have the byte 1 at offset 5 and a little-endian 16-bit value at offset 16.
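As a rough illustration of the idea (this is not jmimeinfo’s code; the class and method names are made up for the example), the Java class file check and the first two steps of the ELF check described above could be sketched like this:

```java
class MagicSketch {
    // Magic bytes of a Java class file: CA FE BA BE at offset 0.
    static final byte[] JAVA_CLASS_MAGIC = {(byte) 0xCA, (byte) 0xFE, (byte) 0xBA, (byte) 0xBE};

    // "\177ELF": octal 177 is 0x7F, followed by the ASCII letters E, L, F.
    static final byte[] ELF_MAGIC = {0x7F, 'E', 'L', 'F'};

    /** Returns true if the bytes of magic appear in data starting at offset. */
    static boolean matchesAt(byte[] data, int offset, byte[] magic) {
        if (offset < 0 || data.length - offset < magic.length) {
            return false; // not enough data to hold the pattern
        }
        for (int i = 0; i < magic.length; i++) {
            if (data[offset + i] != magic[i]) {
                return false;
            }
        }
        return true;
    }

    static boolean looksLikeJavaClass(byte[] header) {
        return matchesAt(header, 0, JAVA_CLASS_MAGIC);
    }

    /** Checks the ELF signature and then the byte 1 at offset 5. */
    static boolean looksLikeLittleEndianElf(byte[] header) {
        return matchesAt(header, 0, ELF_MAGIC) && matchesAt(header, 5, new byte[] {1});
    }

    public static void main(String[] args) {
        byte[] classHeader = {(byte) 0xCA, (byte) 0xFE, (byte) 0xBA, (byte) 0xBE, 0, 0, 0, 0x31};
        byte[] elfHeader = {0x7F, 'E', 'L', 'F', 1, 1, 1, 0};
        System.out.println("class file? " + looksLikeJavaClass(classHeader));
        System.out.println("little-endian ELF? " + looksLikeLittleEndianElf(elfHeader));
    }
}
```

A real implementation reads these rules from the compiled Shared Mime Info database rather than hard-coding them, and only needs to read as many bytes of the file as the longest rule requires.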
Globbing is just a simple form of regular expression matched against a file path. So for example the glob for a Word document would be *.doc and the glob for a GNU makefile would be [Mm]akefile.
Shared Mime Info also goes one step further, providing a plain English (or any one of 30 other languages) description of each file format as well as suggesting applications that can deal with the file.
Anyway, so Java doesn’t have an implementation of Shared Mime Info… until now. I timeboxed myself to 2 hours over 2 days to put this together. It is a basic implementation of Shared Mime Info for Java: the code is currently messy and it doesn’t cover the whole specification yet, but it does provide a full implementation of magic and glob based mime identification. It’s licensed under the GPL and requires Java SE 5 or higher.
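To make the relationship between globs and regular expressions concrete, here is a minimal sketch that translates a glob into a java.util.regex.Pattern. It is an illustration only, not how jmimeinfo does it (the library uses Jakarta ORO), and it assumes a deliberately small glob dialect: *, ? and simple [...] character classes, with no negation or escaping. The class and method names are invented for the example.

```java
import java.util.regex.Pattern;

class GlobSketch {
    /** Translates a simple glob (*, ?, [...]) into an equivalent regular expression. */
    static Pattern globToRegex(String glob) {
        StringBuilder regex = new StringBuilder();
        boolean inClass = false; // true while we are between '[' and ']'
        for (char c : glob.toCharArray()) {
            if (inClass) {
                // Copy character class contents through, escaping backslashes.
                if (c == ']') {
                    inClass = false;
                }
                regex.append(c == '\\' ? "\\\\" : String.valueOf(c));
            } else if (c == '*') {
                regex.append(".*"); // any run of characters
            } else if (c == '?') {
                regex.append('.'); // exactly one character
            } else if (c == '[') {
                inClass = true;
                regex.append('[');
            } else {
                // Everything else is literal, including '.' in "*.doc".
                regex.append(Pattern.quote(String.valueOf(c)));
            }
        }
        return Pattern.compile(regex.toString());
    }

    /** Returns true if the whole file name matches the glob. */
    static boolean matches(String glob, String fileName) {
        return globToRegex(glob).matcher(fileName).matches();
    }

    public static void main(String[] args) {
        System.out.println(matches("*.doc", "report.doc"));
        System.out.println(matches("[Mm]akefile", "makefile"));
        System.out.println(matches("*.doc", "report.docx"));
    }
}
```

The glob is matched against the file name rather than the full path here, and an unterminated [ makes Pattern.compile throw a PatternSyntaxException, which a real implementation would need to handle.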
The source and compiled jars can be downloaded here:
jmimeinfo-0.0.1.tar.gz (08-Nov-2006)
jmimeinfo-0.0.2.tar.gz (12-Nov-2006)
Fixed concurrency issue, I (stupidly) assumed that Matchers in ORO are threadsafe.
jmimeinfo-0.0.3.tar.gz (20-Apr-2007)
Fixed problem with RandomAccessFiles not being closed under certain conditions in net.hedges.mimeinfo.magic.Rule - thanks to Ryan and KewlCat for pointing that out (see comments below).
To determine the mime type of a file, use code like the following:
File file = new File("/path/to/my/file.ext");
MimeInfo mimeInfo = new MimeInfo();
System.out.println("mime-type: " + mimeInfo.getMimeType(file));
To test the library run the following from the expanded archive:
java -cp jmimeinfo-0.0.1.jar:lib/jakarta-oro-2.0.8.jar net.hedges.mimeinfo.ui.Test /path/to/file.ext
All comments and feedback are welcome, as are patches and bug reports.
November 8th, 2006 at 2:04 pm
What about jmimemagic?
That project could use some help actually:
http://sourceforge.net/projects/jmimemagic/
Btw, I had to register to be able to post a comment. The captcha stuff doesn’t seem to work. There is no image.
November 8th, 2006 at 2:33 pm
Yes, I looked at jmimemagic and Sun’s activation framework too. However, neither supports many file formats and neither is based on a standard with as much support as Shared Mime Info. The Shared Mime Info database supports several hundred file formats and 30 natural languages and is growing.
BTW captcha stuff worked just now when I tested it. The image takes a little while to load, slow server I’m afraid.
November 8th, 2006 at 4:58 pm
One has to wonder, why the ORO dependency if you’re using JDK 5?
Otherwise, good idea!
November 8th, 2006 at 11:49 pm
@Hani
Java does not support globbing, which is different from the Perl5-like regular expressions supported by java.util.regex.Pattern since Java 1.4 - unless I’m missing something, which is entirely possible.
December 6th, 2006 at 9:22 am
I have been using JMimeMagic, but it seems to have been abandoned. There are a couple of nice things about it. First, it is easy to prune the magic rules file down to only the mime types that you are concerned with, and reorder it, which greatly speeds up an attempted match. Secondly, it is easy to add your own magic rules for esoteric file types.
Good luck.
December 6th, 2006 at 9:36 am
@Rod
Yes, jmimemagic doesn’t seem to have had an update for a while. I’m planning on adding the database manipulation features, so hopefully that will be useful to you. However, there are a few other things to do first, like being able to return the file meta info for each type.
If anyone gets there before me with the database manipulation please feel free to send in the patches.
You can currently create any database you like as you described by editing the XML master file and running the database compile program from freedesktop.org.
December 18th, 2006 at 11:31 pm
Great work! I’d love to see a URL or inputstream method as well for getting mime types on externally hosted files.
December 20th, 2006 at 5:05 am
Actually jMimeMagic isn’t abandoned, although it was slow going for a while. I just released 0.1.0 a couple of months ago, which added some hinting for extensions to greatly speed up matching and now supports custom matchers.
And yeah, I know folks want to see more matchers out of the box. I’m gonna have to dig in again and fix my converter for the unix magic file and get a more comprehensive set of matchers out of the box. Would love to see matcher contributions and suggestions get fed into jMimeMagic.
Cheers
January 23rd, 2007 at 4:15 am
Thanks for creating and publishing the jmimeinfo package. I am using it and it is very helpful. I think I found a bug. Perhaps it’s just the way I am using it, but it seems that within Rule.java, inside match(File file), RandomAccessFile raf does not get closed when there is a successful match. This is preventing me from deleting files when I find an available but inappropriate mime type. As a result I added a try/finally block to the function. Let me know what you think.
My match function:
public boolean match(File file) throws IOException {
RandomAccessFile raf = new RandomAccessFile(file, "r");
try {
for (int offset = startOffset; offset
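The listing above is cut off in the original comment. As a rough sketch of the try/finally pattern it describes (this is not the commenter’s patch or the actual Rule.match code; the class and method names are invented for the example), the file handle can be released on every return path like this:

```java
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.Arrays;

class CloseSketch {
    /** Returns true if the file starts with the given bytes; always closes the file. */
    static boolean startsWith(File file, byte[] magic) throws IOException {
        RandomAccessFile raf = new RandomAccessFile(file, "r");
        try {
            if (raf.length() < magic.length) {
                return false; // the finally block still runs on this early return
            }
            byte[] head = new byte[magic.length];
            raf.readFully(head);
            return Arrays.equals(head, magic);
        } finally {
            raf.close(); // runs on match, on mismatch and on exceptions
        }
    }

    public static void main(String[] args) throws IOException {
        File tmp = File.createTempFile("sketch", ".bin");
        FileOutputStream out = new FileOutputStream(tmp);
        try {
            out.write(new byte[] {(byte) 0xCA, (byte) 0xFE, (byte) 0xBA, (byte) 0xBE});
        } finally {
            out.close();
        }
        System.out.println("matched: " + startsWith(tmp, new byte[] {(byte) 0xCA, (byte) 0xFE}));
        // On Windows an open handle would block this delete; here the file is already closed.
        System.out.println("deleted: " + tmp.delete());
    }
}
```

The sketch sticks to try/finally rather than try-with-resources because the library targets Java SE 5.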
April 12th, 2007 at 11:01 am
Same problem here : files get parsed and identified but not closed properly, so any further action on the file is prohibited :-/
I tried to add the two missing “raf.close()” before both returns in Rule.java (lines 133 and 140) but when I recompiled (using Eclipse and the build script) and then tried to use my own version, there was a NullPointerException : net.hedges.mimeinfo.magic.MagicFile.parseMagic(MagicFile.java:51).
Apparently it needs a MIME-Magic that’s not provided in the archive
Could somebody please create and provide a clean jmimeinfo-0.0.3, or a way to make it ?
April 12th, 2007 at 1:59 pm
Wooops… Sorry about that. I _had_ to place the directory named “jmimeinfo/” inside the Eclipse workspace in order to make Ant work as intended. It works !
April 20th, 2007 at 4:32 pm
Thanks Ryan and KewlCat, the problem has been fixed and the new version 0.0.3 is available above.
May 15th, 2007 at 2:56 pm
Hello,
jmimeinfo works very well, and it is a good tool, but not for PSD files: the fingerprint is not correct and jmimeinfo does not recognize PSD files… How do I update the mime magic file?
May 16th, 2007 at 10:18 am
for psd files : 2 bugs…
first :
in MagicFile.java function readLine()
byte[] buff = new byte[valueLength];
in.readFully(buff);
line.setValue(buff);
b = in.read();
if (b == '&') {
buff = new byte[valueLength]; // this must be added otherwise value is replaced with mask buffer
in.readFully(buff);
line.setMask(buff);
b = in.read();
}
second:
in Rule.java function Mask()
private void mask(byte[] sample){
for(int i = 0; i
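The second listing is cut off in the original comment, so the commenter’s fix is not recoverable. For context, here is a minimal sketch of how a masked magic comparison is generally expected to behave, assuming (per the Shared Mime Info specification) that the mask is ANDed with the bytes read from the file before they are compared with the value. The class and method names are hypothetical and this is not the code in Rule.java.

```java
class MaskSketch {
    /**
     * Returns true if the first value.length bytes of sample, ANDed with mask,
     * equal value. The mask must be the same length as the value.
     */
    static boolean matchesMasked(byte[] sample, byte[] value, byte[] mask) {
        if (mask.length != value.length || sample.length < value.length) {
            return false;
        }
        for (int i = 0; i < value.length; i++) {
            // Work on a masked copy of each byte so the sample itself is left untouched.
            byte masked = (byte) (sample[i] & mask[i]);
            if (masked != value[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] value = {0x10, 0x34};
        byte[] mask = {(byte) 0xF0, (byte) 0xFF}; // ignore the low nibble of the first byte
        System.out.println(matchesMasked(new byte[] {0x12, 0x34}, value, mask));
        System.out.println(matchesMasked(new byte[] {0x22, 0x34}, value, mask));
    }
}
```

Keeping the value and mask in separate buffers, as the first fix above does, is what makes this comparison possible at all: if the mask overwrites the value, every masked rule compares against the wrong bytes.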
May 23rd, 2007 at 10:58 am
Hi, I’m just looking at jmimeinfo for detecting the mime type of blobs fetched from our own db. Thus I have in-memory byte arrays, not files. Can I use it anyway?
Also does the GPL licence imply that we must make all other proprietary application jars available in source ?
Thanks.
May 23rd, 2007 at 11:16 am
Emmanuel, I’ll look into this when I get a minute. Thanks for posting the issue and solution.
May 23rd, 2007 at 11:18 am
Renzo,
It shouldn’t be hard to make the necessary change to the code. If you could send me the changes that would be great! I’ll probably be adding this feature soon but can’t guarantee a timescale.
As for the license issues you should consult with your legal people.
July 13th, 2007 at 8:53 pm
Sorry, been afk due to some job switching.
Yeah, if you send me any patches you guys have (arimus AT users.sourceforge.net), I can merge in the changes for folks. I only ask that you give me the okay to dual-license any contributions… I may switch to another license (Apache perhaps) at some point here.
As for licensing, I am currently licensing under the LGPL, which is not a viral license and does not require you to release your code at all. It only requires modifications to jmimemagic itself be opened up to those you distribute to.
jMimeMagic can handle both files and byte[]. Currently streams are not supported, but that wouldn’t be too hard to add.
September 7th, 2007 at 5:07 am
First - Nice library!
Second - I have users that create Excel files and then later down the road save them with a .csv extension. I know… I know… why they do this is beyond this question. When I use version 0.0.6 on this type of file it reports it as text/csv. Opening the file in a text editor clearly shows that this is not a csv file. Can you give a reason why it is returning “text/csv” instead of “application/vnd.ms-excel”?
Thanks!
Matt