Archive for the ‘Java’ Category

Some Hibernate Optimization Rules of Thumb

Sunday, September 25th, 2005

So I decided to have a go at further optimizing The Humor Archive Hibernate code. First thing I did was to investigate Query Caching. Query Caching is different to the 2nd level cache in Hibernate (which defaults to EHCache which I have blogged about before) in that the Query Cache keeps a store of previous queries and the results return. Internally it is not dissimilar to a HashMap keyed on the SQL Query String and valued on the object graph returned. Although it is slightly cleverer than this as it knows when to update stored object graphs when other Hibernate queries modify the object within said graph.

So I implemented Query Cacheing for some of the more expensive queries on the site and set some timers up on the code to let me know how long they were taking. One expensive query dropped from taking 500 millis down to about 50 millis. This was an enormous win for me.

Now, you maybe thinking that 500 millis ‘that’s a long query’ and you would be right; basically the category pages where returning all the articles within that category (100-500 articles), not only that but as the article–>category relationship was many-to-many, so an article can be in many categories and a category can contain many articles. Many-to-many relationships are notoriously expensive due the fact that there is potentially a cartesian of a cartesian (indexing avoids this) but still it’s not cheap. Compounding this, I must have been in a hurry when I wrote the query, as it used a sub-query. Well actually it was using the elements ‘function’ of Hibernates HQL, which in the PostgreSQL dialect manifests itself as a subquery.

Rewriting this query to use the ‘left join fetch’ mechanism sped the query up from around 1000 millis (yeah, I know) to a more reasonable 150 millis (still too slow in my book).

So back to the huge amount of articles being returned. I decided, or rather got around to, implementing pagination (pagination is putting a list of things onto many pages and listing the page numbers at the bottom - like search engines do - goooooogle). The pain with pagination is you need to know: the number of pages, the page you are on, whether its the first page or that last page and the number of results on a page. To know the number of pages you need to find the floor of the number of results divided by the number of results per page. You could implement this using two queries, one to count the number of results and one to return the page (using offsets and limits in PostgreSQL). However it’s possible you use a Scrollable result set to do this - performance is about the same as two queries but code complexity is lower. This scheme I implemented and performance improved again! Now we were down to just 10-20 millis for this query.

Interestingly, the list of articles on the homepage don’t need to be joined with the categories and so the query is a lot simpler. They do however need to contain attachment (a one-to-many relationship). Firstly, I though that the left join fetch would give me a speed up - it did with the category queries. However it actually slowed the query down. To understand why we must understand how Hibernate works. If we have an object that has an associate list as a property hibernate by default queries the object and then does a separate query for each item in the associated list of objects. So if we have an article with 5 attachments it will do a query to return the article and the ids of all the attachments and then it will do 5 queries for each attachment.

Now this behaviour can be circumvented by using the ‘left join fetch’ mechanism mention above. This way Hibernate only does one query with, you’ve guessed it, a left join; this will have only one round trip to the database (network IO is the bottleneck usually). So why isn’t this faster than the default multi query method. Well, as I was using the Query Cache it seems that a large query with a large object graph (i.e. the left join fetch) was slower to be drawn from the Query Cache than the a set of smaller queries. My empirical, unscientific, evidence suggests a 2x difference when using the Query Cache.

So in short here are the rules of thumb:

- If you have often repeated queries use the query cache
- If you are using the query cache and you have a one-to-many you will probably be best not using the ‘left join fetch’
- If you have a many-to-many then try the left join fetch method, it should be an improvement even with the query cache
- Scrollable results sets don’t have much advantage over two queries

As with any optimization work your mileage will vary. All applications are different but I hope this has give you some ideas of what you can play with.

This was meant to be a short entry and look what happened - a long rambly entry with no firm conclusion.

The Humor Archives

Saturday, March 12th, 2005

So I have finally got around to updating my oldest (surviving) website The Humor Archives. This website has been around in some shape or form for 10 years and has up until two days ago been running on the same back end code; written in Perl with a hand rolled database (yes, I wrote my own database for it).

So what did I use to rewrite the site? Well to understand my decisions we have to look first at the none functional requirements of the site. It must be able to run on a single processor Pentium III with 384Mb of main memory, not only that but the sites operation must not negatively impact the other 20 or so sites running on the same box! Further the redevelopment must not take me more that a week of evenings to complete. As a final requirement all the sites URL must remain the same (I’ll come back to this).

OK, these requirements seem fairly doable but what volume of traffic do we have to serve? To be honest it’s not a super high traffic website; the figures break down like this: daily visitors: 6-7,000; page requests: 70-80,000. These are averages and so as a quick thumb in the air guess I would say about 30 maybe 40 concurrent user sessions at peak.

So with these requirements I obviously choose Java as the base technology, firstly because I know it well but more importantly Java provides me with a means of limiting the impact of this site on the overall performance the physical machine. By limiting heap size and various other jvm teaks I can get the most out of the box by pooling and reusing resources. The pooling and reuse starts at Sun J2SE 1.5’s super cool garbage collection and general optimisation and works up through Apache Commons DBHP to Hibernate’s EH Cache and beyond - it all works transparently and it works well.

First of all I started with a proof of concept using Tapestry (3.0.1), which I have to say I really liked, the component model is fantastic and something that I hope other frameworks will learn from. However there were a number of things I wasn’t completely happy with. Firstly I found the configuration files to be overly complex and under-ly documented, secondly I didn’t like loosing control of some of the lower level concerns such as session management, creation and so forth. For my site to be truely efficient I aimed to have it truely stateless but a number of Tapestry components were using sessions. Secondly I was having to seriously hack the framework to get ‘friendly urls’ or rather the same urls as the old site. Getting the old urls to work was as simple as configuring and installing urlrewrite the problem was getting tapestry to generate within links the old urls. Tapestry for very noble reasons manages all url creation which is a good thing in most cases but not for mine. I understand that Tapestry 3.1 will bring in more support of ‘friendly urls’ but that is currently in Alpha.

I took a look at Spring after my adventure with Tapestry but found the documentation to be lacking and so abandoned that fairly quickly - I also wasn’t liking the monolithic configuration file it seems to favour. So in the interests of timeliness I converted the tapestry site to Struts (you gotta love the ASF). I did think about using JSF but had already spent two evenings farting around with Tapestry and Spring and wanted to get something up and running ASAP.

So to conclude (about time it has take almost as long to write this blog as it did to write the site) these are the technologies in use (roughly from the bottom of the stack to the top):

Oh and I also managed to use CSS for all positioning which was sweet.

Nutch patches accepted

Tuesday, August 31st, 2004

I’m currently spending some of my free time developing patches for Nutch an OSS search engine. Recently a number of these patches were accepted into the main line of development. The patches I submitted do various things from parsing MS Word properties such as title and author to allowing MP3 audio files to be searched. There is still one outstanding patch of mine to be accepted and that is a rewrite of the HTTP fetching routines (the protocol side of the spider/robot).

Nutch is a pretty exciting project in my opinion with big players like Yahoo and Tim O’Reilly (of O’Reilly books fame) participating. Yahoo have a demo of Nutch on their site.