So I decided to have a go at further optimizing The Humor Archive Hibernate code. First thing I did was to investigate Query Caching. Query Caching is different to the 2nd level cache in Hibernate (which defaults to EHCache which I have blogged about before) in that the Query Cache keeps a store of previous queries and the results return. Internally it is not dissimilar to a HashMap keyed on the SQL Query String and valued on the object graph returned. Although it is slightly cleverer than this as it knows when to update stored object graphs when other Hibernate queries modify the object within said graph.
So I implemented Query Cacheing for some of the more expensive queries on the site and set some timers up on the code to let me know how long they were taking. One expensive query dropped from taking 500 millis down to about 50 millis. This was an enormous win for me.
Now, you maybe thinking that 500 millis ‘that’s a long query’ and you would be right; basically the category pages where returning all the articles within that category (100-500 articles), not only that but as the article–>category relationship was many-to-many, so an article can be in many categories and a category can contain many articles. Many-to-many relationships are notoriously expensive due the fact that there is potentially a cartesian of a cartesian (indexing avoids this) but still it’s not cheap. Compounding this, I must have been in a hurry when I wrote the query, as it used a sub-query. Well actually it was using the elements ‘function’ of Hibernates HQL, which in the PostgreSQL dialect manifests itself as a subquery.
Rewriting this query to use the ‘left join fetch’ mechanism sped the query up from around 1000 millis (yeah, I know) to a more reasonable 150 millis (still too slow in my book).
So back to the huge amount of articles being returned. I decided, or rather got around to, implementing pagination (pagination is putting a list of things onto many pages and listing the page numbers at the bottom - like search engines do - goooooogle). The pain with pagination is you need to know: the number of pages, the page you are on, whether its the first page or that last page and the number of results on a page. To know the number of pages you need to find the floor of the number of results divided by the number of results per page. You could implement this using two queries, one to count the number of results and one to return the page (using offsets and limits in PostgreSQL). However it’s possible you use a Scrollable result set to do this - performance is about the same as two queries but code complexity is lower. This scheme I implemented and performance improved again! Now we were down to just 10-20 millis for this query.
Interestingly, the list of articles on the homepage don’t need to be joined with the categories and so the query is a lot simpler. They do however need to contain attachment (a one-to-many relationship). Firstly, I though that the left join fetch would give me a speed up - it did with the category queries. However it actually slowed the query down. To understand why we must understand how Hibernate works. If we have an object that has an associate list as a property hibernate by default queries the object and then does a separate query for each item in the associated list of objects. So if we have an article with 5 attachments it will do a query to return the article and the ids of all the attachments and then it will do 5 queries for each attachment.
Now this behaviour can be circumvented by using the ‘left join fetch’ mechanism mention above. This way Hibernate only does one query with, you’ve guessed it, a left join; this will have only one round trip to the database (network IO is the bottleneck usually). So why isn’t this faster than the default multi query method. Well, as I was using the Query Cache it seems that a large query with a large object graph (i.e. the left join fetch) was slower to be drawn from the Query Cache than the a set of smaller queries. My empirical, unscientific, evidence suggests a 2x difference when using the Query Cache.
So in short here are the rules of thumb:
- If you have often repeated queries use the query cache
- If you are using the query cache and you have a one-to-many you will probably be best not using the ‘left join fetch’
- If you have a many-to-many then try the left join fetch method, it should be an improvement even with the query cache
- Scrollable results sets don’t have much advantage over two queries
As with any optimization work your mileage will vary. All applications are different but I hope this has give you some ideas of what you can play with.
This was meant to be a short entry and look what happened - a long rambly entry with no firm conclusion.