Archive for September, 2005

New Statistics Package

Friday, September 30th, 2005

I use awstats for my site stats, and it has coped less and less well as traffic has increased. It’s getting to the point that I can’t run it anymore, as it uses half my server’s memory and 80% of the CPU, causing everything else to suffer. Are there any stats packages out there that can cope with processing more data? I mean free ones.

Update 1: awstats just managed to starve every other process of resources, so much so that it caused PostgreSQL and Tomcat to crash…
Update 2: it would seem awstats’ problem was down to it being the end of the month - as it deals with its stats in a monthly fashion, the build-up of data had managed to bring it to a crawl - it’s still not ideal but gives me something to investigate to try and improve matters.

Hibernate Success

Thursday, September 29th, 2005

So after all my tweaking, did my Hibernate effort pay off? YES! Here is some log output from my production server (sounds a lot grander than it is) when I turned on DEBUG for the Hibernate management code:


getMostRecentlyPublished took 54 millis
getAllCategories took 3 millis
getPublishedBefore took 19 millis
getPublishedBefore took 4 millis
getMostRecentlyPublished took 85 millis
getAllCategories took 4 millis
getMostRecentlyPublished took 55 millis
getPublishedBefore took 2 millis
getPublishedBefore took 5 millis
getAllCategories took 4 millis
getMostRecentlyPublished took 54 millis
getPublishedBefore took 69 millis
getPublishedBefore took 26 millis
getAllCategories took 3 millis
getMostRecentlyPublished took 86 millis
getPublishedBefore took 4 millis
getPublishedBefore took 4 millis
getAllCategories took 3 millis
getMostRecentlyPublished took 56 millis
getPageOfPublishedArticlesByCategory(Law, 0) took 86 millis
getAllCategories took 3 millis
getMostRecentlyPublished took 57 millis
getPublishedBefore took 59 millis
getPublishedBefore took 37 millis
getAllCategories took 4 millis
getMostRecentlyPublished took 58 millis

I’m amazed quite how fast some of these queries run, considering the machine is just a single-processor Pentium 3 with 384M of RAM. Most of these queries are running 20x faster; some are running 100x faster. Load average on the machine has dropped from about 4 to around 0.6 [load average is the number of processes competing for the CPU averaged over a window of time - usually 1 minute, 5 minutes and 15 minutes].

Now if I could get my hands on some proper hardware I could try out some funky JBoss Cache caching.

Harvey Danger

Thursday, September 29th, 2005

I just downloaded an album off the Interweb using BitTorrent without paying for it! I’m listening to it now - it’s not too bad.

STOP, before you call the authorities and have me hauled in front of a court of record company executives, let me explain. The artists involved, Harvey Danger, are behind the free release of their album. They have decided to release a high quality, DRM unencumbered version of their album ‘Little By Little’ on MP3 and Ogg. They have chosen to use BitTorrent as the distribution means, presumably to reduce their bandwidth bills. They do ask, if you like the album, that you either donate on their website or buy the old school physical CD in a music shop (how olde worlde).

Slowly but surely it seems the record industry, or at least the artists, are joining the 21st century. Consumers don’t want DRM and most intelligent people are sick of being told what to like by radio stations and MTV.

Read about why they decided to do this in their own words.

Zimbra

Thursday, September 29th, 2005

Anyone who has suffered the abject misery of using Microsoft’s Outlook Web Access will be interested in this new Open Source offering - Zimbra. Zimbra uses AJAX, among other things, to provide a rich groupware application (email, calendar, tasks and so on) direct to your browser. The server side component also supports the open standards IMAP, POP3 and iCal as well as Microsoft’s MAPI (whatever).

A full demo is on their website.

Some Hibernate Optimization Rules of Thumb

Sunday, September 25th, 2005

So I decided to have a go at further optimizing The Humor Archive Hibernate code. The first thing I did was to investigate Query Caching. Query Caching is different to the 2nd level cache in Hibernate (which defaults to EHCache, which I have blogged about before) in that the Query Cache keeps a store of previous queries and the results returned. Internally it is not dissimilar to a HashMap keyed on the SQL Query String and valued on the object graph returned, although it is slightly cleverer than this, as it knows when to update stored object graphs when other Hibernate queries modify the objects within said graph.
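To make the HashMap analogy concrete, here is a toy sketch in plain Java. It is not Hibernate's implementation: the class name ToyQueryCache, its methods and the "entity name" strings are all made up for illustration, and it simply throws away stale results when an entity changes rather than updating them in place.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Supplier;

// Toy illustration only: a map from query string to cached results,
// plus a crude "something changed, forget what depended on it" rule.
class ToyQueryCache {
    // Cached result lists, keyed on the query string.
    private final Map<String, List<String>> results = new HashMap<>();
    // The (hypothetical) entity name each cached query depends on.
    private final Map<String, String> queryToEntity = new HashMap<>();
    // Counts how often we had to go to the "database" (the loader).
    int databaseHits = 0;

    List<String> query(String queryString, String entity, Supplier<List<String>> loader) {
        List<String> cached = results.get(queryString);
        if (cached != null) {
            return cached; // cache hit: no database round trip
        }
        databaseHits++;
        List<String> fresh = new ArrayList<>(loader.get());
        results.put(queryString, fresh);
        queryToEntity.put(queryString, entity);
        return fresh;
    }

    // Called when a row of the given entity is modified:
    // every cached query that depends on that entity is dropped.
    void entityModified(String entity) {
        queryToEntity.entrySet().removeIf(entry -> {
            if (entry.getValue().equals(entity)) {
                results.remove(entry.getKey());
                return true;
            }
            return false;
        });
    }
}
```

Running the same query string twice hits the loader once; after entityModified for that entity, the next call goes back to the loader.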

So I implemented Query Caching for some of the more expensive queries on the site and set up some timers in the code to let me know how long they were taking. One expensive query dropped from taking 500 millis down to about 50 millis. This was an enormous win for me.
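For anyone wanting to set up similar timers, a minimal sketch might look like the following. The QueryTimer class and the timed method are hypothetical names, not the code running on the site, and it uses modern Java lambdas for brevity.

```java
import java.util.function.Supplier;

// Hypothetical helper: wraps a piece of work and reports how long it took.
class QueryTimer {
    // Last message produced, kept so callers (or tests) can inspect it.
    static String lastMessage;

    static <T> T timed(String name, Supplier<T> work) {
        long start = System.nanoTime();
        try {
            return work.get();
        } finally {
            // Convert elapsed nanoseconds to whole milliseconds.
            long millis = (System.nanoTime() - start) / 1_000_000;
            lastMessage = name + " took " + millis + " millis";
            System.out.println(lastMessage);
        }
    }

    public static void main(String[] args) {
        // Stand-in for a real DAO call.
        int count = timed("getAllCategories", () -> 12);
        System.out.println("categories: " + count);
    }
}
```

Using System.nanoTime rather than the wall clock avoids odd readings if the system time is adjusted while a query is running.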

Now, you may be thinking ‘500 millis, that’s a long query’ and you would be right. Basically, the category pages were returning all the articles within that category (100-500 articles). Not only that, but the article->category relationship was many-to-many: an article can be in many categories and a category can contain many articles. Many-to-many relationships are notoriously expensive due to the fact that there is potentially a Cartesian product of a Cartesian product (indexing avoids this), but still it’s not cheap. Compounding this, I must have been in a hurry when I wrote the query, as it used a sub-query. Well, actually it was using the elements ‘function’ of Hibernate’s HQL, which in the PostgreSQL dialect manifests itself as a subquery.

Rewriting this query to use the ‘left join fetch’ mechanism sped the query up from around 1000 millis (yeah, I know) to a more reasonable 150 millis (still too slow in my book).

So back to the huge number of articles being returned. I decided, or rather got around to, implementing pagination (pagination is putting a list of things onto many pages and listing the page numbers at the bottom - like search engines do - goooooogle). The pain with pagination is you need to know: the number of pages, the page you are on, whether it’s the first page or the last page, and the number of results on a page. To know the number of pages you need to find the ceiling of the number of results divided by the number of results per page (rounding down would lose a partial last page). You could implement this using two queries, one to count the number of results and one to return the page (using offsets and limits in PostgreSQL). However, it’s also possible to use a scrollable result set to do this - performance is about the same as two queries but code complexity is lower. This is the scheme I implemented, and performance improved again! Now we were down to just 10-20 millis for this query.
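The pagination arithmetic is easy to get subtly wrong, so here is a small sketch of it. The Pagination class and its method names are made up for illustration, and pages are assumed to be numbered from zero. The page count is the result count divided by the page size, rounded up so that a partly filled last page still counts.

```java
// Hypothetical helper for the pagination sums; pages are zero-based.
class Pagination {
    // Number of pages needed: integer division rounded up.
    static int pageCount(int totalResults, int pageSize) {
        if (pageSize <= 0) {
            throw new IllegalArgumentException("pageSize must be positive");
        }
        if (totalResults < 0) {
            throw new IllegalArgumentException("totalResults must not be negative");
        }
        return (totalResults + pageSize - 1) / pageSize;
    }

    // Row offset of the first result on a page (what an OFFSET clause would take).
    static int offset(int page, int pageSize) {
        return page * pageSize;
    }

    static boolean isFirstPage(int page) {
        return page == 0;
    }

    static boolean isLastPage(int page, int totalResults, int pageSize) {
        return page >= pageCount(totalResults, pageSize) - 1;
    }

    public static void main(String[] args) {
        // 101 articles at 20 per page need 6 pages, the last holding one article.
        System.out.println(pageCount(101, 20)); // prints 6
        System.out.println(offset(5, 20));      // prints 100
    }
}
```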

Interestingly, the list of articles on the homepage doesn’t need to be joined with the categories and so the query is a lot simpler. The articles do however need to contain attachments (a one-to-many relationship). At first, I thought that the left join fetch would give me a speed up - it did with the category queries. However, it actually slowed the query down. To understand why, we must understand how Hibernate works. If we have an object that has an associated list as a property, Hibernate by default queries the object and then does a separate query for each item in the associated list of objects. So if we have an article with 5 attachments it will do a query to return the article and the ids of all the attachments and then it will do 5 queries, one for each attachment.

Now this behaviour can be circumvented by using the ‘left join fetch’ mechanism mentioned above. This way Hibernate only does one query with, you’ve guessed it, a left join; this will have only one round trip to the database (network IO is the bottleneck usually). So why isn’t this faster than the default multi-query method? Well, as I was using the Query Cache, it seems that a large query with a large object graph (i.e. the left join fetch) was slower to be drawn from the Query Cache than a set of smaller queries. My empirical, unscientific evidence suggests a 2x difference when using the Query Cache.
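The trade-off can be put as a back-of-envelope counting model. This is only arithmetic, not a measurement and not Hibernate code: the FetchCost class is made up, and the real number of follow-up queries depends on the mapping and fetch settings. What it shows is that a join fetch buys a single round trip at the price of a wider result, since a left join repeats the article's columns on every attachment row.

```java
// Hypothetical counting model for the two fetching strategies.
class FetchCost {
    // Default behaviour as described above: one query for the article,
    // then one follow-up query per lazily loaded item.
    static int lazyRoundTrips(int followUpQueries) {
        return 1 + followUpQueries;
    }

    // Join fetch: always a single query, however many attachments there are.
    static int joinFetchRoundTrips() {
        return 1;
    }

    // Rows returned by the left join: one per attachment, or a single row
    // with empty attachment columns when the article has none.
    static int joinFetchRows(int attachments) {
        return Math.max(1, attachments);
    }

    public static void main(String[] args) {
        // The article with 5 attachments from the example above.
        System.out.println(lazyRoundTrips(5));     // prints 6
        System.out.println(joinFetchRoundTrips()); // prints 1
        System.out.println(joinFetchRows(5));      // prints 5
    }
}
```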

So in short here are the rules of thumb:

- If you have often-repeated queries, use the query cache
- If you are using the query cache and you have a one-to-many, you will probably be best not using the ‘left join fetch’
- If you have a many-to-many then try the left join fetch method; it should be an improvement even with the query cache
- Scrollable result sets don’t have much performance advantage over two queries

As with any optimization work, your mileage will vary. All applications are different, but I hope this has given you some ideas of what you can play with.

This was meant to be a short entry and look what happened - a long rambly entry with no firm conclusion.

Innocent in London

Thursday, September 22nd, 2005

This is ridiculous. Of course it is the politicians that are to blame; the police are just doing their job and appear to be fairly embarrassed by the whole thing.

meebo.com

Friday, September 16th, 2005

Found this great website (although it’s still in alpha) named meebo.com. Basically, it is a web-based Instant Messaging client. It supports the MSN, AIM, ICQ and Yahoo Messenger protocols, which are all the major ones excluding Jabber and Google Talk. The beauty of it is that it uses AJAX to give a rich user experience. So you get realtime feedback on who’s online, what messages people are sending you and so on. It uses standard windowing metaphors, so you can move the messenger client and associated windows around the page as you desire.

Anyway, it’s cool, try it!

RSS Feeds

Thursday, September 15th, 2005

If the hype around RSS and Blogging wasn’t already insane, it seems to be changing up a gear. First off it was the launch of Google’s blog search, then the mysterious Flock Browser, which seems to have integration for just about every piece of blog software out there as well as Flickr support.

I added RSS support to The Humor Archives about 6 months ago and to start with the feed didn’t get too much traffic - perhaps a couple tens of requests a day, but this month things have gone mad with over 7,000 in 15 days. This is a good thing ™ but a bit of a surprise.

So anyway, I’ve been collecting up as many good RSS feeds as I can. Let me know your favs…