In my previous post I mentioned that Tagcloud could not generate a tag cloud for the Hindi blogdom, since the Yahoo term extraction API doesn't recognize non-English characters yet. So I decided to implement it myself. This is how I do it:

  1. Parse the Hindi blog group RSS feed
  2. Extract the words, ignoring the very commonly used ones
  3. Insert the data (each word and its frequency of occurrence) into the database
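
For the curious, the three steps above can be sketched roughly like this. This is only an illustration in Python, not the code that actually runs behind the page: the stop word list, the `tag_words` table, the function names and the sample feed are all made up for the example, and I am assuming a plain RSS 2.0 feed and an SQLite database.

```python
import re
import sqlite3
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical stop list; a real one would be much longer.
STOP_WORDS = {"है", "और", "का", "की", "के", "में", "से", "को", "पर", "यह"}

# Runs of Devanagari letters and vowel signs. The danda (U+0964, U+0965)
# and the Devanagari digits are deliberately left out of the range.
WORD_RE = re.compile(r"[\u0900-\u0963\u0972-\u097F]+")

# Made-up feed, standing in for the real blog group feed.
SAMPLE_FEED = """<rss version="2.0"><channel>
<item><title>हिंदी चिट्ठा</title><description>यह चिट्ठा हिंदी में है।</description></item>
<item><title>नया चिट्ठा</title><description>चिट्ठा और कविता</description></item>
</channel></rss>"""


def count_words(rss_xml):
    """Steps 1 and 2: parse the feed and count every word that is not a stop word."""
    root = ET.fromstring(rss_xml)
    counts = Counter()
    for item in root.iter("item"):
        text = " ".join(
            item.findtext(tag, default="") for tag in ("title", "description")
        )
        counts.update(w for w in WORD_RE.findall(text) if w not in STOP_WORDS)
    return counts


def store_counts(conn, counts):
    """Step 3: write each word and its frequency to the (assumed) tag_words table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS tag_words "
        "(word TEXT PRIMARY KEY, frequency INTEGER NOT NULL)"
    )
    conn.executemany(
        "INSERT OR REPLACE INTO tag_words (word, frequency) VALUES (?, ?)",
        counts.items(),
    )
    conn.commit()


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    store_counts(conn, count_words(SAMPLE_FEED))
    for row in conn.execute(
        "SELECT word, frequency FROM tag_words ORDER BY frequency DESC"
    ):
        print(row)
```

Matching on the Devanagari Unicode range is the crude part: it splits on anything outside the script, so it only ever sees single words, and the quality of the cloud depends almost entirely on how good the stop list is.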

The data from the database is available as XML and as this JSP (see the frame below for a glimpse). If you want to see the page in action, look for the “Kya bolte hum?” section at Chittha Vishwa.

Sad part is, the solution only considers single words and is not intelligent enough to pick out phrases. Perhaps this is why I was happy to notice that Technorati now provides a Blog Post Tags service, where a query returns the most frequently used tags in a blog. However, for some strange reason the query never works for me: I tried it for this blog as well as several other Hindi blogs, but the XML returned is always empty. At first I thought it only works for WordPress blogs or blogs that use Technorati Tags; however, for some blogs, like this one, it does work. A missive to Technorati did not fetch any reply; their blog post, on the other hand, indicates that the service is “available only on request”. I hope they read this post and tell me what is happening. If their solution works, my tagcloud could perhaps be generated more efficiently.