Hindi in that Cloud

In my previous post I noted that Tagcloud could not generate a tag cloud for the Hindi blogdom because the Yahoo Term Extraction API does not yet recognize non-English characters. So I decided to implement it myself. This is how I do it:

  1. Parse the Hindi blog group RSS feed
  2. Extract the words, ignoring the very commonly used ones
  3. Insert the data (word and frequency of occurrence) into the database
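
For the curious, the three steps above can be sketched roughly as follows. This is only an illustration in Python, not the code that actually runs behind the page: the stop-word list, the `word_freq` table, the function names, and the choice of SQLite are all assumptions made for the example, and it assumes a plain RSS 2.0 feed with `item`, `title` and `description` elements.

```python
import re
import sqlite3
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical stop-word list; a real one would be much longer.
STOP_WORDS = {"है", "का", "की", "के", "और", "में", "से", "को", "पर", "यह"}

# A run of Devanagari characters counts as one word. The danda and double
# danda (U+0964, U+0965) are punctuation, so they are left out of the range.
WORD_RE = re.compile(r"[\u0900-\u0963\u0966-\u097F]+")


def count_words(rss_xml):
    """Steps 1 and 2: parse the feed and count words that are not stop words."""
    root = ET.fromstring(rss_xml)
    counts = Counter()
    for item in root.iter("item"):
        text = " ".join(
            item.findtext(tag) or "" for tag in ("title", "description")
        )
        for word in WORD_RE.findall(text):
            if word not in STOP_WORDS:
                counts[word] += 1
    return counts


def store_counts(conn, counts):
    """Step 3: add the counts to the (assumed) word_freq table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS word_freq "
        "(word TEXT PRIMARY KEY, freq INTEGER NOT NULL)"
    )
    for word, freq in counts.items():
        # Increase the count of a known word, otherwise insert a new row.
        cur = conn.execute(
            "UPDATE word_freq SET freq = freq + ? WHERE word = ?", (freq, word)
        )
        if cur.rowcount == 0:
            conn.execute(
                "INSERT INTO word_freq (word, freq) VALUES (?, ?)", (word, freq)
            )
    conn.commit()
```

Calling `store_counts(sqlite3.connect("tagcloud.db"), count_words(feed_text))` after each fetch would keep a running total per word; the file name `tagcloud.db` is, again, just a placeholder.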

The data from the database is available as XML and as this JSP page. If you want to see it in action, look for the “Kya bolte hum?” section at Chittha Vishwa.

The sad part is that the solution only considers single words and is not intelligent enough to decipher phrases. Perhaps this is why I was happy to notice that Technorati now provides a Blog Post Tags service, where a query returns the most frequently used tags in a blog. However, for some strange reason the query never works for me. I tried it for this blog as well as several other Hindi blogs, but the XML returned is always empty. At first I thought it only works for WordPress blogs or blogs that use Technorati Tags; however, for some blogs, like this one, it does work. A missive to Technorati did not fetch any reply; their blog post, OTOH, indicates that the service is “available only on request”. I hope they read this post and tell me what is happening. If their solution works, my tag cloud could perhaps be generated more efficiently.

TagCloud, if it could work with Hindi

Tagcloud seems interesting; it tells you the crux of the conversations in blogdom, pretty much like Technorati tags. There are 80+ Hindi blogs now, and I thought, why not sport one such cloud for these blogs at Chittha Vishwa? Alas, the effort failed. The blame lies with the Yahoo Term Extraction API, which as of now only recognizes English words. I tested here the speed with which it extracts terms from even a large amount of text. Try entering some Hindi text and you will see that it is unable to recognize the terms. The thing I like is the simplicity of REST. BTW, do have a look at the Tagcloud from my Hindi blog as well as this blog (at least it will get the English words from my Hindi blog) on the left sidebar.

While I intended to do it for pure fun, as did Desipundit, people are sceptical about the usefulness of these tags. While Tagcloud ranks these tags to prepare the cloud, Simon used it to extract terms for “automated tagging”, though the results are not guaranteed to be relevant.
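
To give an idea of what ranking tags into a cloud can look like, here is a small Python sketch. It is not how Tagcloud itself does it (I have no idea what they use internally); the function name `font_sizes`, the pixel range and the logarithmic scale are my own assumptions, the log scale being a common choice so that a handful of very frequent words do not dwarf everything else. Frequencies are assumed to be 1 or more.

```python
import math


def font_sizes(freqs, min_px=12, max_px=32):
    """Map word frequencies (all >= 1) to font sizes on a log scale."""
    if not freqs:
        return {}
    lo = math.log(min(freqs.values()))
    hi = math.log(max(freqs.values()))
    span = hi - lo
    sizes = {}
    for word, freq in freqs.items():
        # When every word is equally frequent, use the middle of the range.
        weight = (math.log(freq) - lo) / span if span else 0.5
        sizes[word] = round(min_px + weight * (max_px - min_px))
    return sizes
```

The least frequent word gets `min_px`, the most frequent one gets `max_px`, and everything else lands in between.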

Hindi blogs showcased on DesiPundit

Patrix, who wanted the group blog DesiPundit to have some desi flavor true to its name, was kind enough to invite me to be a co-blogger. I was keen on Anup doing it, as he reads more Hindi blogs than I do and manages Chittha Charcha, a monthly round-up of the major happenings in Indian blogdom. He was reluctant but assured me of his support. That prompted me to join the wagon. With blogging on my own blogs in dismal shape, I hope I will be able to bring forth the best of the Hindi blogosphere through DesiPundit.