
I choose not to choose


iNeil mini

Anything that comes to mind!

Spellbound - 3 - You probably made a hash of your tries
So now that you have an idea of what problems plague a spell checker, you need to worry about how to solve those problems.

Before I get into this, it might be a good idea to read Jon Bentley's write-up on how McIlroy wrote a spell checker (acm pdf, google booksearch), which is included in the book "Programming Pearls" (I can't recommend that book enough). The approach taken by McIlroy was quite brilliant, but the spell program can only highlight misspelt words; it can't suggest alternatives. While being able to point out a misspelling is no mean feat, the resources at our disposal have grown quite a bit since McIlroy's time, and any self-respecting spell check program today needs to be able to suggest alternatives as well!

What needs to be kept in mind is that in any spell checker the trade-offs lie in two places:
  1. Choosing how many distinct strings you need to keep in memory to be able to suggest alternatives.
    Some approaches prefer storing base lexemes (run, runs, ran and running are forms of the same lexeme, conventionally written as RUN - wikipedia) and a separate prefix/suffix list.
    Other approaches insist on storing all variations of the same word.

  2. How you go about storing the distinct strings.
    Do you want to hash the strings, or store them in trie-like data structures?
This has ramifications for both retrieval times and memory requirements. (to be continued.)
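To make the second trade-off a little more concrete, here is a minimal Python sketch that stores the same handful of words both ways. The `Trie` class, its `with_prefix` method and the `WORDS` list are names made up for this illustration; nothing here is taken from McIlroy's program or from any particular spell checker.

```python
class TrieNode:
    """One node per character; shared prefixes share nodes."""
    __slots__ = ("children", "is_word")

    def __init__(self):
        self.children = {}
        self.is_word = False


class Trie:
    def __init__(self, words=()):
        self.root = TrieNode()
        for word in words:
            self.insert(word)

    def insert(self, word):
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.is_word = True

    def _walk(self, prefix):
        """Follow prefix from the root; return the node reached, or None."""
        node = self.root
        for ch in prefix:
            node = node.children.get(ch)
            if node is None:
                return None
        return node

    def __contains__(self, word):
        node = self._walk(word)
        return node is not None and node.is_word

    def with_prefix(self, prefix):
        """Return every stored word that starts with prefix, sorted."""
        start = self._walk(prefix)
        if start is None:
            return []
        found = []
        stack = [(start, prefix)]
        while stack:
            node, text = stack.pop()
            if node.is_word:
                found.append(text)
            for ch, child in node.children.items():
                stack.append((child, text + ch))
        return sorted(found)


# Hypothetical word list: every variation of the lexeme RUN stored in full.
WORDS = ["run", "runs", "ran", "running", "rung"]

hashed = set(WORDS)   # hashing: one lookup answers "is this a word?"
trie = Trie(WORDS)    # trie: can also answer "which words start like this?"

print("runs" in hashed, "runs" in trie)
print(trie.with_prefix("run"))
```

The hash set only answers exact membership, while the trie pays for its extra nodes by being able to enumerate words that share a prefix, which is handy when you want to suggest alternatives.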
Friday August 8, 2008 - 04:26am (PDT)
Spellbound - 2 - Dot your i's and cross your t's
(This is part two of a multi-part essay on something that I have been working on)


Having overcome my initial trepidation about the insurmountability of the problem, I started looking at the various research papers that discussed spell checkers and how to implement them.
I first had to understand the kind of problems we were trying to solve. To start with, spelling mistakes are categorized by type. Broadly, these are
  1. Insertion, e.g. typing helllo for hello
  2. Deletion, e.g. typing helo for hello
  3. Substitution, e.g. typing heIIo for hello (capital I's replacing small l's)
  4. Swaps, e.g. typing ehllo for hello
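One common way to use these four categories is to undo each of them in turn and see which results land in the dictionary. The sketch below does that for a single edit only. The function name `single_edit_fixes` and the tiny `DICTIONARY` are assumptions made for illustration, not something from the papers I read.

```python
import string


def single_edit_fixes(typo, dictionary, alphabet=string.ascii_lowercase):
    """Map each dictionary word one edit away from typo to the kind of
    mistake that would explain the typo."""
    found = {}

    def consider(word, mistake):
        # Ignore edits that give back the typo itself.
        if word != typo and word in dictionary:
            found.setdefault(word, mistake)

    for i in range(len(typo) + 1):
        left, right = typo[:i], typo[i:]
        if right:
            # An inserted character is undone by dropping it.
            consider(left + right[1:], "insertion")
        if len(right) > 1:
            # A swap is undone by swapping the two neighbours back.
            consider(left + right[1] + right[0] + right[2:], "swap")
        for ch in alphabet:
            # A deleted character is undone by putting one back.
            consider(left + ch + right, "deletion")
            if right:
                # A substituted character is undone by replacing it.
                consider(left + ch + right[1:], "substitution")
    return found


DICTIONARY = {"hello", "world"}  # hypothetical two-word dictionary

for typo in ["helllo", "helo", "hezlo", "ehllo", "heIIo"]:
    print(typo, single_edit_fixes(typo, DICTIONARY))
```

Note that heIIo comes back empty: it needs two substitutions, so a generator that only undoes one edit cannot reach hello from it.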
Spelling mistakes are also often categorized by their origin
  • typing often causes substitutions with nearby keys, e.g. pem for pen
  • some people type out words as they would pronounce them, e.g. skool for school. This also happens with speech-to-text software.
  • OCRs often mistake similar-looking characters for each other, e.g. fa11 for fall (1's replacing l's). Something similar happens in leetspeak as well
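For the look-alike case, one possible approach is a small confusion table: every character that could have been misread is expanded into its look-alikes, and only the variants found in the dictionary are kept. The `LOOKALIKES` table below is a hand-picked assumption for this sketch, not a measured OCR confusion matrix, and the function names are made up.

```python
from itertools import product

# Hypothetical confusion table: character seen -> characters it may stand for.
LOOKALIKES = {"1": "lI", "0": "oO", "5": "sS", "I": "l"}


def ocr_variants(token):
    """Every string obtained by keeping each character or swapping it
    for one of its look-alikes."""
    options = [ch + LOOKALIKES.get(ch, "") for ch in token]
    return {"".join(combo) for combo in product(*options)}


def ocr_fixes(token, dictionary):
    """Variants of token that are real words, sorted."""
    return sorted(ocr_variants(token) & dictionary)


DICTIONARY = {"fall", "fail", "hello"}  # hypothetical dictionary

print(ocr_fixes("fa11", DICTIONARY))
print(ocr_fixes("heIIo", DICTIONARY))
```

The number of variants multiplies with every ambiguous character, so this brute-force expansion is only sensible for short tokens.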
When it comes to OCRs, the sources of mistakes are
  • Font issues - Some fonts (especially the calligraphic ones) are much harder for OCRs to decipher correctly. This is even without bringing handwriting recognition into the equation.
  • Glyph to character conversions - Glyph sequences are often badly construed; for example, mississippi can very easily become nnjssis5lppi or any other similar combination.
  • Visual misrepresentation - Even if the glyphs are correctly understood, sometimes the context is so confusing that there is no way to know what the right answer is, e.g. it's impossible to know if hell0123@yahoo.com is hell+0123 or hello+123
Add to this potboiler the unique problems specific to Indian languages, and you have a Hitchcockian mystery. For people who don't know how Indian languages work, here are some bits to chew upon.
  1. Even though Indian language scripts go left to right, sometimes, in some languages like Hindi, some characters that are stored after other characters appear ahead of them. See example 1
  2. Sometimes, some characters only appear below or above other characters. See example 2 and 3
  3. Sometimes, two or more characters become conjoined and appear in completely different forms. See example 4.
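These three points can be seen at the level of Unicode code points using Python's standard `unicodedata` module. The Devanagari sequences below are my own picks for illustration and are not necessarily the ones shown in the examples referred to above; the helper name `code_points` is made up.

```python
import unicodedata


def code_points(text):
    """List (code point, Unicode name) for each stored character, in storage order."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in text]


# Point 1: stored as KA then VOWEL SIGN I, but the vowel sign is drawn to the left of KA.
ki = "\u0915\u093F"

# Point 2: VOWEL SIGN U is drawn below KA, VOWEL SIGN E above it.
ku = "\u0915\u0941"
ke = "\u0915\u0947"

# Point 3: KA + VIRAMA + SSA is normally drawn as a single conjunct shape.
ksha = "\u0915\u094D\u0937"

for sample in (ki, ku, ke, ksha):
    print(sample, code_points(sample))
```

What the spell checker sees is the stored sequence, while what the OCR sees is the drawn shape, and the two do not line up character for character.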

I must admit that even at this point, knowing the beast did not make me any less apprehensive about it!
Tags: spellcheck, spellboundj, indraneel, indraneelsikdar, indraneil, hindi
Sunday December 16, 2007 - 10:27am (PST)
Spellbound - 1 - Into the frying pan!
(This is part one of a multi-part essay on something that I have been working on)

The people doing research under him refer to Prof. CVJ as Jaws. The name very aptly reminds me of the movie. As a kid, I was scared by it. As a full grown adult, CVJ still gives me the chills.

But there I was, standing outside his ridiculously cold office hoping to get an audience with him. He is primarily interested in Computer Vision and Machine Learning but had taken a course on programming for us M.Tech students. To this day, I shudder to think how much he made us bleed on that one course. However, I needed extra credits to pass in time and I was hoping he could give me a project that would fetch me the extra credits.

Nothing too complicated I hoped. Looking back, I can’t imagine why I ever thought so!

The project CVJ had to offer blew my mind. He was working on this grand project called Indian language OCR. As part of a massive project akin to Google Book Search, the Govt. of India was on a mission to digitize a huge number of books in Indian languages. IIIT-H was part of the project to scan those books. However, the real value of such an exercise is in being able to search through the text, for which it is important to extract the text from the scanned images. Hence an Indian language OCR.

There is just one problem - It's impossible to have an Indian Language OCR

This is because there were way too many Indian languages all doing their own thing (and the thing included a different script, and hence different unicode ranges and different fonts, a different grammar etc.).
There were other people who were trying their hand at this problem. Most of them had come to the conclusion that they were better off creating OCRs for specific languages (e.g. ISI Kolkata was working on a Bengali OCR and IIT Madras was working on a Tamil OCR)

I don’t think CVJ ever stopped to consider any of these things. He had an entire team of people working on this problem for several years. They had painstakingly built a system that, with some human input, could serve as a multi-language OCR.

The multi-language OCR did an average job on most languages. As he explained, if an OCR was right 90% of the time (which means that it got 9 out of 10 characters right), it's still not good enough. Heck, discounting the spaces, I had written 2150 characters and 441 words by the end of this sentence (all in one page). That means the given OCR would have made 215 mistakes already.
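The arithmetic is easy to check, and it looks even worse when counted in words rather than characters. The sketch below assumes every character is misread independently with the same 10% probability and that all words have the average length; neither assumption comes from CVJ, they are just the simplest model to compute with.

```python
chars = 2150          # characters written so far, spaces not counted
words = 441           # words written so far
char_accuracy = 0.90  # the OCR gets 9 out of 10 characters right

# Expected number of misread characters.
expected_char_errors = round(chars * (1 - char_accuracy))

# Assumption: independent errors and words of average length, so a word
# comes out clean only if every one of its characters does.
avg_word_len = chars / words
word_accuracy = char_accuracy ** avg_word_len
expected_bad_words = words * (1 - word_accuracy)

print(expected_char_errors)
print(f"{avg_word_len:.2f} characters per word on average")
print(f"{word_accuracy:.0%} of words expected to be fully correct")
print(f"about {expected_bad_words:.0f} words with at least one mistake")
```

Under these assumptions only about six words in ten survive intact, which is why a spell checker working on whole words has plenty to do.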

A typist could do an order of magnitude better job on this than the program, which then simply was not good enough.

Hence, he needed me to improve the performance of the OCR, by building a multi-language spell checker on top of the multi-language OCR!!!

The spell checker was to validate the words suggested by the OCR and, wherever necessary, correct them.


I would have preferred to be hit by a 100 tonne truck!


Tags: spellbound, cvj, ineil, indraneelsikdar, indraneil
Tuesday November 6, 2007 - 05:02am (PST)
Press any key to exit this chaos
In August, my wife (Debi) traveled to the US. She brought back a laptop for her sister. It was a 14" HP DV2500t with Vista Home Premium preloaded.
My sis-in-law wanted an HP, and though none of us had ever used Vista, we had no choice since HP had no XP laptop to offer in the home user segment.

Back in India, Debi handed it over to me to play with before her sis picked it up from us. I was really impressed with what I saw. The laptop was comfortable to use and Vista did ok with the rest of the software I loaded onto it (some IM, some music, some freeware etc.)
It also did great with the optical mouse and pen drives I picked up for my sis-in-law.

And then the pain started. Debi called up to tell me that she could not log in to the laptop since the keyboard was acting up. I was mystified since it had not happened for me. After struggling with various key combinations for an hour or so, she finally got in.
She found that every once in a while, after booting, the laptop starts acting as if the CTRL key is pressed down.

I searched the net and found that HP laptops have had a history of keyboard problems.
Solutions range from reinstalling the OS or the bootloader to entering a bizarre sequence of keystrokes. My sis-in-law was inconsolable and HP India was of no help either! (we did not have a warranty that covered India)

I contemplated our options. We could
  • buy an external keyboard and see if that works
  • actually try and see if reloading the OS/Bootloader helps
  • keep looking for the specific key presses that unlock the keyboard
  • try and substitute the keyboard with some other way to enter text
I sat up one night and decided that if it was possible to create a virtual keyboard on the monitor, such that key presses were simulated by clicking the mouse and the text generated was made available via some textbox to be copied and pasted for use elsewhere, we might be able to help out my sis-in-law in the short run.

I built a javascript based application. Javascript looked like the best choice since the laptop had no programming software installed, but it did have a browser preinstalled. Also asking my sis-in-law to do anything complicated would defeat the purpose (of keeping it painless!)
The demo of what I built can be seen here.
The code took me like 5 hours to build (even though I am a javascript noob!), thanks to the internet. After using it, I can see why the mouse will never be my 1st choice of input device. It's so very slow!!!!

What I found more difficult was setting up this demo. I needed to find
  1. a screen recording software
  2. a video editing software
  3. a video hosting software
I chose Camstudio for the 1st one and Jumpcut for the other two since it clubbed them both into one. I am quite pleased with camstudio and am a trifle bewildered with Jumpcut. However, I will keep fiddling around a bit more.
If this goes through fine, who knows, I might do more such demos!

Updated to add: Debi reports that she found something weird. Supposedly, if you reach the log-off screen on the laptop and then cancel the operation, you end up in a situation where the keyboard starts functioning, but this time, the CTRL key does not work! So you gotta choose one of them! This might be an OS issue after all!

Further updated: In case there is anyone who could use the software I wrote, feel free to drop me a line. It's not fancy, but it works ok!
Tags: keyboard, mouse, javascript, indraneel, indraneil, indraneelsikdar
Tuesday October 30, 2007 - 12:05am (PDT)
Swallow pride with two spoons of castor oil

Imagine that it’s a bright and sunny day. You wake up feeling good and eager to get back to work. You reach office; look up your emails and BAM!

You got steam rolled, your work - a train wreck!

The reason you felt good in the morning was because you had spent a huge amount of blood, sweat and tears to create that elegant solution to that massively difficult problem. It was a problem unvanquished by other stalwarts. No one had any idea really how to go about it.

You however were sure there was a way out. You harangued the problem to death, until a solution was obvious. It was a piece of art, so elegant, so sparse and yet so robust! You had every right to feel proud as you sent out the code with a mail that explained it to the rest of the people. That was yesterday…

Today, you are standing in disbelief. Someone sitting high up on a mountain smote down your magnum opus with a careless flick. He took like 10 minutes to write a small and brutal mail that rejects your work as incompetent!

Two weeks down the line, you feel your head will explode! Some cretin has implemented something that looks like an uncanny copy of what you had done earlier. In fact it’s an imperfect copy, a subset if you will. This time that same guy sitting on the mountain has accepted this as a good piece of work. This johnny-come-lately is running away with your work!!!

Now stop imagining. How often do you think this happens?


Looks like something similar happened to Con Kolivas.

He is an Australian doctor who is a long time contributor to the Linux kernel. He was interested in making Linux do better on personal computers. He realised that one of the problems was that Linux projects were sponsored by large corporations interested in huge servers. This meant that code was optimized for servers at the expense of desktop performance. One major bottleneck was process-scheduling algorithms, and he spent tonnes of energy to work on this problem and build a scheduler called Staircase deadline (SD).

Unfortunately, Linus was not so impressed with Con’s work.

Some days down the line, Ingo Molnar, another kernel hacker, wrote the Completely Fair Scheduler (CFS), which takes a lot of ideas from SD, and Linus accepted that as a good fix.

In the world of the Linux kernel, what Linus says, goes! Linus of course had his reasons.
Not everyone agrees with Linus’ arguments though.

Hurt and defeated, Con has decided to quit working on kernel enhancements. He gave a detailed interview explaining what is wrong with the current state of affairs.

Supposedly, some time back, something similar happened on the device manager front, when udev pushed out devfs and Richard Gooch, the developer of devfs, stopped contributing to the kernel forever.

This is of course not to say that such things happen only in the confused world of open-source. While in the open-source world it's out in the open, I guess it happens everywhere.


It happened to me as well. Not so long ago, we had a problem of substantial complexity. It took me 4 months to come up with a solution that I knew worked. Early prototype results were very promising. In my case, the big guys not just wrote off my solution, they proposed a solution of their own.

This solution was a subset of my solution. My solution was rejected as overly complex. In fact they decreased the complexity of the problem by decreasing the scope of the problem.

Some of the most interesting parts of my solution were trimmed off for something that’s better understood in our team. This brought down the performance but became more maintainable.

To add insult to injury, I was asked to implement the same (you don’t expect big shots to do the spade work, do you?). For close to a week, I fumed, but to no avail.

I eventually went ahead and implemented the solution against my better instincts. The solution did not rebound on us, but today it needs several enhancements to support the things we left out then.

I sometimes wonder what went wrong.

  • Was I not forceful enough when it came to emphasizing what I saw as strengths of my design and weaknesses of the 2nd design?
  • Did I lack benchmarks that would have proved my points better?
  • Did I in any way rub the people who mattered the wrong way, thus raising their hackles?
  • Did I actually over-engineer the problem, thus forcing the rest of the guys to cut it down to size, perhaps a tad badly?

I think it was a mix of all of these. The mess hurt for a while, but I moved on. I can’t afford to stop working over ego issues. However, I am still looking to understand how to avoid recurrences of such incidents.

ps: I hope Con Kolivas goes back to hacking, as does Gooch!

ps2: Shameless plug - my website!
