I read about the plan via a tweet. And good ol’ me is excited. There has been far too much talk about this and that, too many government departments involved, and too many large chunks of grants disbursed to make this happen. Nothing much has happened. If a collective of interested people willing to invest some of their time gets together, then this is amazingly simple to achieve.
The problem that a ‘crowd-sourcing’ effort can overcome is the heavy dependence on OCR. The last time I checked, the IndicOCR pieces that are available returned an accuracy of just 80%. It sounds nice, but in practice it is awkward.
Consider this: a 100-page book would have only around 80 pages’ worth of text correctly recognized. Most of the target books would be around the 180-200 page mark. That’s a whopping 36 to 40 pages’ worth of incorrect data, and that requires increased focus on proof-reading. The alternative to the effort is based on the ‘labor is cheap’ concept. In other words, employ enough folks to actually key in the book. Input has become reasonably easy in recent times: there are keyboard layouts for Indian languages in the distributions, and there are web-based applications too. It may not ‘scale’ but it could work. A long time back, this was attempted in collaboration with an NGO who were keen on using the opportunity to teach less-privileged folks how to work with computers and Linux! It was painstaking but it was worth it.
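Going back to the page arithmetic for a moment, it fits in a few lines of Python. This is only a back-of-the-envelope sketch: the function name is my own, and it assumes the 80% figure can be read as errors spread evenly across pages, which is a simplification of how OCR accuracy is usually measured.

```python
def pages_needing_rework(total_pages: int, accuracy: float = 0.80) -> int:
    """Estimate how many pages' worth of text the OCR gets wrong.

    Assumption: accuracy is treated as a flat fraction of the book,
    so a 0.80 accuracy means 20% of the pages need to be redone.
    """
    if not 0.0 <= accuracy <= 1.0:
        raise ValueError("accuracy must be between 0 and 1")
    return round(total_pages * (1.0 - accuracy))


# Typical book lengths mentioned above
for pages in (100, 180, 200):
    print(pages, "pages ->", pages_needing_rework(pages), "pages' worth of errors")
```

Plugging in a better accuracy figure shows how quickly the proof-reading load drops, which is why the quality of the OCR matters so much for an effort like this.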
Either way, I am game for this. I think I can put my name forward to proof-read at least half a dozen Bengali books this year. And I know just the person to suggest titles (besides the fact that archive.org does have some surprising out-of-copyright books scanned and stored).