
June 23, 2007

Digitalizing an Old Book

I have a book published in 1985 in LNM; it is basically a xerox copy of the manuscript, with typewritten text and handwritten formulae, 242 pp. in all. So I decided to digitalize it. First I photocopied the book, covering the opposite page each time so that no page contained parasite text. A rather unpleasant but unavoidable job. Then I scanned the copies into Acrobat 7. Unfortunately the ADF on my scanner had broken and the ADF at the library was gone long ago, so the job was not extremely pleasant, especially because Acrobat previewed each page and automatically decided whether it was a BW document, a BW picture, or text/in-line art (I list only the choices it actually made). The result was a 56 MB file.

Then I ran OCR, which increased the file size slightly. The OCR was also very timid: certain perfectly clear pieces of text were not OCRed. I tried ReadIris 9 (which IMHO is superior OCR s/w), but it was too aggressive and tried to OCR even the formulae, replacing unrecognized characters by ~. Not good.

By converting the document to B/W, cropping the margins, and setting compatibility to pdf 1.6 only (aka Acrobat 7), I reduced the document drastically, to 11 MB, but it could not be handled by earlier versions of Acrobat or by Ghostscript 8.51 and thus was not usable by itself for further transformations. It was also poorly OCRed and not very nice looking. I converted it to a 17 MB PostScript file using Acrobat 7. I then converted the ps to djvu using the djvulibre converters installed on my Mac, but they provide no OCR. So instead I used a trial version of LizardTech Djvu Document Express Pro = Djvu Editor Pro 5.0. It was a job of many hours! However, in the end I got a 14 MB djvu document which looked better than the original pdf and had much superior OCR (I think LizardTech uses the ReadIris OCR engine). I also inserted clickable links into the table of contents using the same Djvu Editor Pro.
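For readers who want to script the ps-to-djvu step, here is a minimal sketch. It assumes the converter is DjVuLibre's djvudigital command (the post above does not say which djvulibre converter was used) and that a Ghostscript build with the DjVu driver is installed. The file names, the 300 dpi default, and both function names are hypothetical. Like the djvulibre step described above, it performs no OCR.

```python
import shutil
import subprocess


def build_djvudigital_cmd(ps_path, djvu_path, dpi=300):
    """Return the argument list for a djvudigital run; nothing is executed here.

    Assumption: djvudigital is the converter in use. The dpi value is a guess
    and should match the resolution the pages were scanned at.
    """
    return ["djvudigital", f"--dpi={dpi}", ps_path, djvu_path]


def convert(ps_path, djvu_path, dpi=300):
    """Run the conversion if djvudigital is on PATH; return True on success."""
    cmd = build_djvudigital_cmd(ps_path, djvu_path, dpi)
    if shutil.which(cmd[0]) is None:
        # Tool not installed: report failure instead of raising.
        return False
    return subprocess.run(cmd).returncode == 0


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    print(" ".join(build_djvudigital_cmd("book.ps", "book.djvu")))
```

Building the command separately from running it makes it easy to inspect the arguments before committing to a long conversion.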

All this was the wrong approach. Later I did the job correctly, and the current digitalization is the result of that better approach.

Posted by Victor at June 23, 2007 03:29 AM